Table of Papers 2024

Each entry below lists the arXiv ID and title, followed by Authors, Abstract, What, Why, How, Result, Limitations/Future Work (LF), and Tags.

2406.07550 | An Image is Worth 32 Tokens for Reconstruction and Generation
Authors: Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
Abstract: Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves competitive performance to state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, outperforming the MaskGIT baseline significantly by 4.21 on the ImageNet 256 x 256 benchmark. The advantages of TiTok become even more significant when it comes to higher resolution. On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the image tokens by 64x, leading to a 410x faster generation process. Our best-performing variant can significantly surpass DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
What: This paper introduces TiTok, a novel 1D tokenization method that represents images as compact 1D latent sequences for efficient image reconstruction and generation, breaking away from traditional 2D grid-based representations.
Why: Existing 2D tokenization methods struggle to handle redundancies in images, limiting their ability to create highly compressed representations. TiTok overcomes this by leveraging the inherent redundancy in images to achieve significantly more compact and efficient representations.
How: TiTok utilizes a Vision Transformer (ViT) encoder and decoder with a vector quantizer. It encodes image patches concatenated with latent tokens into a 1D sequence, which is then quantized. A ViT decoder reconstructs the image from these quantized tokens and mask tokens.
Result: As few as 32 tokens can effectively represent an image, achieving reconstruction performance comparable to 2D methods using 256 tokens. Scaling up the tokenizer model size allows for even more compact representations without sacrificing performance. 1D tokenization leads to faster and better generative training, achieving competitive FID scores with significantly reduced training and inference time.
Limitations/Future Work: The current implementation primarily focuses on VQ tokenization and a Masked Transformer generator; exploring other tokenizer/generator combinations is left for future work. While the paper demonstrates results on image data, extending the applicability of 1D tokenization to other modalities such as video is a potential future direction.
Tags: image tokenization, 1d representation, image generation, vision transformer, vector quantization

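To make the 1D tokenization idea above concrete, here is a minimal PyTorch sketch of the encoder side: patch embeddings are concatenated with a small set of learnable latent tokens, passed through a Transformer encoder, and only the latent-token outputs are vector-quantized into 32 discrete codes. The class name, module sizes, and the nearest-neighbour quantizer are illustrative assumptions, not the released TiTok implementation.

```python
import torch
import torch.nn as nn

class OneDTokenizerSketch(nn.Module):
    """Toy 1D image tokenizer: image patches + K latent tokens -> K discrete codes."""
    def __init__(self, image_size=256, patch=16, dim=256, num_latents=32, codebook_size=4096):
        super().__init__()
        num_patches = (image_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.pos = nn.Parameter(torch.randn(1, num_patches + num_latents, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.codebook = nn.Embedding(codebook_size, dim)   # VQ codebook
        self.num_latents = num_latents

    def forward(self, images):
        b = images.shape[0]
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)      # (B, N, D)
        x = torch.cat([patches, self.latents.expand(b, -1, -1)], dim=1) + self.pos
        x = self.encoder(x)
        z = x[:, -self.num_latents:]                                       # keep latent tokens only
        # nearest-neighbour vector quantization of the latent tokens
        dist = torch.cdist(z.reshape(-1, z.shape[-1]), self.codebook.weight)
        codes = dist.argmin(dim=-1).view(b, self.num_latents)              # (B, 32) discrete ids
        z_q = self.codebook(codes)                                         # quantized latents
        return codes, z_q

tok = OneDTokenizerSketch()
codes, z_q = tok(torch.randn(2, 3, 256, 256))
print(codes.shape)   # torch.Size([2, 32]): a 256x256 image as 32 tokens
```
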
2406.07547 | Zero-shot Image Editing with Reference Imitation
Authors: Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, Hengshuang Zhao
Abstract: Image editing serves as a practical yet challenging task considering the diverse demands from users, where one of the hardest parts is to precisely describe what the edited image should look like. In this work, we present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently. Concretely, to edit an image region of interest, users are free to directly draw inspiration from some in-the-wild references (e.g., related pictures they come across online), without having to cope with the fit between the reference and the source. Such a design requires the system to automatically figure out what to expect from the reference to perform the editing. For this purpose, we propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using the information from the other frame. That way, our model, developed from a diffusion prior, is able to capture the semantic correspondence between separate images in a self-supervised manner. We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives. We also construct a benchmark to facilitate further research.
What: Introduces "imitative editing", a new image editing paradigm where users simply provide a masked source image and an unmasked reference image, enabling editing by imitating corresponding parts from the reference.
Why: Addresses limitations of existing editing tools that rely heavily on textual descriptions or struggle with local component editing, providing a more convenient and intuitive editing experience.
How: Presents MimicBrush, a framework trained on video frames using dual diffusion U-Nets. It learns to locate and imitate corresponding regions from a reference image to fill masked areas in a source image, ensuring harmonious blending.
Result: MimicBrush outperforms existing inpainting and composition methods qualitatively and quantitatively in terms of fidelity and harmonious blending. A new benchmark for evaluating imitative editing is introduced, focusing on part composition and texture transfer tasks. Ablation studies confirm the importance of video-based training, data augmentation, and the dual U-Net architecture for optimal performance.
Limitations/Future Work: MimicBrush may struggle to identify the correct reference region when it is too small or multiple similar candidates exist in the reference. Future work will focus on improving reference region localization and extending MimicBrush's capabilities to handle multiple reference images.
Tags: image editing, imitative editing, diffusion models, semantic correspondence, image composition

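The self-supervised training signal described above (mask part of one video frame, recover it from another frame of the same clip) can be sketched in a few lines. The helper below is a hypothetical data-preparation function, not code from the paper; frame sampling, the square mask shape, and augmentations are simplifications.

```python
import torch

def make_imitative_training_triplet(video, mask_size=64):
    """Sample a self-supervised (masked source, reference, mask, target) tuple from a clip.

    video: (T, C, H, W). Two frames are drawn; a random square region of one frame
    is masked and must be recovered using the other, unmasked frame as reference.
    """
    t, _, h, w = video.shape
    i, j = torch.randperm(t)[:2]
    target = video[i]                                 # ground truth for the masked frame
    reference = video[j]                              # unmasked frame from the same clip
    y = torch.randint(0, h - mask_size, (1,)).item()
    x = torch.randint(0, w - mask_size, (1,)).item()
    mask = torch.zeros(1, h, w)
    mask[:, y:y + mask_size, x:x + mask_size] = 1.0
    masked_source = target * (1.0 - mask)             # what the model sees
    return masked_source, reference, mask, target

src, ref, m, tgt = make_imitative_training_triplet(torch.rand(16, 3, 256, 256))
print(src.shape, ref.shape, m.shape)
```
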
2406.07540 | Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance
Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou
Abstract: Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion that controls structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable structure alignment with a structure image and semantic-aware appearance transfer to facilitate appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: https://genforce.github.io/ctrl-x
What: Ctrl-X is a training-free and guidance-free framework for structure and appearance control of text-to-image and text-to-video diffusion models.
Why: Existing methods for controlling the structure and appearance of diffusion models often require extensive training or computationally expensive guidance techniques, limiting their flexibility and efficiency.
How: Ctrl-X leverages feature injection and spatially-aware normalization in the attention layers of pretrained diffusion models to align generated images with user-provided structure and appearance images. Structure control is achieved through direct feature injection from a noisy structure latent, while appearance transfer utilizes self-attention correspondence to normalize output features with weighted feature statistics from a noisy appearance latent.
Result: Ctrl-X accurately preserves structure from various input types, including natural images, ControlNet-supported conditions, and in-the-wild conditions not possible with existing training-based methods. It effectively transfers appearance from a given image, demonstrating superior performance compared to training-based and guidance-based methods, especially in challenging cases such as cross-subject appearance transfer. Being both training-free and guidance-free, Ctrl-X achieves runtimes competitive with training-based methods while being significantly faster than other guidance-based and guidance-free approaches.
Limitations/Future Work: The semantic-aware appearance transfer may struggle to capture the target appearance when the subject is small, due to the low resolution of the feature map. While Ctrl-X inherits the same safeguards as the T2I and T2V models it builds upon, its accessibility could potentially be misused for malicious applications, raising ethical concerns regarding consent and artist credit.
Tags: generative models, text-to-image synthesis, diffusion models, controllable image generation, appearance transfer

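The appearance-transfer building block can be illustrated with a per-channel statistics swap between two feature maps of the same diffusion layer. This is a deliberately simplified, global (non-semantic) version of the spatially weighted normalization described above; the function name and shapes are assumptions, not the paper's implementation.

```python
import torch

def appearance_normalize(struct_feat: torch.Tensor,
                         appear_feat: torch.Tensor,
                         eps: float = 1e-5) -> torch.Tensor:
    """Re-normalize structure-branch features with the channel statistics of the
    appearance-branch features (a global simplification of the weighted statistics
    used for semantic-aware appearance transfer).

    Both inputs are (B, C, H, W) feature maps from the same diffusion layer.
    """
    s_mean = struct_feat.mean(dim=(2, 3), keepdim=True)
    s_std = struct_feat.std(dim=(2, 3), keepdim=True) + eps
    a_mean = appear_feat.mean(dim=(2, 3), keepdim=True)
    a_std = appear_feat.std(dim=(2, 3), keepdim=True) + eps
    # whiten with structure statistics, then recolor with appearance statistics
    return (struct_feat - s_mean) / s_std * a_std + a_mean

out = appearance_normalize(torch.randn(1, 320, 64, 64), torch.randn(1, 320, 64, 64))
print(out.shape)   # torch.Size([1, 320, 64, 64])
```
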
2406.07537 | Autoregressive Pretraining with Mamba in Vision
Authors: Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie
Abstract: The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on Mamba's unidirectional recurrent structure, enabling faster overall training speed compared to other training strategies such as mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy than its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384x384 inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.
What: This paper introduces ARM, a novel autoregressive pretraining strategy tailored for Mamba architectures in computer vision, enhancing their visual capabilities, scalability, and benchmark performance.
Why: Prior Mamba architectures for vision, while promising, faced limitations in transferability and scalability, and struggled to match the success of autoregressive pretraining in NLP.
How: ARM leverages the inherent unidirectional nature of Mamba for efficient autoregressive pretraining, using clustered image patches as prediction units for enhanced performance.
Result: ARM significantly boosts ImageNet accuracy, with ARM-B achieving 83.2%, outperforming its supervised counterpart by 2.0% and previous Mamba variants. ARM enables successful training of the largest vision Mamba model to date (ARM-H), reaching 85.0% accuracy on ImageNet. ARM enhances robustness, with significant performance gains over supervised counterparts on out-of-domain ImageNet variants such as ImageNet-A, ImageNet-R, and ImageNet-S.
Limitations/Future Work: The study primarily focuses on image classification, leaving its application to other vision tasks for future work. Exploring more complex pretraining strategies or incorporating additional data augmentations could further enhance ARM's performance.
Tags: autoregressive pretraining, vision mamba, self-supervised learning, image classification, computer vision

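A toy version of the autoregressive pretraining objective: predict each image patch from the preceding ones. The sketch below substitutes a causal Transformer encoder for the Mamba backbone and regresses raw patch pixels with an L2 loss; the patch clustering and exact prediction target in the paper are more elaborate, so treat this only as an illustration of the idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARPatchPretrainSketch(nn.Module):
    """Next-patch prediction on images (a causal Transformer stands in for Mamba)."""
    def __init__(self, image_size=224, patch=16, dim=384):
        super().__init__()
        self.patch = patch
        num_patches = (image_size // patch) ** 2
        self.embed = nn.Linear(3 * patch * patch, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, 3 * patch * patch)      # regress the next patch's pixels

    def patchify(self, x):
        p = self.patch
        b, c, _, _ = x.shape
        x = x.unfold(2, p, p).unfold(3, p, p)               # (B, C, H/p, W/p, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

    def forward(self, images):
        targets = self.patchify(images)                      # (B, N, 3*p*p), raster order
        tokens = self.embed(targets) + self.pos
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.backbone(tokens, mask=causal)               # unidirectional context only
        pred = self.head(h[:, :-1])                          # token i predicts patch i+1
        return F.mse_loss(pred, targets[:, 1:])

loss = ARPatchPretrainSketch()(torch.randn(2, 3, 224, 224))
loss.backward()
```
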
2406.07524 | Simple and Effective Masked Diffusion Language Models
Authors: Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov
Abstract: While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form -- it is a mixture of classical masked language modeling losses -- and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We release our code at: https://github.com/kuleshov-group/mdlm
What: The paper presents a well-engineered masked discrete diffusion language modeling (MDLM) framework that outperforms existing diffusion models on language modeling benchmarks, approaching the perplexity of autoregressive (AR) models.
Why: Diffusion models have the potential to improve long-term planning, controllable generation, and sampling speed in language modeling, but previous approaches exhibit a performance gap compared to AR models.
How: The authors utilize a simplified, Rao-Blackwellized objective and a substitution-based parameterization of the reverse diffusion process, along with efficient samplers that support semi-autoregressive generation.
Result: MDLM achieves a new state-of-the-art among diffusion models on language modeling benchmarks, including One Billion Words and OpenWebText. Simple engineering choices significantly improve the performance of MDLM and previously discounted baselines like D3PM. The MDLM framework extends to non-language domains, achieving comparable or superior downstream performance to classical BERT-style training on DNA sequence modeling.
Limitations/Future Work: MDLM perplexity remains slightly higher than AR models. Future work includes exploring more sophisticated denoising network architectures and extending the framework to other discrete data domains.
Tags: diffusion models, language modeling, rao-blackwellization, semi-autoregressive generation, dna sequence modeling

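Under a linear noise schedule, the simplified objective reduces to a weighted masked-language-modeling loss: sample a masking level t, mask that fraction of tokens, and weight the cross-entropy on the masked positions by 1/t. The sketch below encodes that reading of the "mixture of masked language modeling losses" with a generic encoder-only model; it is an assumption-laden illustration, not the released MDLM code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id, eps=1e-3):
    """Continuous-time masked-diffusion LM loss with a linear schedule.

    tokens: (B, L) token ids; model(x) must return (B, L, V) logits.
    With alpha_t = 1 - t, each token is masked independently with prob t, and
    the cross-entropy on masked positions is weighted by 1/t.
    """
    b, l = tokens.shape
    t = torch.rand(b, 1, device=tokens.device).clamp(min=eps)       # masking level per sample
    mask = torch.rand(b, l, device=tokens.device) < t               # which tokens to mask
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                                           # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    ce = ce * mask / t                                              # masked positions only, 1/t weight
    return ce.sum(dim=1).mean()

# toy usage with a tiny encoder-only model
vocab, mask_id = 1000, 999
toy = nn.Sequential(
    nn.Embedding(vocab, 128),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(128, nhead=4, batch_first=True), 2),
    nn.Linear(128, vocab),
)
loss = masked_diffusion_loss(toy, torch.randint(0, vocab - 1, (4, 64)), mask_id)
loss.backward()
```
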
2406.07520 | Neural Gaffer: Relighting Any Object via Diffusion
Authors: Haian Jin, Yuan Li, Fujun Luan, Yuanbo Xiangli, Sai Bi, Kai Zhang, Zexiang Xu, Jin Sun, Noah Snavely
Abstract: Single-image relighting is a challenging task that involves reasoning about the complex interplay between geometry, materials, and lighting. Many prior methods either support only specific categories of images, such as portraits, or require special capture conditions, like using a flashlight. Alternatively, some methods explicitly decompose a scene into intrinsic components, such as normals and BRDFs, which can be inaccurate or under-expressive. In this work, we propose a novel end-to-end 2D relighting diffusion model, called Neural Gaffer, that takes a single image of any object and can synthesize an accurate, high-quality relit image under any novel environmental lighting condition, simply by conditioning an image generator on a target environment map, without an explicit scene decomposition. Our method builds on a pre-trained diffusion model, and fine-tunes it on a synthetic relighting dataset, revealing and harnessing the inherent understanding of lighting present in the diffusion model. We evaluate our model on both synthetic and in-the-wild Internet imagery and demonstrate its advantages in terms of generalization and accuracy. Moreover, by combining with other generative methods, our model enables many downstream 2D tasks, such as text-based relighting and object insertion. Our model can also operate as a strong relighting prior for 3D tasks, such as relighting a radiance field.
What: This paper introduces Neural Gaffer, an end-to-end 2D relighting diffusion model capable of relighting objects from arbitrary categories under novel environmental lighting conditions specified as HDR environment maps.
Why: Single-image relighting is challenging due to the complex interplay between geometry, materials, and lighting, with prior methods often limited to specific object categories or requiring special capture conditions.
How: The method leverages a pre-trained diffusion model fine-tuned on a synthetic relighting dataset (RelitObjaverse) derived from Objaverse. Key innovations include rotating the target environment map to align with the target camera frame and a novel HDR-LDR conditioning strategy to effectively encode the full lighting energy spectrum.
Result: Neural Gaffer exhibits superior generalization and accuracy in single-image relighting compared to recent baselines, accurately reproducing highlights, shadows, and reflections. The model effectively supports downstream 2D tasks such as text-based relighting and object insertion, demonstrating its versatility. Neural Gaffer serves as a powerful relighting prior for 3D tasks, enabling high-quality relighting of neural radiance fields within minutes using a proposed two-stage pipeline.
Limitations/Future Work: The model may exhibit minor inconsistencies in relighting results under changing lighting conditions due to its generative nature. The reliance on a low-resolution backbone diffusion model limits its ability to handle higher image resolutions.
Tags: relighting, diffusion models, neural radiance fields, image editing, computer vision

2406.07516 | Instant 3D Human Avatar Generation using Image Diffusion Models
Authors: Nikos Kolotouros, Thiemo Alldieck, Enric Corona, Eduard Gabriel Bazavan, Cristian Sminchisescu
Abstract: We present AvatarPopUp, a method for fast, high-quality 3D human avatar generation from different input modalities, such as images and text prompts, and with control over the generated pose and shape. The common theme is the use of diffusion-based image generation networks that are specialized for each particular task, followed by a 3D lifting network. We purposefully decouple the generation from the 3D modeling, which allows us to leverage powerful image synthesis priors, trained on billions of text-image pairs. We fine-tune latent diffusion networks with additional image conditioning to solve tasks such as image generation and back-view prediction, and to support qualitatively different multiple 3D hypotheses. Our partial fine-tuning approach allows us to adapt the networks for each task without inducing catastrophic forgetting. In our experiments, we demonstrate that our method produces accurate, high-quality 3D avatars with diverse appearance that respect the multimodal text, image, and body control signals. Our approach can produce a 3D model in as few as 2 seconds, a four orders of magnitude speedup w.r.t. the vast majority of existing methods, most of which solve only a subset of our tasks, and with fewer controls, thus enabling applications that require the controlled 3D generation of human avatars at scale. The project website can be found at https://www.nikoskolot.com/avatarpopup/.
What: AvatarPopUp is a method for instant generation of rigged full-body 3D human avatars from text, images, and/or body pose and shape.
Why: Existing text-to-3D human generation methods are optimization-based, taking minutes to hours per instance, while image-based methods lack control and diversity. AvatarPopUp closes this gap by enabling instant, controllable, and diverse 3D human avatar creation.
How: AvatarPopUp decouples the generation process into two stages: (1) text-to-image generation using fine-tuned latent diffusion models, conditioned on text prompts and optionally body pose and shape; (2) 3D lifting using a unimodal, feed-forward image-to-3D model trained on a smaller dataset, taking front and back views (generated or input) as input.
Result: AvatarPopUp generates high-quality, diverse 3D avatars consistent with text prompts and body controls in 2-10 seconds. It achieves state-of-the-art performance in single-image 3D reconstruction, outperforming baselines in generating detailed back views and normals. The method enables 3D virtual try-on applications, preserving identity while allowing garment editing with realistic wrinkles and details.
Limitations/Future Work: Limitations inherent to pixel-aligned methods persist, with less detailed regions parallel to camera rays and potential artifacts in under-represented poses or clothing. Future work includes exploring alternative 3D construction strategies beyond pixel-aligned features and expanding applications in various fields.
Tags: 3d human avatar generation, text-to-3d, image-to-3d, diffusion models, virtual try-on

2406.07502 | Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
Authors: Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang
Abstract: Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. Another is through human labeling. Datasets such as COCO are generally very short and lack details. Although detailed image descriptions can be annotated by humans, the high annotation cost limits the feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, which maximally convert the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verifies the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions, substantially increasing the length and detail of its output with less hallucination.
What: The paper proposes Image Textualization (IT), a framework for automatically generating detailed and accurate image descriptions without human intervention.
Why: High-quality image descriptions are crucial for various applications, but existing datasets are limited by low quality (web-scraped) or high annotation cost (human-labeled).
How: IT leverages MLLMs to generate a base description, uses vision expert models to extract fine-grained details and detect hallucinations, and finally employs LLMs to reconstruct a richer and more accurate description based on the textual information.
Result: IT-generated descriptions outperform MLLM-generated descriptions in capturing comprehensive visual information and reducing hallucinations. Fine-tuning MLLMs with IT-curated data significantly improves their ability to generate detailed and accurate descriptions, approaching the performance of more powerful MLLMs. Evaluation benchmarks (DID-Bench, D2I-Bench, LIN-Bench) are proposed to assess the quality of detailed descriptions.
Limitations/Future Work: Tuning larger MLLMs with IT-curated data was not explored due to computational limitations. Future work could investigate incorporating additional vision experts and exploring different recaptioning strategies.
Tags: image description generation, multi-modal large language models, vision expert models, hallucination detection, detailed image description datasets

2406.07499 | Trim 3D Gaussian Splatting for Accurate Geometry Representation
Authors: Lue Fan, Yuxue Yang, Minxing Li, Hongsheng Li, Zhaoxiang Zhang
Abstract: In this paper, we introduce Trim 3D Gaussian Splatting (TrimGS) to reconstruct accurate 3D geometry from images. Previous arts for geometry reconstruction from 3D Gaussians mainly focus on exploring strong geometry regularization. Instead, from a fresh perspective, we propose to obtain accurate 3D geometry of a scene by Gaussian trimming, which selectively removes the inaccurate geometry while preserving accurate structures. To achieve this, we analyze the contributions of individual 3D Gaussians and propose a contribution-based trimming strategy to remove the redundant or inaccurate Gaussians. Furthermore, our experimental and theoretical analyses reveal that a relatively small Gaussian scale is a non-negligible factor in representing and optimizing the intricate details. Therefore the proposed TrimGS maintains relatively small Gaussian scales. In addition, TrimGS is also compatible with the effective geometry regularization strategies in previous arts. When combined with the original 3DGS and the state-of-the-art 2DGS, TrimGS consistently yields more accurate geometry and higher perceptual quality. Our project page is https://trimgs.github.io
What: Presents TrimGS, a novel technique for reconstructing accurate 3D geometry from images using a contribution-based Gaussian trimming strategy, complementing existing geometric regularization methods.
Why: Addresses the limitations of previous 3D Gaussian Splatting (3DGS) methods that rely heavily on geometric regularization, which often struggle to capture intricate geometric details.
How: Introduces a novel contribution-based trimming strategy that selectively removes inaccurate or redundant Gaussians based on their contributions to the rendered images. It also proposes maintaining relatively small Gaussian scales during training to enhance detail representation and optimize high-frequency regions.
Result: TrimGS, when applied to both 3DGS and 2DGS, consistently produces more accurate geometry as measured by Chamfer Distance on the DTU dataset. Analysis of raw point clouds (Gaussian centers) demonstrates the effectiveness of TrimGS in preserving geometric details. TrimGS, particularly when combined with 2DGS, enhances the perceptual rendering quality, especially in high-frequency regions, mitigating the over-smoothness often observed in 2DGS.
Limitations/Future Work: Despite emphasizing Gaussian trimming, TrimGS still relies on geometric regularization, which can slightly compromise rendering quality compared to the original 3DGS. Future work will explore the challenge of simultaneously achieving high rendering fidelity and accurate geometry reconstruction.
Tags: 3d gaussian splatting, geometry reconstruction, novel view synthesis, gaussian trimming, perceptual rendering quality

2406.07496 | TextGrad: Automatic "Differentiation" via Text
Authors: Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou
Abstract: AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days, until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic "differentiation" via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy-to-use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields a 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.
What: Introduces TextGrad, a framework for automatic differentiation via text, which uses textual feedback from LLMs to optimize components of compound AI systems.
Why: Addresses the challenge of optimizing complex AI systems composed of multiple LLMs and tools, a task not easily addressed by traditional gradient-based methods.
How: Represents AI systems as computation graphs where variables are connected by arbitrary functions (e.g., LLM calls, simulators). Employs LLMs to provide natural language feedback ("textual gradients") on how to modify variables to improve a downstream objective. These gradients are backpropagated through the graph to update variables.
Result: Improves the zero-shot accuracy of GPT-4o on the Google-Proof Question Answering benchmark from 51% to 55%. Achieves a 20% relative performance gain in optimizing solutions to LeetCode-Hard coding problems compared to existing methods. Enhances prompts for reasoning tasks, pushing GPT-3.5 performance closer to GPT-4.
Limitations/Future Work: The current implementation primarily focuses on text data; extending to other data modalities is important future work. Exploring more sophisticated optimization techniques inspired by the numerical optimization literature could further improve performance and stability.
Tags: large language models, automatic differentiation, compound ai systems, optimization, textual feedback

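The idea of backpropagating textual feedback can be mimicked in a few lines with a single LLM client. The sketch below uses a hypothetical call_llm stub (any chat-completion API would do) and is only a schematic of the forward/backward/step loop; it is not the TextGrad package API.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (plug in any chat-completion client)."""
    raise NotImplementedError("connect this to your LLM provider of choice")

def textual_gradient_step(variable: str, objective: str) -> str:
    """One optimization step on a text variable using textual feedback."""
    # forward pass: evaluate the current variable against the objective
    evaluation = call_llm(
        f"Evaluate the following text against this objective.\n"
        f"Objective: {objective}\nText: {variable}\n"
        f"List concrete weaknesses."
    )
    # 'backward' pass: turn the evaluation into a textual gradient (a critique)
    gradient = call_llm(
        f"Given these weaknesses:\n{evaluation}\n"
        f"Describe precisely how the text should change to improve."
    )
    # 'optimizer' step: apply the critique to produce an updated variable
    return call_llm(
        f"Rewrite the text below, applying this feedback.\n"
        f"Feedback: {gradient}\nText: {variable}\nReturn only the revised text."
    )

# e.g., iterate a few steps to refine a prompt or a candidate code solution
# solution = "def is_prime(n): ..."
# for _ in range(3):
#     solution = textual_gradient_step(solution, "a correct and efficient primality test")
```
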
2406.07488 | ReduceFormer: Attention with Tensor Reduction by Summation
Authors: John Yang, Le An, Su Inn Park
Abstract: Transformers have excelled in many tasks including vision. However, efficient deployment of transformer models in low-latency or high-throughput applications is hindered by the computation in the attention mechanism, which involves expensive operations such as matrix multiplication and Softmax. To address this, we introduce ReduceFormer, a family of models optimized for efficiency with the spirit of attention. ReduceFormer leverages only simple operations such as reduction and element-wise multiplication, leading to a greatly simplified architecture and improved inference performance, with up to 37% reduction in latency and 44% improvement in throughput, while maintaining competitive accuracy comparable to other recent methods. The proposed model family is suitable for edge devices where compute resource and memory bandwidth are limited, as well as for cloud computing where high throughput is sought after.
What: Introduces ReduceFormer, a family of efficient vision transformer models utilizing simple operations such as reduction and element-wise multiplication to improve efficiency without significant accuracy loss.
Why: Addresses the computational and memory challenges of deploying transformer models in low-latency or high-throughput applications, particularly on resource-constrained edge devices.
How: Replaces complex attention mechanisms with a combination of multi-scale local context learning and ReduceFormer Attention, which leverages global summation and element-wise operations to approximate global feature relationships.
Result: Achieves competitive accuracy comparable to other state-of-the-art methods such as EfficientViT on the ImageNet-1K benchmark. Demonstrates significant speedup, with up to 37% reduction in latency on the NVIDIA DRIVE Orin SoC and up to 44% improvement in throughput on the L40 GPU compared to EfficientViT. Reduces memory footprint, making it suitable for deployment on edge devices with limited memory bandwidth.
Limitations/Future Work: Current work focuses on image classification, leaving exploration of other vision tasks for future research. Further optimization of ReduceFormer for specific hardware platforms could potentially yield additional performance gains.
Tags: vision transformers, efficient deep learning, attention mechanisms, edge computing, computer vision

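One plausible reading of "attention with tensor reduction by summation" is a linear-attention-style block in which a global context vector, obtained by summing k*v over all spatial positions, modulates the queries element-wise, with no QK^T matrix and no Softmax. The sketch below illustrates that reading only; it is not the authors' exact block, and the normalization choice is an assumption.

```python
import torch
import torch.nn as nn

class ReductionAttentionSketch(nn.Module):
    """Attention-like global mixing using only summation and element-wise products."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                                    # x: (B, C, H, W)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # global context by tensor reduction: sum of (k * v) over all positions
        context = (k * v).sum(dim=(2, 3), keepdim=True)      # (B, C, 1, 1)
        norm = k.sum(dim=(2, 3), keepdim=True) + 1e-6        # normalizer, also a reduction
        out = q * (context / norm)                           # element-wise modulation, no Softmax
        return self.proj(out)

blk = ReductionAttentionSketch(64)
print(blk(torch.randn(2, 64, 56, 56)).shape)                 # torch.Size([2, 64, 56, 56])
```
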
2406.07480 | Image Neural Field Diffusion Models
Authors: Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, Michael Gharbi
Abstract: Diffusion models have shown an impressive ability to model complex data distributions, with several key advantages over GANs, such as stable training, better coverage of the training distribution's modes, and the ability to solve inverse problems without extra training. However, most diffusion models learn the distribution of fixed-resolution images. We propose to learn the distribution of continuous images by training diffusion models on image neural fields, which can be rendered at any resolution, and show its advantages over fixed-resolution models. To achieve this, a key challenge is to obtain a latent space that represents photorealistic image neural fields. We propose a simple and effective method, inspired by several recent techniques but with key changes to make the image neural fields photorealistic. Our method can be used to convert existing latent diffusion autoencoders into image neural field autoencoders. We show that image neural field diffusion models can be trained using mixed-resolution image datasets, outperform fixed-resolution diffusion models followed by super-resolution models, and can solve inverse problems with conditions applied at different scales efficiently.
What: This paper proposes Image Neural Field Diffusion models (INFD), which learn the distribution of continuous images via neural fields, enabling resolution-agnostic image generation and editing.
Why: Current diffusion models are limited to fixed-resolution image generation, requiring separate super-resolution models for high-resolution synthesis. INFD overcomes this by learning a continuous image representation, facilitating efficient high-resolution generation and multi-scale image editing.
How: The method involves two stages: 1) train an image neural field autoencoder that maps images to and from a latent space representing continuous image neural fields; 2) train a diffusion model on this latent space to generate new images. A novel convolutional local image function (CLIF) renderer is introduced for efficient and photorealistic neural field rendering.
Result: INFD outperforms fixed-resolution diffusion models followed by super-resolution in terms of image quality and detail preservation. The model effectively learns from mixed-resolution datasets, even with limited high-resolution training data. INFD enables efficient solving of inverse problems with multi-scale conditions, such as layout-to-image generation.
Limitations/Future Work: The method assumes scale-consistency in training data, limiting its performance on datasets with significant discrepancies between low and high-resolution images. The current implementation relies on a fixed-resolution encoder; exploring efficient any-resolution encoders is left for future work.
Tags: diffusion models, neural fields, image generation, super-resolution, image editing

2406.07476 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
Abstract: In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
What: Introduces VideoLLaMA 2, a series of Video Large Language Models (Video-LLMs) that improve spatial-temporal modeling and audio understanding in video and audio-oriented tasks.
Why: Video understanding and generation are important AI capabilities, and existing Video-LLMs struggle to process temporal dynamics and integrate audio cues effectively.
How: VideoLLaMA 2 builds on its predecessor with a new Spatial-Temporal Convolution (STC) connector for capturing spatial and temporal dynamics. It also integrates an Audio Branch through joint training to incorporate audio cues, enhancing multimodal understanding.
Result: VideoLLaMA 2 achieves competitive results against open-source models on MC-VQA, OE-VQA, and VC tasks, even approaching the performance of some proprietary models. The model exhibits significant improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks. VideoLLaMA 2 showcases a deeper understanding of multimodal content, excelling in tasks requiring interpretation of both visual and auditory information.
Limitations/Future Work: The model shows limitations in video tasks heavily reliant on static visual information, suggesting a potential area for improvement. Future work could explore the integration of other popular LLMs, such as Gemma-IT, LLaMA3-Instruct, and Qwen2-Instruct, as the backbone.
Tags: video language models, multimodal understanding, spatial-temporal modeling, audio-visual integration, video question answering

2406.07472 | 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
Authors: Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee
Abstract: Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependency on multi-view generative models and instead fully utilizing video generative models trained on diverse real-world datasets. Our method begins by generating a reference video using the video generation model. We then learn the canonical 3D representation of the video using a freeze-time video, delicately generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation to model these imperfections. We then learn the temporal deformation based on the canonical representation to capture dynamic interactions in the reference video. The pipeline facilitates the generation of dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.
What: Introduces 4Real, a novel pipeline for photorealistic text-to-4D scene generation that leverages video generative models trained on real-world datasets.
Why: Addresses limitations of existing 4D generation methods, which often produce object-centric and unrealistic scenes due to reliance on multi-view models trained on synthetic data.
How: Generates a reference video and a freeze-time video using a video diffusion model. Reconstructs canonical 3D Gaussian Splats (3DGS) from the freeze-time video, modeling inconsistencies as per-frame deformations. Learns temporal deformation from the reference video using the reconstructed 3DGS and video score distillation sampling (SDS).
Result: Achieves text-driven dynamic scene generation with near-photorealistic appearance and realistic 3D motions. Outperforms state-of-the-art object-centric text-to-4D generation methods in user studies across various realism and quality metrics. Offers greater flexibility, diversity, and computational efficiency compared to methods relying solely on score distillation sampling.
Limitations/Future Work: Inherits limitations from the underlying video generation model, such as resolution constraints and occasional artifacts. Reconstruction can be challenging with complex scenes involving rapid movements or lighting changes.
Tags: 4d scene generation, text-to-video, deformable 3d gaussian splats, score distillation sampling, photorealistic rendering

2406.07251 | Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models
Authors: Athanasios Tragakis, Marco Aversa, Chaitanya Kaul, Roderick Murray-Smith, Daniele Faccio
Abstract: In this work, we introduce Pixelsmith, a zero-shot text-to-image generative framework to sample images at higher resolutions with a single GPU. We are the first to show that it is possible to scale the output of a pre-trained diffusion model by a factor of 1000, opening the road for gigapixel image generation at no additional cost. Our cascading method uses the image generated at the lowest resolution as a baseline to sample at higher resolutions. For the guidance, we introduce the Slider, a tunable mechanism that fuses the overall structure contained in the first-generated image with enhanced fine details. At each inference step, we denoise patches rather than the entire latent space, minimizing memory demands such that a single GPU can handle the process, regardless of the image's resolution. Our experimental results show that Pixelsmith not only achieves higher quality and diversity compared to existing techniques, but also reduces sampling time and artifacts. The code for our work is available at https://github.com/Thanos-DB/Pixelsmith.
What: Pixelsmith is a zero-shot text-to-image framework that generates gigapixel images using a single consumer-grade GPU by scaling the output of pretrained diffusion models.
Why: Existing methods for high-resolution image generation are limited by computational resources, memory efficiency, and the introduction of artifacts. Pixelsmith addresses these limitations.
How: The method uses a cascaded approach, generating a base image at a lower resolution and then upscaling it. It introduces a "Slider" mechanism to control the level of detail and a patch denoising process for memory efficiency.
Result: Pixelsmith achieves higher quality and diversity compared to existing techniques. The method reduces sampling time and artifacts. It allows for flexible scaling to ultra-high resolutions on limited hardware.
Limitations/Future Work: Suppressing artifacts becomes increasingly difficult at higher resolutions. Appropriate metrics for evaluating high-resolution image generation are still lacking.
Tags: image generation, diffusion models, high-resolution, gigapixel, patch denoising

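The memory trick is the patch-wise inner loop: at each denoising step, only a small tile of the large latent is pushed through the model. The sketch below shows that loop with a stub denoiser and non-overlapping tiles; the actual method uses overlapping patches, the Slider fusion with the low-resolution result, and a real diffusion UNet, so treat this as a simplified illustration.

```python
import torch

def denoise_step_patchwise(latent: torch.Tensor, t: int, denoiser,
                           patch: int = 64) -> torch.Tensor:
    """Apply one denoising step to a large latent, one tile at a time.

    latent: (B, C, H, W) with H, W multiples of `patch`.
    denoiser(x, t) -> denoised tile; only one tile ever sits in GPU memory.
    """
    out = torch.empty_like(latent)
    _, _, h, w = latent.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tile = latent[:, :, i:i + patch, j:j + patch]
            out[:, :, i:i + patch, j:j + patch] = denoiser(tile, t)
    return out

# toy usage: the "denoiser" is a stub; a real one would be a diffusion UNet
stub = lambda x, t: 0.9 * x
big_latent = torch.randn(1, 4, 512, 512)        # far larger than the model's native latent size
for t in reversed(range(5)):
    big_latent = denoise_step_patchwise(big_latent, t, stub)
print(big_latent.shape)
```
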
2406.07209 | MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance
Authors: X. Wang, Siming Fu, Qihan Huang, Wanggui He, Hao Jiang
Abstract: Recent advancements in text-to-image generation models have dramatically enhanced the generation of photorealistic images from textual prompts, leading to an increased interest in personalized text-to-image applications, particularly in multi-subject scenarios. However, these advances are hindered by two main challenges: firstly, the need to accurately maintain the details of each referenced subject in accordance with the textual descriptions; and secondly, the difficulty in achieving a cohesive representation of multiple subjects in a single image without introducing inconsistencies. To address these concerns, our research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects. This innovative approach integrates grounding tokens with the feature resampler to maintain detail fidelity among subjects. With the layout guidance, MS-Diffusion further improves the cross-attention to adapt to the multi-subject inputs, ensuring that each subject condition acts on specific areas. The proposed multi-subject cross-attention orchestrates harmonious inter-subject compositions while preserving the control of texts. Comprehensive quantitative and qualitative experiments affirm that this method surpasses existing models in both image and text fidelity, promoting the development of personalized text-to-image generation.
What: MS-Diffusion, a novel layout-guided, zero-shot framework for multi-subject image personalization within diffusion models.
Why: Addresses limitations of existing personalized text-to-image generation models in maintaining subject detail fidelity and achieving cohesive multi-subject representation.
How: Introduces a grounding resampler to extract and integrate subject features with grounding information (entities, bounding boxes) and a multi-subject cross-attention mechanism to confine subjects to specific areas guided by layout priors.
Result: Achieves superior image fidelity and detail retention in single-subject personalization. Demonstrates robust multi-subject generation with natural interactions and distinct subject representation. Exhibits high text fidelity, preserving text control capabilities while incorporating multiple subject references.
Limitations/Future Work: Lacks precision in subject positioning due to box-based layout indication. The explicit layout input requirement during inference limits complex scene generation.
Tags: image-personalization, multi-subject-generation, diffusion-models, zero-shot-learning, layout-guidance

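The layout guidance can be viewed as a cross-attention mask: image positions inside a subject's bounding box may attend to that subject's reference tokens, and nothing else. The helper below builds such a mask from normalized boxes; the fixed tokens-per-subject count and the masking convention are illustrative assumptions rather than the paper's exact mechanism.

```python
import torch

def layout_attention_mask(boxes, latent_hw, tokens_per_subject):
    """Boolean mask for multi-subject cross-attention.

    boxes: (S, 4) normalized (x0, y0, x1, y1) boxes, one per reference subject.
    Returns a (latent_hw*latent_hw, S*tokens_per_subject) mask where True means
    "this image position may attend to this subject's tokens".
    """
    ys, xs = torch.meshgrid(torch.linspace(0, 1, latent_hw),
                            torch.linspace(0, 1, latent_hw), indexing="ij")
    pos = torch.stack([xs.flatten(), ys.flatten()], dim=-1)                  # (HW, 2) as (x, y)
    inside = ((pos[:, None, 0] >= boxes[None, :, 0]) & (pos[:, None, 0] <= boxes[None, :, 2]) &
              (pos[:, None, 1] >= boxes[None, :, 1]) & (pos[:, None, 1] <= boxes[None, :, 3]))
    return inside.repeat_interleave(tokens_per_subject, dim=1)               # (HW, S*T)

mask = layout_attention_mask(torch.tensor([[0.0, 0.0, 0.5, 1.0],
                                           [0.5, 0.0, 1.0, 1.0]]),
                             latent_hw=64, tokens_per_subject=16)
# inside cross-attention: scores = scores.masked_fill(~mask, float("-inf"))
print(mask.shape)   # torch.Size([4096, 32])
```
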
2406.07170 | VoxNeuS: Enhancing Voxel-Based Neural Surface Reconstruction via Gradient Interpolation
Authors: Sidun Liu, Peng Qiao, Zongxin Ye, Wenyu Li, Yong Dou
Abstract: Neural surface reconstruction learns a Signed Distance Field (SDF) to reconstruct the 3D model from multi-view images. Previous works adopt voxel-based explicit representations to improve efficiency. However, they ignore the gradient instability of interpolation in the voxel grid, leading to degradation in convergence and smoothness. Besides, previous works entangle the optimization of geometry and radiance, which leads to the deformation of geometry to explain radiance, causing artifacts when reconstructing textured planes. In this work, we reveal that the instability of the gradient comes from its discontinuity during trilinear interpolation, and propose to use the interpolated gradient instead of the original analytical gradient to eliminate the discontinuity. Based on gradient interpolation, we propose VoxNeuS, a lightweight surface reconstruction method for computationally and memory efficient neural surface reconstruction. Thanks to the explicit representation, the gradients of the regularization terms, i.e., the Eikonal and curvature losses, are directly solved, avoiding computation and memory-access overhead. Further, VoxNeuS adopts a geometry-radiance disentangled architecture to handle geometry deformation caused by radiance optimization. The experimental results show that VoxNeuS achieves better reconstruction quality than previous works. The entire training process takes 15 minutes and less than 3 GB of memory on a single 2080 Ti GPU.
What: VoxNeuS is an efficient and lightweight neural surface reconstruction method that enhances voxel-based neural surface reconstruction by using interpolated gradients.
Why: Existing voxel-based methods for neural surface reconstruction suffer from gradient instability during trilinear interpolation, leading to slow convergence and poor surface smoothness. Additionally, the entanglement of geometry and radiance optimization can cause artifacts.
How: The core of VoxNeuS is the replacement of analytical SDF gradients with interpolated gradients, ensuring gradient continuity without computational overhead. Additionally, it uses a geometry-radiance disentangled architecture, directly applies SDF regularization on vertices, and employs progressive super-resolution of the SDF grid.
Result: Achieves lower Chamfer Distance on DTU than Voxurf and NeuS2 without requiring foreground masks. Significantly faster than previous methods, completing training in 15 minutes on a single 2080 Ti GPU. More memory efficient, requiring less than 3 GB of memory during training.
Limitations/Future Work: Relies on an independent color network (e.g., hash grid), which could be further optimized. The disentangled architecture may require more iterations to converge compared to entangled approaches.
Tags: neural surface reconstruction, voxel-based representation, gradient interpolation, geometry-radiance disentanglement, efficient 3d reconstruction

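Gradient interpolation can be prototyped by building a gradient grid with finite differences and then trilinearly interpolating that grid at query points, instead of differentiating the trilinear interpolation analytically. A minimal sketch follows; the finite-difference stencil, normalization, and coordinate conventions are simplifications, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def interpolated_sdf_gradient(sdf_grid: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Query smooth SDF gradients by interpolating a precomputed gradient grid.

    sdf_grid: (1, 1, D, H, W) voxelized SDF values.
    points:   (N, 3) query points in [-1, 1]^3, ordered (x, y, z) as grid_sample expects.
    """
    # central finite differences along each axis -> a (1, 3, D, H, W) gradient grid
    gx = torch.gradient(sdf_grid, dim=4)[0]
    gy = torch.gradient(sdf_grid, dim=3)[0]
    gz = torch.gradient(sdf_grid, dim=2)[0]
    grad_grid = torch.cat([gx, gy, gz], dim=1)
    # trilinear interpolation of the gradient grid (continuous across voxel boundaries)
    coords = points.view(1, 1, 1, -1, 3)                   # (1, 1, 1, N, 3)
    g = F.grid_sample(grad_grid, coords, mode="bilinear", align_corners=True)
    return g.view(3, -1).t()                               # (N, 3) interpolated gradients

sdf = torch.randn(1, 1, 64, 64, 64)
normals = interpolated_sdf_gradient(sdf, torch.rand(1024, 3) * 2 - 1)
print(normals.shape)    # torch.Size([1024, 3])
```
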
2406.07163 | FaceGPT: Self-supervised Learning to Chat about 3D Human Faces
Authors: Haoran Wang, Mohit Mendiratta, Christian Theobalt, Adam Kortylewski
Abstract: We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of the LLM is projected into 3DMM parameters and subsequently rendered as a 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding of 3D human faces, while preserving the capacity to understand general user instructions. Our experiments demonstrate that FaceGPT not only achieves high-quality 3D face reconstructions but also retains the ability for general-purpose visual instruction following. Furthermore, FaceGPT learns fully self-supervised to generate 3D faces based on complex textual inputs, which opens a new direction in human face analysis.
What: Introduces FaceGPT, a self-supervised framework allowing Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text using a 3D Morphable Model (3DMM) embedded within the VLM's token space.
Why: Traditional 3D face reconstruction methods lack semantic reasoning, limiting their ability to understand faces from textual descriptions the way humans can. FaceGPT bridges this gap by enabling VLMs to process both visual and textual information for 3D face understanding.
How: The framework integrates a 3DMM into a VLM, enabling the generation of 3D faces from both text and images. It leverages a self-supervised, model-based autoencoder trained on in-the-wild images, using a differentiable renderer to reconstruct and learn from 2D face images, eliminating the need for expensive 3D annotations.
Result: Achieves high-quality 3D face reconstructions comparable to specialized methods while retaining general visual instruction following abilities. Demonstrates the ability to generate 3D faces from complex textual descriptions, opening new avenues in human face analysis. Exhibits strong performance in traditional 3D face reconstruction, visual instruction following, and text-based 3D face reconstruction tasks.
Limitations/Future Work: Currently does not match the state-of-the-art performance of specialized 3D face reconstruction methods. The approach is specific to faces and requires a pre-existing 3D morphable model, limiting generalization to other objects.
Tags: 3d face reconstruction, vision-language models, self-supervised learning, 3d morphable model, text-to-3d face generation

2406.07008 | Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models
Authors: Sooyeon Go, Kyungmook Choi, Minjung Shin, Youngjung Uh
Abstract: As pretrained text-to-image diffusion models have become a useful tool for image synthesis, people want to specify the results in various ways. In this paper, we introduce a method to produce results with the same structure as a target image but painted with colors from a reference image, i.e., appearance transfer, especially following the semantic correspondence between the result and the reference. For example, the result wing takes color from the reference wing, not the reference head. Existing methods rely on the query-key similarity within the self-attention layer, usually producing defective results. To this end, we propose to find semantic correspondences and explicitly rearrange the features according to the semantic correspondences. Extensive experiments show the superiority of our method in various aspects: preserving the structure of the target and reflecting the color from the reference according to the semantic correspondences, even when the two images are not aligned.
What: This paper introduces a training-free method for appearance transfer in text-to-image diffusion models, which transfers local appearances from a reference image to a target image based on their semantic correspondences.
Why: Existing methods often fail to accurately transfer appearance between unaligned images or those with complex patterns because they rely on query-key similarity within self-attention layers, which does not guarantee semantic correspondence.
How: The method finds semantic correspondences between features of the target and reference images, rearranges the reference features accordingly, and injects them into the target features during the denoising process. It utilizes image-level segmentation masks to confine the correspondence within the region of interest and applies AdaIN to minimize color discrepancies.
Result: The method successfully transfers complex color patterns while preserving the target image's structure, even when the target and reference images are not aligned. It outperforms existing methods in preserving the structure of the target image, as measured by IoU between object masks. The method is robust to challenging cases, such as cross-category and cross-style appearance transfer, and can handle multiple objects with different appearances.
Limitations/Future Work: The method relies on the performance of the inversion model and may struggle if the inversion is inaccurate. It may not always find accurate semantic correspondences when the reference image lacks semantically matching parts with the target image.
Tags: appearance transfer, diffusion models, semantic correspondence, image editing, text-to-image synthesis

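The rearrangement step is essentially a nearest-neighbour match in feature space: for every target location, find the most similar reference location by cosine similarity and copy that reference feature over. A minimal sketch follows; the segmentation masks and AdaIN color alignment described above are omitted, and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def rearrange_by_correspondence(target_feat: torch.Tensor,
                                reference_feat: torch.Tensor) -> torch.Tensor:
    """For each target position, pull the best-matching reference feature.

    Both inputs are (C, H, W) feature maps from the same diffusion layer.
    Returns a (C, H, W) map of reference features laid out in target order.
    """
    c, h, w = target_feat.shape
    t = F.normalize(target_feat.reshape(c, -1), dim=0)        # (C, HW), unit feature vectors
    r = F.normalize(reference_feat.reshape(c, -1), dim=0)     # (C, HW)
    similarity = t.t() @ r                                     # (HW_target, HW_reference) cosine sims
    match = similarity.argmax(dim=1)                           # best reference index per target pos
    rearranged = reference_feat.reshape(c, -1)[:, match]       # gather the matched features
    return rearranged.reshape(c, h, w)

out = rearrange_by_correspondence(torch.randn(320, 32, 32), torch.randn(320, 32, 32))
print(out.shape)    # torch.Size([320, 32, 32])
```
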
2406.06973 | RWKV-CLIP: A Robust Vision-Language Representation Learner
Authors: Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
Abstract: Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner; it achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP
What: This paper explores CLIP from data and model architecture perspectives, proposing a diverse description generation framework using LLMs and introducing RWKV-CLIP, an RWKV-driven vision-language representation learning model.
Why: This work addresses the challenges of noisy data in large-scale image-text pairs and the limitations of Transformers in processing high-resolution images and long sequences.
How: The authors develop a diverse description generation framework leveraging LLMs to synthesize and refine information from various sources. They also propose RWKV-CLIP, which combines the parallel training of Transformers with the efficient inference of RNNs.
Result: RWKV-CLIP achieves state-of-the-art performance in linear probe evaluation, surpassing previous models. It significantly outperforms existing methods in zero-shot image-text retrieval on Flickr30k and MSCOCO. The model demonstrates robustness and effectiveness in zero-shot classification across 11 datasets.
Limitations/Future Work: The paper notes potential limitations in prompt template constraints affecting zero-shot classification. Future work could explore further compatibility improvements between RWKV and Transformer architectures.
Tags: vision-language representation learning, contrastive language-image pre-training (clip), rwkv, diverse description generation, zero-shot learning

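Whatever the encoder backbone (Transformer or RWKV), the contrastive training signal is the standard CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings, sketched below; the embedding dimensions and temperature are placeholder values.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (B, D) image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(image_emb.shape[0], device=image_emb.device)
    # matched pairs sit on the diagonal; contrast against every other pair in the batch
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```
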
2406.06911 Report AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at https://github.com/czg1225/AsyncDiff. This paper proposes AsyncDiff, a universal and plug-and-play distributed acceleration scheme for diffusion models, enabling model parallelism across multiple devices. Diffusion models have high inference latency due to their multi-step sequential denoising process, hindering their widespread application. AsyncDiff divides the denoising model into components, each assigned to a different device. By exploiting hidden state similarity between consecutive steps, it transforms sequential denoising into an asynchronous process, allowing parallel computation. Achieves up to 4.0x speedup on Stable Diffusion v2.1 with minimal quality degradation on four NVIDIA A5000 GPUs. Demonstrates effectiveness on both text-to-image and video diffusion models, significantly reducing latency while preserving quality. Outperforms existing parallel acceleration methods in terms of speed, quality, and resource efficiency. Performance may be sub-optimal with limited communication bandwidth between devices. Relies on pre-trained diffusion models, limiting quality improvements if the baseline model is inadequate. diffusion models, model parallelism, asynchronous denoising, inference acceleration, distributed computing
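The scheduling idea can be illustrated with a toy, single-process simulation: split the denoiser into sequential chunks and let chunk k at step t consume the activation chunk k-1 produced at the previous step, so the chunks no longer depend on each other within a step and could run concurrently. Everything below (the chunking, the placeholder update rule, the cache layout) is a conceptual sketch, not the paper's distributed implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a noise-prediction network split into sequential chunks.
# In AsyncDiff each chunk would sit on its own GPU; this single-process
# simulation only illustrates the asynchronous schedule.
class ChunkedDenoiser(nn.Module):
    def __init__(self, dim=64, n_chunks=4):
        super().__init__()
        self.chunks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.SiLU()) for _ in range(n_chunks)
        )

def async_denoise(model, x_T, n_steps=10):
    """Chunk k at step t consumes the activation chunk k-1 produced at the
    previous step (one step stale), relying on hidden states changing slowly
    between consecutive diffusion steps."""
    x = x_T
    cache, h = [], x
    for chunk in model.chunks:            # one ordinary pass to warm up the cache
        cache.append(h)
        h = chunk(h)
    for _ in range(n_steps):
        outs = [chunk(c) for chunk, c in zip(model.chunks, cache)]  # no intra-step deps
        eps = outs[-1]                    # last chunk's output = noise estimate
        x = x - 0.1 * eps                 # placeholder update, not a real sampler step
        cache = [x] + outs[:-1]           # shift activations forward for the next step
    return x

model = ChunkedDenoiser()
print(async_denoise(model, torch.randn(1, 64)).shape)  # torch.Size([1, 64])
```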
2406.06890 Report Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to the limited visual quality in public video datasets. This affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while improving frame appearance using abundant high-quality image data. We propose motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM includes a video consistency model that distills motion from the video teacher model, and an image discriminator that enhances frame appearance to match high-quality image data. This combination presents two challenges: (1) conflicting frame learning objectives, as video distillation learns from low-quality video frames while the image discriminator targets high-quality images; and (2) training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that our MCM achieves the state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic scores or specific styles without corresponding video data. Proposes MCM, a single-stage video diffusion distillation method that accelerates sampling and leverages an optional high-quality image dataset to enhance generated video frame quality. Existing video diffusion models often suffer from unsatisfactory frame quality due to limitations in publicly available video datasets. This hinders both teacher and student model performance. Combines a video Latent Consistency Model (LCM) for motion distillation with an image discriminator for appearance enhancement, addressing conflicting learning objectives and training-inference discrepancies through disentangled motion distillation and mixed trajectory distillation. Significantly improves video diffusion distillation performance compared to previous state-of-the-art methods. Demonstrates superior adaptability to different image dataset distributions, resulting in higher fidelity and aesthetically pleasing video frames. Effectively mitigates training-inference discrepancies through simulating inference-time ODE trajectories and mixing them with real video data during training. Model performance is sensitive to the quality, diversity, and distribution of training data. Potential for misuse in creating deepfakes necessitates responsible deployment strategies. video diffusion, diffusion distillation, frame quality enhancement, text-to-video generation, motion consistency
2406.06820 Report Adapters Strike Back Jan-Martin O. Steitz, Stefan Roth Adapters provide an efficient and lightweight mechanism for adapting trained transformer models to a variety of different tasks. However, they have often been found to be outperformed by other adaptation mechanisms, including low-rank adaptation. In this paper, we provide an in-depth study of adapters, their internal structure, as well as various implementation choices. We uncover pitfalls for using adapters and suggest a concrete, improved adapter architecture, called Adapter+, that not only outperforms previous adapter implementations but surpasses a number of other, more complex adaptation mechanisms in several challenging settings. Despite this, our suggested adapter is highly robust and, unlike previous work, requires little to no manual intervention when addressing a novel scenario. Adapter+ reaches state-of-the-art average accuracy on the VTAB benchmark, even without a per-task hyperparameter optimization. This paper presents Adapter+, an improved adapter configuration for adapting vision transformers (ViTs) for downstream tasks, showing that adapters can outperform other parameter-efficient fine-tuning methods. Fine-tuning large ViTs on multiple downstream tasks requires significant storage and risks overfitting on small datasets. Parameter-efficient tuning methods address these issues, and Adapter+ offers an optimal solution. The study investigates the impact of adapter position, inner structure (normalization, scaling, initialization), and pre-processing on ViT adaptation using VTAB and FGVC benchmarks. The Post-Adapter position, with channel-wise scaling and Houlsby initialization, proves to be the most effective adapter configuration. Adapter+ achieves state-of-the-art average accuracy on VTAB (77.6%) without per-task hyperparameter tuning and on FGVC (90.7%). Adapter+ demonstrates superior parameter-accuracy trade-off and robustness to domain shifts compared to LoRA, VPT, SSF, FacT, and other methods. The study primarily focuses on a ViT-B/16 architecture. Future work could explore Adapter+'s performance on larger ViT models and with different pre-training strategies. vision transformer, transfer learning, parameter-efficient fine-tuning, adapter, vtab
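A minimal sketch of a bottleneck adapter with channel-wise scaling placed after a frozen transformer block (one reading of the "post" position); the bottleneck width and the near-identity initialization are placeholders rather than the exact Adapter+/Houlsby settings.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with learnable channel-wise scaling."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        self.scale = nn.Parameter(torch.ones(dim))      # channel-wise scaling
        nn.init.zeros_(self.up.weight)                  # adapter starts near identity (stand-in init)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                               # x: (B, N, dim) tokens
        return x + self.scale * self.up(self.act(self.down(x)))

class BlockWithPostAdapter(nn.Module):
    """'Post' placement: the adapter refines the output of a (frozen) transformer
    block; only the adapter parameters are trained."""
    def __init__(self, block, dim):
        super().__init__()
        self.block = block
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

# toy usage with a stand-in for a pretrained ViT block
block = nn.Sequential(nn.LayerNorm(768), nn.Linear(768, 768))
wrapped = BlockWithPostAdapter(block, dim=768)
print(wrapped(torch.randn(2, 197, 768)).shape)  # torch.Size([2, 197, 768])
```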
2406.06730 Report TRINS: Towards Multimodal Language Models that Can Read Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to the limitation of training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of the multimodal large language model. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that the number of words per annotation in TRINS is significantly longer than that of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called a Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset, as well as other classical benchmarks. Lastly, we conducted a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness. This paper introduces TRINS, a text-rich image instruction dataset, to improve multimodal language models' ability to understand and reason about text within images. Existing visually-tuned models struggle to comprehend text in images due to limitations in training data, hindering their ability to understand documents, posters, etc., and limiting human-agent collaboration. TRINS is built using a semi-automatic approach, leveraging CLIP and GPT-4 for annotation, resulting in 39k+ text-rich images with captions and 100k+ question-answer pairs. The authors also introduce LaRA, a language-vision reading assistant model. TRINS annotations are significantly more detailed than existing datasets, leading to improved performance in text-rich image understanding tasks. LaRA, fine-tuned on TRINS, outperforms state-of-the-art models on text-rich image understanding, demonstrating the dataset's effectiveness. Fine-tuning on TRINS does not degrade performance on general visual tasks, suggesting a broader benefit to multimodal understanding. The ability to extract text from images, while improved by OCR integration, remains a limitation for LaRA. Generating images with extensive text remains challenging for existing text-to-image models, necessitating further research in text rendering. multimodal learning, computer vision, natural language processing, dataset, text recognition
2406.06527 Report IllumiNeRF: 3D Relighting without Inverse Rendering Xiaoming Zhao, Pratul P. Srinivasan, Dor Verbin, Keunhong Park, Ricardo Martin Brualla, Philipp Henzler Existing methods for relightable view synthesis -- using a set of images of an object under unknown lighting to recover a 3D representation that can be rendered from novel viewpoints under a target illumination -- are based on inverse rendering, and attempt to disentangle the object geometry, materials, and lighting that explain the input images. Furthermore, this typically involves optimization through differentiable Monte Carlo rendering, which is brittle and computationally-expensive. In this work, we propose a simpler approach: we first relight each input image using an image diffusion model conditioned on lighting and then reconstruct a Neural Radiance Field (NeRF) with these relit images, from which we render novel views under the target lighting. We demonstrate that this strategy is surprisingly competitive and achieves state-of-the-art results on multiple relighting benchmarks. Please see our project page at https://illuminerf.github.io/. This paper introduces a novel method for relightable 3D reconstruction that leverages a 2D Relighting Diffusion Model (RDM) and a latent NeRF model, departing from conventional inverse rendering techniques. Existing inverse rendering based methods for relightable 3D reconstruction are computationally expensive, brittle, and often produce implausible results under novel illumination. The proposed method first generates a set of plausible relit images from different viewpoints using a RDM conditioned on target lighting. These images are then used to train a latent NeRF model that learns a consistent 3D representation for novel view synthesis under the target lighting. Outperforms state-of-the-art inverse rendering methods on the synthetic TensoIR benchmark. Achieves competitive results on the real-world Stanford-ORB benchmark. Demonstrates the effectiveness of using a latent NeRF model to reconcile multiple plausible relighting solutions from the RDM. Relies on high-quality geometry estimated from input views, which can affect the accuracy of relighting, especially for specular reflections. Not suitable for real-time relighting due to the need for generating new samples with the RDM and optimizing a NeRF for each target lighting condition. relightable view synthesis, diffusion models, neural radiance fields, inverse rendering, 3d reconstruction
2406.06523 Report NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing Ting-Hsuan Chen, Jiewen Chan, Hau-Shiang Shiu, Shih-Han Yen, Chang-Han Yeh, Yu-Lun Liu We propose a video editing framework, NaRCan, which integrates a hybrid deformation field and diffusion prior to generate high-quality natural canonical images to represent the input video. Our approach utilizes homography to model global motion and employs multi-layer perceptrons (MLPs) to capture local residual deformations, enhancing the model's ability to handle complex video dynamics. By introducing a diffusion prior from the early stages of training, our model ensures that the generated images retain a high-quality natural appearance, making the produced canonical images suitable for various downstream tasks in video editing, a capability not achieved by current canonical-based methods. Furthermore, we incorporate low-rank adaptation (LoRA) fine-tuning and introduce a noise and diffusion prior update scheduling technique that accelerates the training process by 14 times. Extensive experimental results show that our method outperforms existing approaches in various video editing tasks and produces coherent and high-quality edited video sequences. See our project page for video results at https://koi953215.github.io/NaRCan_page/. NaRCan: a novel video editing framework that generates high-quality natural canonical images by integrating a hybrid deformation field and diffusion prior. Existing canonical-based video editing methods often produce unnatural or distorted canonical images, hindering their application in downstream tasks like text-guided editing. This work addresses this limitation by ensuring the generation of high-quality, natural canonical images. The method uses a hybrid deformation field combining homography and residual MLP to model video dynamics. It incorporates a diffusion prior from a LoRA fine-tuned diffusion model to enhance the naturalness of the generated canonical image. A noise and diffusion prior update scheduling technique accelerates the training process. NaRCan outperforms existing methods in generating natural canonical images, especially in scenes with complex motion. The method demonstrates superior performance in text-guided video-to-video translation, achieving better prompt alignment, synthesis quality, and temporal consistency. NaRCan effectively handles downstream tasks such as adding handwritten characters and dynamic video segmentation, benefiting from the high quality of its generated canonical images. LoRA fine-tuning for adapting the diffusion model to specific scenes is time-consuming. In scenarios with extreme video scene changes, the diffusion prior may not always guarantee a high-quality natural canonical image. video editing, canonical image, diffusion model, lora, hybrid deformation field
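A rough sketch of such a hybrid deformation field: a learnable per-frame homography models global motion and a small MLP adds a local residual when mapping a frame coordinate to the canonical image. The parameterization, sizes, and absence of positional encoding are simplifications, not NaRCan's exact design.

```python
import torch
import torch.nn as nn

class HybridDeformation(nn.Module):
    """Maps a pixel coordinate in frame t to a coordinate in the canonical image."""
    def __init__(self, n_frames, hidden=128):
        super().__init__()
        # per-frame 3x3 homographies, initialized to the identity
        self.H = nn.Parameter(torch.eye(3).repeat(n_frames, 1, 1))
        self.residual = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, xy, t):
        """xy: (N, 2) pixel coords in [-1, 1]; t: (N,) frame indices."""
        ones = torch.ones_like(xy[:, :1])
        homog = torch.cat([xy, ones], dim=1)                    # homogeneous coords (N, 3)
        warped = torch.einsum('nij,nj->ni', self.H[t], homog)   # apply per-frame homography
        warped = warped[:, :2] / warped[:, 2:].clamp(min=1e-6)
        t_norm = t.float().unsqueeze(1) / max(len(self.H) - 1, 1)
        delta = self.residual(torch.cat([warped, t_norm], dim=1))  # local residual deformation
        return warped + delta                                    # canonical-space coordinate

# toy usage
field = HybridDeformation(n_frames=60)
xy = torch.rand(1024, 2) * 2 - 1
t = torch.randint(0, 60, (1024,))
print(field(xy, t).shape)  # torch.Size([1024, 2])
```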
2406.06465 Report AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation. Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task. However, they struggle with frame consistency and temporal stability primarily due to the limited scale of video datasets. We observe that pretrained Image2Video diffusion models possess good priors for video dynamics but they lack textual control. Hence, transferring Image2Video models to leverage their video dynamic priors while injecting instruction control to generate controllable videos is both a meaningful and challenging task. To achieve this, we introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions. More specifically, we design a dual query transformer (DQFormer) architecture, which integrates the instructions and frames into the conditional embeddings for future frame prediction. Additionally, we develop Long-Short Term Temporal Adapters and Spatial Adapters that can quickly transfer general video diffusion models to specific scenarios with minimal training costs. Experimental results show that our method significantly outperforms state-of-the-art techniques on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. Notably, AID achieves 91.2% and 55.5% FVD improvements on Bridge and SSv2 respectively, demonstrating its effectiveness in various domains. More examples can be found at our website https://chenhsing.github.io/AID. This paper proposes AID, a novel approach that adapts a pretrained Image2Video diffusion model for text-guided video prediction by incorporating a Multi-Modal Large Language Model (MLLM) and a Dual Query Transformer (DQFormer) to effectively integrate textual and visual conditions. Existing text-guided video prediction models often struggle with frame consistency and temporal stability due to limitations in video dataset size. Leveraging pretrained Image2Video models with inherent video dynamic priors offers a promising solution. The study utilizes a pretrained SVD model as the foundation and introduces MLLM to predict video states from text instructions and initial frames. A DQFormer architecture is designed to integrate these multimodal conditions. Additionally, spatial and temporal adapters are employed for efficient model transfer to specific video prediction tasks. AID significantly outperforms state-of-the-art methods in text-guided video prediction across various datasets, including Something Something V2, Bridge Data, and Epic Kitchen-100. The approach demonstrates superior performance in capturing video dynamics and adhering to textual instructions, leading to more coherent and contextually accurate video predictions. Ablation studies confirm the effectiveness of individual components such as DQFormer, MLLM-aided prompting, and the use of spatial and temporal adapters. The current study primarily focuses on short-term video prediction, exploring longer-term prediction is an area for future work. While the method effectively transfers to specific domains, investigating its generalization capability to entirely new and unseen scenarios is crucial. 
text-guided video prediction, video diffusion models, multimodal large language models, dqformer, video generation
2406.06424 Report Margin-aware Preference Optimization for Aligning Diffusion Models without Reference Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, Jongheon Jeong Modern alignment techniques based on human preferences, such as RLHF and DPO, typically employ divergence regularization relative to the reference model to ensure training stability. However, this often limits the flexibility of models during alignment, especially when there is a clear distributional discrepancy between the preference data and the reference model. In this paper, we focus on the alignment of recent text-to-image diffusion models, such as Stable Diffusion XL (SDXL), and find that this "reference mismatch" is indeed a significant problem in aligning these models due to the unstructured nature of visual modalities: e.g., a preference for a particular stylistic aspect can easily induce such a discrepancy. Motivated by this observation, we propose a novel and memory-friendly preference alignment method for diffusion models that does not depend on any reference model, coined margin-aware preference optimization (MaPO). MaPO jointly maximizes the likelihood margin between the preferred and dispreferred image sets and the likelihood of the preferred sets, simultaneously learning general stylistic features and preferences. For evaluation, we introduce two new pairwise preference datasets, which comprise self-generated image pairs from SDXL, Pick-Style and Pick-Safety, simulating diverse scenarios of reference mismatch. Our experiments validate that MaPO can significantly improve alignment on Pick-Style and Pick-Safety and general preference alignment when used with Pick-a-Pic v2, surpassing the base SDXL and other existing methods. Our code, models, and datasets are publicly available via https://mapo-t2i.github.io This paper proposes MaPO, a novel and memory-friendly preference alignment method for diffusion models, which eliminates the dependence on a reference model and addresses the issue of reference mismatch in existing alignment techniques. Reference mismatch, a distributional discrepancy between preference data and the reference model, limits the flexibility of current alignment methods for text-to-image diffusion models, especially in aligning stylistic features. MaPO jointly maximizes the likelihood margin between preferred and dispreferred image sets while maximizing the likelihood of preferred sets, effectively learning stylistic features and preferences simultaneously without relying on a reference model. The authors introduce two new pairwise preference datasets, Pick-Style and Pick-Safety, to evaluate alignment under different reference mismatch scenarios. MaPO effectively adapts the text-to-image diffusion model to desired styles and aligns it with human preferences, outperforming reference-model-based methods on Pick-Style and Pick-Safety datasets. MaPO demonstrates superior performance in general preference alignment on Pick-a-Pic v2, surpassing 21 out of 25 state-of-the-art models in the Imgsys public benchmark. MaPO exhibits computational efficiency, consuming 14.5% less training time compared to Diffusion-DPO, and enables larger batch sizes due to lower memory usage. The method and datasets might inherit biases present in the original SDXL checkpoint used for fine-tuning and curation. While MaPO demonstrates effectiveness in mitigating unsafe content, it doesn't guarantee perfect screening, and user discretion is advised. Further investigation is needed to explore scenarios with different levels of reference mismatch. text-to-image generation, diffusion models, preference optimization, alignment, reference mismatch
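A reference-free margin objective of this flavor can be sketched generically as a margin term between per-sample denoising errors of preferred and dispreferred images plus a fitting term on the preferred set; the exact MaPO formulation, weighting, and likelihood surrogate are not reproduced here, so treat the code below purely as an illustration of the idea.

```python
import torch
import torch.nn.functional as F

def margin_preference_loss(err_preferred, err_dispreferred, beta=1.0, gamma=1.0):
    """Sketch of a reference-free, margin-style preference objective for diffusion
    models. `err_*` are per-sample denoising errors (e.g. MSE between predicted
    and true noise); lower error serves as a surrogate for higher likelihood."""
    logp_w = -err_preferred                                       # surrogate log-likelihoods
    logp_l = -err_dispreferred
    margin_term = -F.logsigmoid(beta * (logp_w - logp_l)).mean()  # widen the preferred/dispreferred margin
    likelihood_term = err_preferred.mean()                        # keep fitting the preferred set
    return margin_term + gamma * likelihood_term

# toy usage with fake per-sample denoising errors
err_w = torch.rand(16, requires_grad=True)
err_l = torch.rand(16) + 0.5
loss = margin_preference_loss(err_w, err_l)
loss.backward()
print(loss.item())
```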
2406.06382 Report Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization Yi Gu, Zhendong Wang, Yueqin Yin, Yujia Xie, Mingyuan Zhou Aligning large language models with human preferences has emerged as a critical focus in language modeling research. Yet, integrating preference learning into Text-to-Image (T2I) generative models is still relatively uncharted territory. The Diffusion-DPO technique made initial strides by employing pairwise preference learning in diffusion models tailored for specific text prompts. We introduce Diffusion-RPO, a new method designed to align diffusion-based T2I models with human preferences more effectively. This approach leverages both prompt-image pairs with identical prompts and those with semantically related content across various modalities. Furthermore, we have developed a new evaluation metric, style alignment, aimed at overcoming the challenges of high costs, low reproducibility, and limited interpretability prevalent in current evaluations of human preference alignment. Our findings demonstrate that Diffusion-RPO outperforms established methods such as Supervised Fine-Tuning and Diffusion-DPO in tuning Stable Diffusion versions 1.5 and XL-1.0, achieving superior results in both automated evaluations of human preferences and style alignment. Our code is available at https://github.com/yigu1008/Diffusion-RPO This paper presents Diffusion-RPO, a novel approach for aligning Text-to-Image (T2I) models with human preferences by leveraging semantically related prompt-image pairs through contrastive weighting during the diffusion model sampling process. Aligning T2I models with human preferences is crucial for generating images that better meet user expectations and artistic intentions. Diffusion-RPO leverages both identical and semantically related prompt-image pairs to optimize the diffusion model's sampling steps. It employs contrastive weighting based on the similarity of prompts and images, measured using CLIP embeddings. Diffusion-RPO outperforms existing preference learning baselines (Diffusion-DPO, SFT) in aligning Stable Diffusion 1.5 and SDXL models with human preferences, as evidenced by higher scores on established reward models (HPS, Pick Score). The paper introduces Style Alignment, a new evaluation task for image preference learning, and demonstrates that Diffusion-RPO excels in this task by effectively fine-tuning models to generate images consistent with specific artistic styles (Van Gogh, Sketch, Winter). Ablation studies reveal the importance of the distance temperature parameter in balancing the focus on identical versus semantically related prompt-image pairs during optimization. The dataset used for training, while extensive, may not fully encapsulate the diverse spectrum of human preferences across all cultures and communities, potentially limiting the model's generalizability. Future research could explore methods for collecting preference datasets that better represent a wider range of cultural backgrounds and artistic styles, leading to more inclusive and universally appealing T2I models. text-to-image synthesis, diffusion models, preference learning, style alignment, human-computer interaction
2406.06367 Report MVGamba: Unify 3D Content Generation as State Space Sequence Modeling Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to the compromise of multi-view information propagation in favor of adopting powerful yet computationally intensive architectures (\eg, Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for cross-view self-refinement while generating a long sequence of Gaussians for fine-detail modeling with linear complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only $0.1\times$ of the model size. MVGamba is a unified 3D generation framework that leverages a novel multi-view Gaussian reconstructor based on RNN-like State Space Models (SSM) to achieve high-quality 3D content generation with low computational cost. Existing Gaussian reconstruction models for 3D generation often compromise multi-view information propagation for computational efficiency, leading to multi-view inconsistency and blurred textures. MVGamba addresses this by efficiently integrating multi-view information while allowing for the generation of long sequences of Gaussians for detailed modeling. MVGamba uses a two-stage pipeline: 1) Off-the-shelf multi-view diffusion models generate multi-view images from a single image or text prompt. 2) An SSM-based multi-view reconstructor processes these images causally, expanding them into long sequences of Gaussian tokens and refining them across views. A lightweight Gaussian decoder then predicts the final Gaussian parameters for 3D content representation. MVGamba outperforms state-of-the-art baselines in image-to-3D, text-to-3D, and sparse-view reconstruction tasks. MVGamba demonstrates robustness to inconsistencies in multi-view input, effectively handling noisy or inconsistent images generated by diffusion models. The performance of MVGamba improves with increasing Gaussian sequence length, highlighting the benefit of its ability to model long sequences efficiently. The model's performance depends on the quality of input views generated by multi-view diffusion models, which still exhibit limitations. Incorrect depth estimation in the front view can sometimes lead to generation failures, requiring manual input order adjustment as a current workaround. 3d generation, gaussian splatting, state space models, multi-view reconstruction, diffusion models
2406.06258 Report Tuning-Free Visual Customization via View Iterative Self-Attention Control Xiaojie Li, Chenghao Gu, Shuzhao Xie, Yunpeng Bai, Weixiang Zhang, Zhi Wang Fine-Tuning Diffusion Models enable a wide range of personalized generation and editing applications on diverse visual modalities. While Low-Rank Adaptation (LoRA) accelerates the fine-tuning process, it still requires multiple reference images and time-consuming training, which constrains its scalability for large-scale and real-time applications. In this paper, we propose \textit{View Iterative Self-Attention Control (VisCtrl)} to tackle this challenge. Specifically, VisCtrl is a training-free method that injects the appearance and structure of a user-specified subject into another subject in the target image, unlike previous approaches that require fine-tuning the model. Initially, we obtain the initial noise for both the reference and target images through DDIM inversion. Then, during the denoising phase, features from the reference image are injected into the target image via the self-attention mechanism. Notably, by iteratively performing this feature injection process, we ensure that the reference image features are gradually integrated into the target image. This approach results in consistent and harmonious editing with only one reference image in a few denoising steps. Moreover, benefiting from our plug-and-play architecture design and the proposed Feature Gradual Sampling strategy for multi-view editing, our method can be easily extended to edit in complex visual domains. Extensive experiments show the efficacy of VisCtrl across a spectrum of tasks, including personalized editing of images, videos, and 3D scenes. This paper proposes View Iterative Self-Attention Control (VisCtrl), a training-free method for personalized visual editing using diffusion models. This method allows rapid personalized editing with only one reference image, overcoming limitations of existing model-based and attention-based methods that require extensive training or struggle with complex editing scenarios. VisCtrl uses DDIM inversion to obtain initial noise for both reference and target images. During denoising, it iteratively injects features from the reference image into the target image using self-attention, while preserving the target's structure using cross-attention. A Feature Gradually Sampling strategy is introduced for multi-view editing, enabling consistent feature injection across multiple frames or views. VisCtrl effectively personalizes images, videos, and 3D scenes with a single reference image. The method outperforms existing baselines in terms of subject fidelity, background preservation, and structural consistency, as demonstrated by quantitative metrics (CLIP-I, LPIPS, SSIM) and qualitative comparisons. Ablation studies confirm the benefits of Feature Gradually Sampling for multi-view editing and demonstrate control over the degree of subject personalization. The method's performance depends on the accuracy of the segmentation masks used to isolate objects for editing. Potential biases in the pre-trained diffusion model may influence the generated results, although VisCtrl is designed to mitigate bias introduction. diffusion models, personalized visual editing, self-attention, training-free, multi-view editing
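The injection step amounts to letting the target's queries attend over keys and values that include the reference image's features inside the UNet's self-attention layers; a minimal sketch follows, where keeping the target's own keys/values alongside the reference's is an assumed implementation choice.

```python
import torch

def inject_reference_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref, scale=None):
    """Self-attention step where target queries also attend to reference keys/values,
    so reference appearance flows into the target.
    q_tgt: (B, N_t, d); k_*/v_*: (B, N_*, d)."""
    k = torch.cat([k_tgt, k_ref], dim=1)
    v = torch.cat([v_tgt, v_ref], dim=1)
    scale = scale or q_tgt.shape[-1] ** -0.5
    attn = torch.softmax(q_tgt @ k.transpose(1, 2) * scale, dim=-1)  # (B, N_t, N_t + N_r)
    return attn @ v                                                   # (B, N_t, d)

# toy usage
B, N, d = 1, 256, 64
out = inject_reference_attention(
    torch.randn(B, N, d), torch.randn(B, N, d), torch.randn(B, N, d),
    torch.randn(B, N, d), torch.randn(B, N, d),
)
print(out.shape)  # torch.Size([1, 256, 64])
```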
2406.06216 Report Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis Xin Jin, Pengyi Jiao, Zheng-Peng Duan, Xingchao Yang, Chun-Le Guo, Bo Ren, Chongyi Li Volumetric rendering based methods, like NeRF, excel in HDR view synthesis from RAW images, especially for nighttime scenes. However, they suffer from long training times and cannot perform real-time rendering due to dense sampling requirements. The advent of 3D Gaussian Splatting (3DGS) enables real-time rendering and faster training. However, implementing RAW image-based view synthesis directly using 3DGS is challenging due to its inherent drawbacks: 1) in nighttime scenes, extremely low SNR leads to poor structure-from-motion (SfM) estimation in distant views; 2) the limited representation capacity of spherical harmonics (SH) function is unsuitable for RAW linear color space; and 3) inaccurate scene structure hampers downstream tasks such as refocusing. To address these issues, we propose LE3D (Lighting Every darkness with 3DGS). Our method proposes Cone Scatter Initialization to enrich the estimation of SfM, and replaces SH with a Color MLP to represent the RAW linear color space. Additionally, we introduce depth distortion and near-far regularizations to improve the accuracy of scene structure for downstream tasks. These designs enable LE3D to perform real-time novel view synthesis, HDR rendering, refocusing, and tone-mapping changes. Compared to previous volumetric rendering based methods, LE3D reduces training time to 1% and improves rendering speed by up to 4,000 times for 2K resolution images in terms of FPS. Code and viewer can be found at https://github.com/Srameo/LE3D. LE3D: a novel method for HDR 3D scene reconstruction from noisy RAW images enabling real-time rendering and editing. Existing HDR scene reconstruction methods, while effective, suffer from long training times and inability to render in real-time, limiting their practical applications. LE3D leverages 3D Gaussian Splatting (3DGS) and introduces: (1) Cone Scatter Initialization to improve SfM in low-light, (2) Color MLP to represent RAW linear color space, and (3) Depth distortion and near-far regularizations for better scene structure. Achieves comparable visual quality to state-of-the-art volumetric rendering methods like RawNeRF. Reduces training time to 1% of RawNeRF. Enables real-time rendering at speeds up to 4,000 times faster than RawNeRF for 2K resolution. Quantitative metrics on sRGB are slightly lower than RawNeRF, potentially due to sparser scene representation. Future work includes exploring alternative regularization techniques for further improving structural accuracy. hdr view synthesis, 3d gaussian splatting, real-time rendering, raw image processing, computational photography
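Replacing spherical harmonics with a color MLP can be sketched as a small network mapping a per-Gaussian feature plus the view direction to a non-negative, linear-space RGB value; the feature size and architecture below are illustrative assumptions, not LE3D's exact head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianColorMLP(nn.Module):
    """Tiny per-Gaussian color head for RAW/HDR linear color: takes a learned
    per-Gaussian feature and the viewing direction, outputs RGB >= 0."""
    def __init__(self, feat_dim=16, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, gaussian_feat, view_dir):
        """gaussian_feat: (N, feat_dim); view_dir: (N, 3) unit vectors."""
        x = torch.cat([gaussian_feat, view_dir], dim=-1)
        return F.softplus(self.mlp(x))   # linear-space HDR color, unbounded above

# toy usage for 1000 Gaussians
color_head = GaussianColorMLP()
rgb = color_head(torch.randn(1000, 16), F.normalize(torch.randn(1000, 3), dim=-1))
print(rgb.shape)  # torch.Size([1000, 3])
```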
2406.05871 Report OmniControlNet: Dual-stage Integration for Conditional Image Generation Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, Zhuowen Tu We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method and incorporating its individually trained image generation processes into a single model. Despite its tremendous success, the ControlNet of a two-stage pipeline bears limitations in being not self-contained (e.g. calls the external condition generation algorithms) with a large model redundancy (separately trained models for different types of conditioning inputs). Our proposed OmniControlNet consolidates 1) the condition generation (e.g., HED edges, depth maps, user scribble, and animal pose) by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance. OmniControlNet achieves significantly reduced model complexity and redundancy while capable of producing images of comparable quality for conditioned text-to-image generation. This paper introduces OmniControlNet, which integrates external condition generation algorithms into a single method and incorporates individually trained image generation processes into a single model. The standard ControlNet model suffers from large model redundancy, requiring separate models for different conditioning input types. This paper addresses this by creating a single, integrated model. OmniControlNet uses a multi-task dense image prediction model for generating various image conditions (e.g., edges, depth maps). It then integrates these into a single text-to-image generation model guided by textual inversion. OmniControlNet significantly reduces model complexity and redundancy compared to existing approaches. The model produces images of comparable quality to ControlNet for conditioned text-to-image generation. The multi-task dense image prediction component achieves competitive performance on benchmark datasets for depth and edge detection. Adding a new task condition requires training a new embedding for that task. The integrated stage 1 model increases training complexity and slightly reduces image generation quality compared to using separate expert models. text-to-image generation, controlnet, dense image prediction, textual inversion, model integration
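Consolidating the condition generators into one network can be sketched as a shared dense-prediction backbone modulated by a learned task embedding; the FiLM-style modulation and single-channel output below are assumptions for illustration rather than OmniControlNet's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskDensePredictor(nn.Module):
    """One shared backbone produces the conditioning map (edges, depth, scribble,
    pose, ...) for every task, selected by a learned task embedding."""
    def __init__(self, tasks=('hed', 'depth', 'scribble', 'pose'), ch=64):
        super().__init__()
        self.task_ids = {t: i for i, t in enumerate(tasks)}
        self.task_emb = nn.Embedding(len(tasks), 2 * ch)        # -> (scale, shift)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(ch, 1, 3, padding=1)              # single-channel condition map

    def forward(self, image, task):
        feat = self.backbone(image)                              # (B, ch, H, W)
        emb = self.task_emb(torch.tensor([self.task_ids[task]], device=image.device))
        scale, shift = emb.chunk(2, dim=-1)                      # (1, ch) each
        feat = feat * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.head(feat)

# toy usage
model = MultiTaskDensePredictor()
cond = model(torch.randn(2, 3, 128, 128), task='depth')
print(cond.shape)  # torch.Size([2, 1, 128, 128])
```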
2406.05835 Report Mamba YOLO: SSMs-Based YOLO For Object Detection Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu Propelled by the rapid advancement of deep learning technologies, the YOLO series has set a new benchmark for real-time object detectors. Researchers have continuously explored innovative applications of reparameterization, efficient layer aggregation networks, and anchor-free techniques on the foundation of YOLO. To further enhance detection performance, Transformer-based structures have been introduced, significantly expanding the model's receptive field and achieving notable performance gains. However, such improvements come at a cost, as the quadratic complexity of the self-attention mechanism increases the computational burden of the model. Fortunately, the emergence of State Space Models (SSM) as an innovative technology has effectively mitigated the issues caused by quadratic complexity. In light of these advancements, we introduce Mamba-YOLO a novel object detection model based on SSM. Mamba-YOLO not only optimizes the SSM foundation but also adapts specifically for object detection tasks. Given the potential limitations of SSM in sequence modeling, such as insufficient receptive field and weak image locality, we have designed the LSBlock and RGBlock. These modules enable more precise capture of local image dependencies and significantly enhance the robustness of the model. Extensive experimental results on the publicly available benchmark datasets COCO and VOC demonstrate that Mamba-YOLO surpasses the existing YOLO series models in both performance and competitiveness, showcasing its substantial potential and competitive edge.The PyTorch code is available at:\url{https://github.com/HZAI-ZJNU/Mamba-YOLO} Presents Mamba-YOLO, a novel object detection model based on State Space Models (SSM) that achieves a new performance baseline for YOLO-based detectors while maintaining real-time performance. Aims to address limitations of existing CNN and Transformer-based detectors by leveraging the strengths of SSMs for capturing global dependencies while effectively extracting local features. Introduces ODSSBlock, a core module integrating SSMs with novel LocalSpatial Block (LSBlock) and ResGated Block (RGBlock) to enhance local feature extraction and model robustness. Leverages VisionClue Merge to preserve visual information for SSM processing. Mamba-YOLO significantly outperforms existing YOLO series models in terms of accuracy and efficiency on COCO and VOC datasets. Mamba-YOLO-T achieves a 3.4% higher AP than the best performing tiny lightweight models while significantly reducing parameters and FLOPs. Ablation studies demonstrate the effectiveness of individual components, including ODSSBlock, LSBlock, and RGBlock, in enhancing detection performance. The model's performance on dense object detection tasks requires further investigation. Future work will explore the integration of advanced object detection heads and training strategies to further improve Mamba-YOLO's capabilities. object detection, state space models, yolo, real-time, computer vision
2406.05821 Report F-LMM: Grounding Frozen Large Multimodal Models Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing pronounced performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations -- a straightforward yet effective design based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention weights of well-trained LMMs. Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits, which a SAM-based mask refiner can further optimise. Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data, but achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving LMMs' original conversational ability. Additionally, with instruction-following ability preserved and grounding ability obtained, our F-LMM can perform visual chain-of-thought reasoning and better resist object hallucinations. This paper presents F-LMM, a novel method for grounding frozen large multimodal models (LMMs) in human-AI conversations, leveraging existing attention weights as segmentation priors to achieve competitive visual grounding without sacrificing the LMM's conversational abilities. Existing methods for grounding LMMs often lead to a decline in their general knowledge and instruction-following abilities, which are crucial for building effective general AI assistants. This paper aims to address this issue by proposing a method that preserves the LMM's original conversational capabilities while enabling visual grounding. F-LMM utilizes a mask head consisting of a CNN-based mask decoder and a SAM-based mask refiner. The mask decoder translates word-pixel attention weights from the frozen LMM into mask logits, and the mask refiner further optimizes these predictions using image and language cues. The model is trained on referring expression segmentation and panoptic narrative grounding datasets. F-LMM achieves competitive performance on both referring expression segmentation and phrase grounding benchmarks, indicating its effectiveness in visual grounding. Unlike existing grounding LMMs, F-LMM maintains the original LMM's excellence on general question-answering benchmarks, demonstrating its preserved conversational ability. F-LMM exhibits improved performance on visual chain-of-thought reasoning and resistance to object hallucinations, highlighting the potential of combining grounding and conversational abilities. The study is limited to LMMs with up to 8 billion parameters due to computational constraints. The paper primarily focuses on vision-language interactions and does not explore other modalities such as video or audio. large multimodal models, visual grounding, instruction following, conversational ai, visual chain-of-thought reasoning
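The mask head can be sketched as a small CNN that stacks the frozen LMM's word-to-image-patch attention maps for a grounded phrase and decodes them into mask logits; the SAM-based refinement stage and the exact choice of layers and heads are omitted, and all shapes below are assumptions.

```python
import torch
import torch.nn as nn

class AttnToMask(nn.Module):
    """Turns word-to-patch attention weights from a frozen LMM into mask logits
    with a few conv layers. Assumes a ViT-style grid of image patch tokens."""
    def __init__(self, n_layers_heads, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_layers_heads, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, attn, grid_hw):
        """attn: (L*H, T_phrase, N_patches) attention from L layers x H heads for the
        tokens of one grounded phrase. Returns (1, 1, h, w) mask logits."""
        h, w = grid_hw
        attn = attn.mean(dim=1)                 # average over the phrase's tokens
        attn = attn.reshape(1, -1, h, w)        # (1, L*H, h, w) stack of attention maps
        return self.net(attn)

# toy usage: 32 layer-head maps, a 3-token phrase, a 24x24 patch grid
decoder = AttnToMask(n_layers_heads=32)
logits = decoder(torch.rand(32, 3, 24 * 24), grid_hw=(24, 24))
print(logits.shape)  # torch.Size([1, 1, 24, 24])
```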
2406.05814 Report Unified Text-to-Image Generation and Retrieval Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, Tat-Seng Chua How humans can efficiently and effectively acquire images has always been a perennial question. A typical solution is text-to-image retrieval from an existing database given the text query; however, the limited database typically lacks creativity. By contrast, recent breakthroughs in text-to-image generation have made it possible to produce fancy and diverse visual content, but it faces challenges in synthesizing knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval and propose a unified framework in the context of Multimodal Large Language Models (MLLMs). Specifically, we first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner. Subsequently, we unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images as the response to the text query. Additionally, we construct a benchmark called TIGeR-Bench, including creative and knowledge-intensive domains, to standardize the evaluation of unified text-to-image generation and retrieval. Extensive experimental results on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority and effectiveness of our proposed method. This paper proposes TIGeR, a unified framework for text-to-image generation and retrieval within Multimodal Large Language Models (MLLMs). This unified approach aims to address the limitations of individual text-to-image generation (struggles with knowledge-intensive concepts) and retrieval (limited to existing databases) methods. The framework leverages MLLMs' intrinsic discriminative abilities for semantic matching, employing generative retrieval with forward beam search and reverse re-ranking. An autonomous decision mechanism selects between generated and retrieved images based on user prompts. TIGeR outperforms expert generation and retrieval models, as well as existing MLLMs, on the TIGeR-Bench, a newly constructed benchmark for unified text-to-image generation and retrieval. The proposed generative retrieval method achieves state-of-the-art results on Flickr30K and MS-COCO retrieval benchmarks, surpassing specially trained generative retrieval models. The study demonstrates the effectiveness of visual modality debiasing and the impact of forward beam search and reverse re-ranking on retrieval performance. The decision-making module exhibits a generation preference, potentially due to discrepancies between pre-training data and the TIGeR-Bench. Further investigation is needed to mitigate modality biases and explore the complex interplay between generation and retrieval within the TIGeR framework. text-to-image generation, text-to-image retrieval, multimodal large language models, generative retrieval, semantic matching
2406.05785 Report A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions Daizong Liu, Yang Liu, Wencan Huang, Wei Hu Text-guided 3D visual grounding (T-3DVG), which aims to locate a specific object that semantically corresponds to a language query from a complicated 3D scene, has drawn increasing attention in the 3D research community over the past few years. Compared to 2D visual grounding, this task presents great potential and challenges due to its closer proximity to the real world and the complexity of data collection and 3D point cloud source processing. In this survey, we attempt to provide a comprehensive overview of the T-3DVG progress, including its fundamental elements, recent research advances, and future research directions. To the best of our knowledge, this is the first systematic survey on the T-3DVG task. Specifically, we first provide a general structure of the T-3DVG pipeline with detailed components in a tutorial style, presenting a complete background overview. Then, we summarize the existing T-3DVG approaches into different categories and analyze their strengths and weaknesses. We also present the benchmark datasets and evaluation metrics to assess their performances. Finally, we discuss the potential limitations of existing T-3DVG and share some insights on several promising research directions. The latest papers are continually collected at https://github.com/liudaizong/Awesome-3D-Visual-Grounding. This paper presents the first comprehensive survey of text-guided 3D visual grounding (T-3DVG), covering fundamental elements, recent advances, and future directions. T-3DVG is crucial for multimedia intelligence research and real-world 3D applications like robotic navigation and human-computer interaction. It bridges the gap between language and 3D scenes, enabling retrieval of specific objects from complex point cloud data. The authors analyze existing T-3DVG methods, categorizing them based on their architectures (two-stage vs. one-stage) and learning paradigms (fully-supervised vs. weakly-supervised). They also discuss the use of additional modalities and large language models. Two-stage methods, while initially lagging, have seen performance improvements by incorporating text guidance to refine object locations. One-stage methods demonstrate efficiency but face challenges in capturing fine-grained spatial relations. Multi-modal approaches, leveraging 2D images or multi-view data, consistently outperform those relying solely on 3D point clouds. Current methods heavily rely on expensive annotations, hindering their scalability. There's a need to develop more practical T-3DVG settings, moving beyond single object grounding to handle dense object retrieval and grounding within groups of related scenes. text-guided 3d visual grounding, cross-modal reasoning, multimodal learning, 3d scene understanding, object retrieval
2406.05768 Report MLCM: Multistep Consistency Distillation of Latent Diffusion Model Qingsong Xie, Zhenyi Liao, Chen chen, Zhijie Deng, Shixiang Tang, Haonan Lu Distilling large latent diffusion models (LDMs) into ones that are fast to sample from is attracting growing research interest. However, the majority of existing methods face a dilemma where they either (i) depend on multiple individual distilled models for different sampling budgets, or (ii) sacrifice generation quality with limited (e.g., 2-4) and/or moderate (e.g., 5-8) sampling steps. To address these, we extend the recent multistep consistency distillation (MCD) strategy to representative LDMs, establishing the Multistep Latent Consistency Models (MLCMs) approach for low-cost high-quality image synthesis. MLCM serves as a unified model for various sampling steps due to the promise of MCD. We further augment MCD with a progressive training strategy to strengthen inter-segment consistency to boost the quality of few-step generations. We take the states from the sampling trajectories of the teacher model as training data for MLCMs to lift the requirements for high-quality training datasets and to bridge the gap between the training and inference of the distilled model. MLCM is compatible with preference learning strategies for further improvement of visual quality and aesthetic appeal. Empirically, MLCM can generate high-quality, delightful images with only 2-8 sampling steps. On the MSCOCO-2017 5K benchmark, MLCM distilled from SDXL gets a CLIP Score of 33.30, Aesthetic Score of 6.19, and Image Reward of 1.20 with only 4 steps, substantially surpassing 4-step LCM [23], 8-step SDXL-Lightning [17], and 8-step HyperSD [33]. We also demonstrate the versatility of MLCMs in applications including controllable generation, image style transfer, and Chinese-to-image generation. The paper proposes Multistep Latent Consistency Models (MLCMs), a novel method for accelerating text-to-image latent diffusion models, enabling high-quality image generation in just 2-8 sampling steps. Large latent diffusion models (LDMs) often suffer from slow inference speeds. Existing distillation methods for acceleration either rely on multiple models or compromise quality, particularly with few sampling steps. This work addresses these limitations for faster, higher-quality image generation. The method extends multistep consistency distillation (MCD) to LDMs, dividing the denoising trajectory into segments and enforcing consistency within each. It introduces progressive training for inter-segment consistency, utilizes samples from the teacher model for image-free training, and incorporates reward learning for improved human preference alignment. MLCM achieves state-of-the-art results with a CLIP Score of 33.30, Aesthetic Score of 6.19, and Image Reward of 1.20 in just 4 steps, surpassing competing baselines. The model exhibits consistent quality improvement with additional sampling steps. MLCM demonstrates versatility across applications like controllable generation, image stylization, and Chinese-to-image generation. Single-step generation quality using MLCM still holds potential for further improvement. The broader societal impact of accelerating image generation, including potential misuse for creating misleading or harmful content, requires careful consideration. image generation, latent diffusion models, model acceleration, consistency distillation, reward learning
2406.05766 Report Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Jiangbin Zheng, Kaicheng yu, Wanyu Chen, Stan Z. Li Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive performances. However, in various specialized fields, it is struggling to obtain sufficient alignment data for the training process, which seriously limits the use of previously elegant models. Thus, semi-supervised learning attempts to achieve multimodal alignment with fewer matched pairs but traditional methods like pseudo-labeling are difficult to apply in domains with no label information. To address these problems, we transform semi-supervised multimodal alignment into a manifold matching problem and propose a new method based on CLIP, named Gentle-CLIP. Specifically, we design a novel semantic density distribution loss to explore implicit semantic alignment information from unpaired multimodal data by constraining the latent representation distribution with fine granularity, thus eliminating the need for numerous strictly matched pairs. Meanwhile, we introduce multi-kernel maximum mean discrepancy as well as self-supervised contrastive loss to pull separate modality distributions closer and enhance the stability of the representation distribution. In addition, the contrastive loss used in CLIP is employed on the supervised matched data to prevent negative optimization. Extensive experiments conducted on a range of tasks in various fields, including protein, remote sensing, and the general vision-language field, demonstrate the effectiveness of our proposed Gentle-CLIP. The paper proposes Gentle-CLIP, a semi-supervised learning method for multimodal alignment based on CLIP, designed to address the challenge of limited alignment data in specialized fields by leveraging vast unmatched data. Many specialized fields struggle to obtain sufficient alignment data, limiting the effectiveness of traditional multimodal models like CLIP that rely solely on matched pairs for training. Gentle-CLIP transforms semi-supervised multimodal alignment into a manifold matching problem. It introduces a novel semantic density distribution (SDD) loss to capture implicit semantic alignment from unpaired data, along with multi-kernel maximum mean discrepancy (MK-MMD) and self-supervised contrastive loss to refine representation alignment and stability. Gentle-CLIP outperforms existing semi-supervised methods in protein representation tasks, achieving strong performance on fold classification, enzyme commission number prediction, and other benchmarks. In remote sensing tasks, Gentle-CLIP consistently improves zero-shot classification and image-text retrieval results compared to baselines, highlighting its ability to learn from limited matched pairs. Gentle-CLIP demonstrates promising results in general vision-language retrieval tasks, particularly with ViT as the image encoder on the Mini COCO dataset, indicating its broader applicability. The performance of Gentle-CLIP relies on the assumption that the semantic distributions of the unmatched data are sufficiently similar. Further exploration of augmentation techniques that consider common semantics across modalities could potentially enhance Gentle-CLIP's performance. multimodal alignment, semi-supervised learning, contrastive learning, manifold matching, clip
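Multi-kernel maximum mean discrepancy, used here to pull the two modality distributions together, is the MMD estimate computed with a sum of RBF kernels over several bandwidths; a minimal sketch using the biased estimator, with placeholder bandwidths.

```python
import torch

def mk_mmd(x, y, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Multi-kernel MMD^2 between two sets of embeddings x: (n, d) and y: (m, d),
    using a sum of RBF kernels with the given bandwidths (biased estimator)."""
    def rbf_sum(a, b):
        d2 = torch.cdist(a, b) ** 2                       # pairwise squared distances
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)
    k_xx = rbf_sum(x, x).mean()
    k_yy = rbf_sum(y, y).mean()
    k_xy = rbf_sum(x, y).mean()
    return k_xx + k_yy - 2 * k_xy

# toy usage: image vs. text embeddings with a deliberate offset
img_emb = torch.randn(64, 256)
txt_emb = torch.randn(64, 256) + 0.5
print(mk_mmd(img_emb, txt_emb))
```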
2406.05723 Report Binarized Diffusion Model for Image Super-Resolution Zheng Chen, Haotong Qin, Yong Guo, Xiongfei Su, Xin Yuan, Linghe Kong, Yulun Zhang Advanced diffusion models (DMs) perform impressively in image super-resolution (SR), but the high memory and computational costs hinder their deployment. Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating DMs. Nonetheless, due to the model structure and the multi-step iterative attribute of DMs, existing binarization methods result in significant performance degradation. In this paper, we introduce a novel binarized diffusion model, BI-DiffSR, for image SR. First, for the model structure, we design a UNet architecture optimized for binarization. We propose the consistent-pixel-downsample (CP-Down) and consistent-pixel-upsample (CP-Up) to maintain dimension consistency and facilitate the full-precision information transfer. Meanwhile, we design the channel-shuffle-fusion (CS-Fusion) to enhance feature fusion in skip connection. Second, for the activation difference across timesteps, we design the timestep-aware redistribution (TaR) and activation function (TaA). The TaR and TaA dynamically adjust the distribution of activations based on different timesteps, improving the flexibility and representation ability of the binarized module. Comprehensive experiments demonstrate that our BI-DiffSR outperforms existing binarization methods. Code is available at https://github.com/zhengchen1999/BI-DiffSR. This paper proposes BI-DiffSR, a novel binarized diffusion model for efficient and accurate image super-resolution. Diffusion models excel in image super-resolution but their high computational and memory demands hinder deployment on resource-constrained devices. Binarization offers a solution, but directly applying existing methods to diffusion models leads to significant performance degradation. BI-DiffSR introduces a UNet architecture tailored for binarization, featuring Consistent-Pixel Down/Upsampling (CP-Down/Up) for dimension consistency and Channel-Shuffle Fusion (CS-Fusion) for enhanced feature fusion. Additionally, it incorporates Timestep-Aware Redistribution (TaR) and Activation Function (TaA) to handle varying activation distributions across diffusion timesteps. BI-DiffSR significantly outperforms state-of-the-art binarization methods in image super-resolution tasks. It achieves comparable or even better perceptual quality than the full-precision diffusion model (SR3) while utilizing only 8.3% of the parameters and 20.8% of the computational operations. The proposed model effectively restores fine details and textures in challenging cases, as demonstrated through visual comparisons. The introduction of TaR and TaA, while improving performance, leads to increased parameters and training time. The fixed timestep grouping strategy in TaR and TaA may not be optimal for all modules due to non-uniform activation changes across timesteps. image super-resolution, diffusion models, binarization, model compression, unet architecture
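As background for the binarization discussed above, the sketch below shows a generic 1-bit weight-binarized linear layer trained with a straight-through estimator. The per-channel scaling and gradient clipping are common binarization conventions and are assumptions here; BI-DiffSR's CP-Down/Up, CS-Fusion, TaR, and TaA modules are not reproduced.

```python
# Hedged sketch: a sign-binarized linear layer with a straight-through estimator,
# the generic building block that binarized diffusion models are assembled from.
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)                       # {-1, +1} weights in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()   # pass gradients only inside [-1, 1]

class BinaryLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.01)

    def forward(self, x):
        scale = self.weight.abs().mean(dim=1, keepdim=True)   # per-output-channel scale
        w_bin = BinarizeSTE.apply(self.weight) * scale
        return x @ w_bin.t()

if __name__ == "__main__":
    layer = BinaryLinear(32, 16)
    out = layer(torch.randn(4, 32))
    out.sum().backward()                            # gradients flow via the STE
    print(out.shape, layer.weight.grad.shape)
```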
2406.05649 Report GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliaksandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Hsin-Ying Lee We propose a novel approach for 3D mesh reconstruction from multi-view images. Our method takes inspiration from large reconstruction models like LRM that use a transformer-based triplane generator and a Neural Radiance Field (NeRF) model trained on multi-view images. However, in our method, we introduce several important modifications that allow us to significantly enhance 3D reconstruction quality. First of all, we examine the original LRM architecture and find several shortcomings. Subsequently, we introduce respective modifications to the LRM architecture, which lead to improved multi-view image representation and more computationally efficient training. Second, in order to improve geometry reconstruction and enable supervision at full image resolution, we extract meshes from the NeRF field in a differentiable manner and fine-tune the NeRF model through mesh rendering. These modifications allow us to achieve state-of-the-art performance on both 2D and 3D evaluation metrics, such as a PSNR of 28.67 on Google Scanned Objects (GSO) dataset. Despite these superior results, our feed-forward model still struggles to reconstruct complex textures, such as text and portraits on assets. To address this, we introduce a lightweight per-instance texture refinement procedure. This procedure fine-tunes the triplane representation and the NeRF color estimation model on the mesh surface using the input multi-view images in just 4 seconds. This refinement improves the PSNR to 29.79 and achieves faithful reconstruction of complex textures, such as text. Additionally, our approach enables various downstream applications, including text- or image-to-3D generation. GTR, a novel 3D reconstruction model for generating high-quality meshes with faithful textures from multi-view images in seconds. Existing methods struggle to balance high-quality texture reconstruction with accurate geometry extraction. This work aims to improve both aspects of 3D reconstruction from multi-view images. The authors propose a three-pronged approach: 1. Modifying the standard LRM architecture for improved multi-view image representation and computational efficiency. 2. Introducing a two-stage training procedure using NeRF volume rendering for initialization, followed by geometry refinement via differentiable mesh rendering. 3. Implementing a per-instance texture refinement procedure for enhancing intricate details on the mesh surface. GTR achieves state-of-the-art performance on both 2D and 3D evaluation metrics, surpassing baselines like LRM and InstantMesh. The proposed method excels at reconstructing complex textures and fine details, including text and portraits. The model is computationally efficient, generating meshes within a second and requiring only four seconds for texture refinement. The current pipeline trains the convolutional encoder from scratch, potentially limiting convergence speed. Exploring pre-trained models like Stable Diffusion's autoencoder could be beneficial. The mesh rendering stage relies on NeRF for initialization. Investigating alternative methods like NeuS, which directly generates SDFs, might offer further improvements. 3d reconstruction, mesh generation, texture refinement, multi-view images, neural rendering
2406.05641 Report PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction Shangyu Chen, Zizheng Pan, Jianfei Cai, Dinh Phung Personalizing a large-scale pretrained Text-to-Image (T2I) diffusion model is challenging as it typically struggles to make an appropriate trade-off between its training data distribution and the target distribution, i.e., learning a novel concept with only a few target images to achieve personalization (aligning with the personalized target) while preserving text editability (aligning with diverse text prompts). In this paper, we propose PaRa, an effective and efficient Parameter Rank Reduction approach for T2I model personalization by explicitly controlling the rank of the diffusion model parameters to restrict its initial diverse generation space into a small and well-balanced target space. Our design is motivated by the fact that taming a T2I model toward a novel concept such as a specific art style implies a small generation space. To this end, by reducing the rank of model parameters during finetuning, we can effectively constrain the space of the denoising sampling trajectories towards the target. With comprehensive experiments, we show that PaRa achieves great advantages over existing finetuning approaches on single/multi-subject generation as well as single-image editing. Notably, compared to the prevailing fine-tuning technique LoRA, PaRa achieves better parameter efficiency (2x fewer learnable parameters) and much better target image alignment. This paper proposes PaRa, a novel parameter-efficient framework for personalizing text-to-image diffusion models through parameter rank reduction. Existing T2I personalization methods struggle to balance preserving text editability with aligning to target concepts. PaRa addresses this by explicitly controlling the diffusion model parameter rank to constrain image generation to a well-aligned space. PaRa reduces the rank of layer outputs during denoising sampling by introducing a low-rank learnable parameter, utilizing QR decomposition to form orthonormal bases. It also enables combining multiple individually fine-tuned PaRa weights for multi-subject generation. PaRa achieves better image alignment than LoRA and SVDiff while using fewer learnable parameters. The framework allows blending multiple personalized concepts for multi-subject generation without additional training on augmented data. PaRa facilitates stable single-image editing by directly modifying text prompts without requiring noise inversion. PaRa currently focuses on reducing the output space, potentially limiting customization requiring space expansion. Future work could explore methods for both space extension and reduction within the framework. text-to-image synthesis, diffusion models, model personalization, parameter rank reduction, image editing
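The summary above describes reducing the rank of layer outputs with a low-rank learnable parameter and QR-derived orthonormal bases. The sketch below shows one plausible reading of that idea, projecting a frozen layer's output onto the orthogonal complement of a learned rank-r subspace; the class name, the projection form, and where such a module would sit in the diffusion UNet are assumptions rather than PaRa's exact formulation.

```python
# Hedged sketch: reducing the rank of a frozen layer's output by projecting out
# a small learned subspace, with torch.linalg.qr supplying the orthonormal basis.
# This is an illustrative interpretation, not the paper's exact method.
import torch
import torch.nn as nn

class RankReducedLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # pretrained weights stay frozen
        d_out = base.out_features
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)   # the only trainable part

    def forward(self, x):
        y = self.base(x)                                 # (..., d_out)
        Q, _ = torch.linalg.qr(self.A)                   # (d_out, rank), orthonormal columns
        return y - (y @ Q) @ Q.t()                       # remove the learned rank-r subspace

if __name__ == "__main__":
    layer = RankReducedLinear(nn.Linear(64, 64), rank=4)
    out = layer(torch.randn(2, 64))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(out.shape, trainable)                          # only d_out * rank parameters train
```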
2406.05630 Report Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion Ge Ya Luo, Zhi Hao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal With recent advances in video prediction, controllable video generation has been attracting more attention. Generating high fidelity videos according to simple and flexible conditioning is of particular interest. To this end, we propose a controllable video generation model using pixel level renderings of 2D or 3D bounding boxes as conditioning. In addition, we also create a bounding box predictor that, given the initial and ending frames' bounding boxes, can predict up to 15 bounding boxes per frame for all the frames in a 25-frame clip. We perform experiments across 3 well-known AV video datasets: KITTI, Virtual-KITTI 2 and BDD100k. The paper introduces Ctrl-V, a novel model that generates controllable autonomous vehicle videos by conditioning on predicted sequences of 2D and 3D bounding boxes. Generating controllable, high-fidelity videos is crucial for applications like autonomous vehicle simulation, enabling realistic and customizable virtual environments for training and testing. Ctrl-V comprises two main components: 1) a diffusion-based bounding box predictor that forecasts object positions across frames and 2) a ControlNet-adapted video diffusion model that generates videos adhering to the predicted bounding box trajectories. Ctrl-V demonstrates the ability to generate high-fidelity videos that closely align with the provided bounding box conditions, as evidenced by quantitative metrics like FVD, LPIPS, SSIM, and PSNR. The bounding box predictor effectively forecasts bounding box trajectories, achieving high alignment scores with ground-truth labels, especially for the initial and final frames. The video generation component exhibits strong motion control capabilities, accurately depicting object movements and handling uninitialized objects appearing mid-video. The current evaluation metrics for bounding box predictions have limitations, as they rely on binary masks and do not consider object tracking IDs. Further investigation is needed to systematically analyze the model's ability to encode and utilize additional information, such as track IDs and 3D bounding box orientation. video generation, controllable video generation, diffusion models, bounding box prediction, autonomous driving
2406.05602 Report Can Prompt Modifiers Control Bias? A Comparative Analysis of Text-to-Image Generative Models Philip Wootaek Shin, Jihyun Janice Ahn, Wenpeng Yin, Jack Sampson, Vijaykrishnan Narayanan It has been shown that many generative models inherit and amplify societal biases. To date, there is no uniform/systematic agreed standard to control/adjust for these biases. This study examines the presence and manipulation of societal biases in leading text-to-image models: Stable Diffusion, DALL-E 3, and Adobe Firefly. Through a comprehensive analysis combining base prompts with modifiers and their sequencing, we uncover the nuanced ways these AI technologies encode biases across gender, race, geography, and region/culture. Our findings reveal the challenges and potential of prompt engineering in controlling biases, highlighting the critical need for ethical AI development promoting diversity and inclusivity. This work advances AI ethics by not only revealing the nuanced dynamics of bias in text-to-image generation models but also by offering a novel framework for future research in controlling bias. Our contributions, spanning comparative analyses, the strategic use of prompt modifiers, the exploration of prompt sequencing effects, and the introduction of a bias sensitivity taxonomy, lay the groundwork for the development of common metrics and standard analyses for evaluating whether and how future AI models exhibit and respond to requests to adjust for inherent biases. This paper investigates societal biases in text-to-image models (Stable Diffusion, DALL·E 3, Adobe Firefly) and explores if prompt engineering with modifiers can control these biases. Understanding and mitigating biases in AI models is crucial to ensure they are fair, inclusive, and do not perpetuate harmful stereotypes. The researchers analyzed image outputs for various prompts, including base prompts with added modifiers, to examine bias representation across gender, race, geography, and culture. Prompt modifiers can sometimes adjust bias, but simplistic use is not always effective, highlighting the need for more sophisticated strategies. Some model biases are resistant to control through prompt engineering, demonstrating the deep-rooted nature of these biases. Prompt sequencing, i.e., the order of base prompt and modifier, can significantly impact the generated images and bias representation. The study was limited by a small image dataset and the lack of external human evaluation for bias assessment. Future work could focus on developing more robust bias-control mechanisms and conducting large-scale human evaluations to assess bias in a nuanced way. ai bias, text-to-image generation, prompt engineering, ethical ai, bias mitigation
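To make the base-prompt/modifier/sequencing analysis concrete, here is a small sketch that enumerates the prompt grid such a study could send to each text-to-image model. The base prompts, modifiers, and field names are illustrative placeholders, not the paper's actual prompt set.

```python
# Hedged sketch: building a base-prompt + modifier + ordering grid, so the same
# prompt variants can be sent to several text-to-image models for comparison.
from itertools import product

base_prompts = ["a photo of a doctor", "a photo of an engineer"]   # illustrative
modifiers = ["", "of diverse genders", "from around the world"]    # illustrative
orders = ["modifier_last", "modifier_first"]

def compose(base: str, modifier: str, order: str) -> str:
    if not modifier:
        return base
    return f"{base}, {modifier}" if order == "modifier_last" else f"{modifier}, {base}"

prompt_grid = []
seen = set()
for b, m, o in product(base_prompts, modifiers, orders):
    p = compose(b, m, o)
    if p in seen:
        continue                       # empty modifier makes both orders identical
    seen.add(p)
    prompt_grid.append({"base": b, "modifier": m, "order": o, "prompt": p})

for row in prompt_grid:
    print(row["prompt"])               # each variant would go to SD / DALL-E 3 / Firefly
```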
2406.05478 Report Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, Gao Huang The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their inferior performance compared to diffusion models. In this paper, we aim to re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies. Specifically, we identify the complexities in properly configuring these strategies and indicate the possible sub-optimality in existing heuristic-driven designs. Recognizing this, we propose to go beyond existing methods by directly solving the optimal strategies in an automatic framework. The resulting method, named AutoNAT, advances the performance boundaries of NATs notably, and is able to perform comparably with the latest diffusion models at a significantly reduced inference cost. The effectiveness of AutoNAT is validated on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M. Our code is available at https://github.com/LeapLabTHU/ImprovedNAT. This paper proposes AutoNAT, a novel method to automatically search for optimal training and generation strategies for Non-Autoregressive Transformers (NATs) in image synthesis, improving their performance and efficiency. NATs offer fast image generation but often lag behind diffusion models in quality due to sub-optimal, heuristically designed training and generation strategies. AutoNAT formulates the optimal strategy design as a unified optimization problem and solves it using an alternating optimization algorithm for efficient exploration of the strategy space. AutoNAT significantly enhances NATs' performance, achieving results comparable to state-of-the-art diffusion models. AutoNAT achieves approximately 5x inference speedup compared to diffusion models without sacrificing performance. The study highlights that optimizing both training and generation strategies is crucial for NATs, with the latter demonstrating a larger impact. The paper mainly focuses on optimizing Beta distribution for training strategy; exploring other distributions could be beneficial. The current work primarily explores image generation; extending AutoNAT to other domains like audio or video generation is a promising direction. image synthesis, non-autoregressive transformers, diffusion models, hyperparameter optimization, generative models
2406.05338 Report MotionClone: Training-Free Motion Cloning for Controllable Video Generation Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin Motion-based controllable text-to-video generation involves motions to control the video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that enables motion cloning from a reference video to control text-to-video generation. We employ temporal attention in video inversion to represent the motions in the reference video and introduce primary temporal-attention guidance to mitigate the influence of noisy or very subtle motions within the attention weights. Furthermore, to assist the generation model in synthesizing reasonable spatial relationships and enhance its prompt-following capability, we propose a location-aware semantic guidance mechanism that leverages the coarse location of the foreground from the reference video and original classifier-free guidance features to guide the video generation. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency. MotionClone, a training-free framework, clones motion from reference videos for controllable text-to-video generation using temporal attention. Addresses limitations of existing motion-guided video generation methods that require motion-specific training or fine-tuning, leading to suboptimal results outside the trained domain. Uses temporal attention in video inversion to represent motion from a reference video and guides video generation through primary temporal-attention guidance and location-aware semantic guidance. Effectively clones both global camera motion and local object motion. Achieves superior motion fidelity and textual alignment compared to existing methods. Demonstrates strong temporal consistency in generated videos. Motion in the reference video must be suitable for the objects in the new prompt to avoid unrealistic outputs. Some generated samples may still retain minor structural elements from the reference video. text-to-video generation, motion cloning, temporal attention, video diffusion models, controllable video generation
2406.05271 Report USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation Xiaoqi Wang, Wenbin He, Xiwei Xuan, Clint Sebastian, Jorge Piazentin Ono, Xin Li, Sima Behpour, Thang Doan, Liang Gou, Han Wei Shen, Liu Ren The open-vocabulary image segmentation task involves partitioning images into semantically meaningful segments and classifying them with flexible text-defined categories. The recent vision-based foundation models such as the Segment Anything Model (SAM) have shown superior performance in generating class-agnostic image segments. The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories. In this paper, we introduce the Universal Segment Embedding (USE) framework to address this challenge. This framework is comprised of two key components: 1) a data pipeline designed to efficiently curate a large amount of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories. The USE model can not only help open-vocabulary image segmentation but also facilitate other downstream tasks (e.g., querying and ranking). Through comprehensive experimental studies on semantic segmentation and part segmentation benchmarks, we demonstrate that the USE framework outperforms state-of-the-art open-vocabulary segmentation methods. This paper introduces the Universal Segment Embedding (USE) framework for open-vocabulary image segmentation, which can classify image segments into text-defined categories in a zero-shot manner. Open-vocabulary image segmentation is crucial for real-world applications requiring flexible and adaptable segmentation models. Existing methods struggle to fully utilize segments from foundation models like SAM. The USE framework includes: (1) a data pipeline that automatically generates segment-text pairs with rich semantics at multiple granularities from existing datasets and (2) a lightweight segment embedding model that learns to align segment and text embeddings in a joint vision-language space. USE outperforms state-of-the-art two-stage open-vocabulary semantic segmentation methods on ADE20K and Pascal Context benchmarks. USE demonstrates strong performance on open-vocabulary part segmentation, exceeding VLPart trained on human-annotated parts data. Ablation studies show the benefits of combining CLIP and DINOv2 in the image encoder and incorporating the cls token for improved performance. The current implementation of USE relies on SAM for segment generation, inheriting limitations in capturing parts with blurry boundaries. Future work can explore more sophisticated architectures for the segment embedding head, such as prompt encoders or cross-attention mechanisms. open-vocabulary image segmentation, segment embedding, zero-shot learning, vision-language models, foundation models
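The core classification step described above, matching class-agnostic segments to text-defined categories in a shared embedding space, can be sketched as a cosine-similarity lookup. Random vectors stand in for the USE segment encoder and the text encoder, and the class names are placeholders.

```python
# Hedged sketch: zero-shot classification of class-agnostic segments by cosine
# similarity between segment embeddings and text embeddings in a joint space.
import torch
import torch.nn.functional as F

num_segments, num_classes, dim = 5, 3, 512
segment_emb = F.normalize(torch.randn(num_segments, dim), dim=-1)   # from segments (e.g. SAM masks)
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)       # from text-defined categories
class_names = ["dog", "bicycle", "traffic light"]                    # open-vocabulary label set

logits = segment_emb @ text_emb.t()            # (num_segments, num_classes) cosine similarities
pred = logits.argmax(dim=-1)
for i, c in enumerate(pred.tolist()):
    print(f"segment {i} -> {class_names[c]} (score {float(logits[i, c]):.3f})")
```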
2406.05184 Report The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei Koh, Ranjay Krishna Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. What additional value does the intermediate generator provide over directly training on relevant parts of the upstream data? Grounding this question in the setting of image classification, we compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion -- a generative model trained on the LAION-2B dataset -- against finetuning on targeted real images retrieved directly from LAION-2B. We show that while synthetic data can benefit some downstream tasks, it is universally matched or outperformed by real data from our simple retrieval baseline. Our analysis suggests that this underperformance is partially due to generator artifacts and inaccurate task-relevant visual details in the synthetic images. Overall, we argue that retrieval is a critical baseline to consider when training with synthetic data -- a baseline that current methods do not yet surpass. We release code, data, and models at https://github.com/scottgeng00/unmet-promise. This paper investigates whether training on synthetic images generated by text-to-image models like Stable Diffusion offers any benefits over directly training on relevant subsets of the original data used to train the generator (e.g., LAION-2B). The use of synthetic data for training vision models is on the rise, but it's crucial to understand if it provides any advantages over directly leveraging the generator's original training data. The authors curate targeted synthetic datasets by prompting Stable Diffusion and targeted real datasets by retrieving from LAION-2B. They finetune a pretrained CLIP model on both types of data and compare their performance on five image classification benchmarks. Training on targeted real data consistently matches or outperforms training on targeted synthetic data from Stable Diffusion at equivalent data scales. Increasing the scale of synthetic data does not always close the performance gap and can sometimes even hurt performance. Analysis suggests that generator artifacts and distorted visual details in synthetic images contribute to their lower performance. Compute limitations restricted the exploration of various pretrained backbones and adaptation methods. The study primarily focused on Stable Diffusion due to the availability of its training data (LAION-2B) for retrieval. synthetic data, image classification, data augmentation, text-to-image generation, stable diffusion
2406.05132 Report 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io The paper introduces 3D-GRAND, a large-scale dataset with 40,087 household scenes and 6.2 million densely grounded scene-language instructions, and 3D-POPE, a benchmark for evaluating object hallucination in 3D-LLMs. Existing 3D-LLMs lack large-scale, densely grounded datasets crucial for tasks like robotics and suffer from object hallucination, hindering their reliability and interpretability. 3D-GRAND leverages LLMs (GPT-4) for scalable and cost-effective dense grounding annotation of synthetic 3D scenes. 3D-POPE uses a polling-based approach with existence questions to assess object hallucination in 3D-LLMs. Training with 3D-GRAND significantly reduces object hallucination in 3D-LLMs. Densely grounded instruction tuning with 3D-GRAND improves the grounding capabilities of 3D-LLMs, achieving state-of-the-art performance on ScanRefer. Scaling densely grounded data consistently improves grounding accuracy and reduces hallucination, with promising results for sim-to-real transfer from synthetic to real-world 3D scenes. The work focuses on room-level 3D-Text pairs, lacking part-level and beyond-room-level annotations. 3D-POPE evaluation is limited to ScanNet scenes and does not include synthetic datasets or more diverse indoor environments. 3d vision-language, dense grounding, object hallucination, 3d-llm, sim-to-real transfer
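A polling-style hallucination benchmark like 3D-POPE ultimately reduces to scoring yes/no existence questions. The sketch below computes common POPE-style metrics (accuracy, precision, recall, yes-ratio) from such records; the exact metric set and record format used by 3D-POPE are assumptions here.

```python
# Hedged sketch: scoring a polling-based hallucination benchmark. Each record is
# an existence question ("Is there a <object> in this scene?") with a yes/no
# ground truth and the 3D-LLM's yes/no answer.
from typing import Dict, List

def pope_scores(records: List[Dict]) -> Dict[str, float]:
    tp = sum(r["gt"] == "yes" and r["pred"] == "yes" for r in records)
    tn = sum(r["gt"] == "no" and r["pred"] == "no" for r in records)
    fp = sum(r["gt"] == "no" and r["pred"] == "yes" for r in records)   # hallucinated objects
    fn = sum(r["gt"] == "yes" and r["pred"] == "no" for r in records)
    total = len(records)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "yes_ratio": (tp + fp) / total,   # a very high yes-ratio often signals hallucination
    }

records = [
    {"gt": "yes", "pred": "yes"}, {"gt": "no", "pred": "yes"},
    {"gt": "no", "pred": "no"},  {"gt": "yes", "pred": "no"},
]
print(pope_scores(records))
```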
2406.05082 Report CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion Xingrui Wang, Xin Li, Zhibo Chen Tuning-free long video diffusion has been proposed to generate extended-duration videos with enriched content by reusing the knowledge from pre-trained short video diffusion model without retraining. However, most works overlook the fine-grained long-term video consistency modeling, resulting in limited scene consistency (i.e., unreasonable object or background transitions), especially with multiple text inputs. To mitigate this, we propose the Consistency Noise Injection, dubbed CoNo, which introduces the "look-back" mechanism to enhance the fine-grained scene transition between different video clips, and designs the long-term consistency regularization to eliminate the content shifts when extending video contents through noise prediction. In particular, the "look-back" mechanism breaks the noise scheduling process into three essential parts, where one internal noise prediction part is injected into two video-extending parts, intending to achieve a fine-grained transition between two video clips. The long-term consistency regularization focuses on explicitly minimizing the pixel-wise distance between the predicted noises of the extended video clip and the original one, thereby preventing abrupt scene transitions. Extensive experiments have shown the effectiveness of the above strategies by performing long-video generation under both single- and multi-text prompt conditions. The project has been available in https://wxrui182.github.io/CoNo.github.io/. Proposes Consistency Noise Injection (CoNo), a tuning-free long video diffusion method that enhances long-term consistency in generated videos, especially under multiple text prompts. Addresses limitations in existing tuning-free long video generation methods, such as coarse transitions between video clips and lack of explicit long-term content consistency modeling. Introduces a 'look-back' mechanism with customized noise shuffling strategies to ensure fine-grained transitions between video clips and proposes long-term consistency regularization to minimize content shifts in extended videos. Achieves state-of-the-art scene consistency and perceptual quality in long video generation. Demonstrates effectiveness under both single- and multi-text prompt conditions. Outperforms existing methods in quantitative metrics such as FVD, KVD, CLIP-Image, and CLIP-Text, and receives higher ratings in human evaluation for semantic alignment, content consistency, realism, and preference. Performance might be limited by the capabilities of the pre-trained base video generation model. Future work includes exploring prompt engineering to further enhance the continuity and semantic coherence of generated long videos. video generation, long video diffusion, text-to-video, scene consistency, tuning-free
2406.05038 Report Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs Shentong Mo Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D shape generation, particularly at high resolutions, remains underexplored. Traditional diffusion transformers (DiT) with self-attention mechanisms, despite their potential, face scalability challenges due to the cubic complexity of attention operations as input length increases. This complexity becomes a significant hurdle when dealing with high-resolution voxel sizes. To address this challenge, we introduce a novel diffusion architecture tailored for 3D point clouds generation-Diffusion Mamba (DiM-3D). This architecture forgoes traditional attention mechanisms, instead utilizing the inherent efficiency of the Mamba architecture to maintain linear complexity with respect to sequence length. DiM-3D is characterized by fast inference times and substantially lower computational demands, quantified in reduced Gflops, thereby addressing the key scalability issues of prior models. Our empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes. Additionally, DiM-3D shows superior capabilities in tasks like 3D point cloud completion. This not only proves the model's scalability but also underscores its efficiency in generating detailed, high-resolution voxels necessary for advanced 3D shape modeling, particularly excelling in environments requiring high-resolution voxel sizes. Through these findings, we illustrate the exceptional scalability and efficiency of the Diffusion Mamba framework in 3D shape generation, setting a new standard for the field and paving the way for future explorations in high-resolution 3D modeling technologies. Introduces DiM-3D, a novel diffusion mamba architecture for efficient and scalable 3D point cloud generation, addressing the computational challenges of traditional methods. High-resolution 3D shape generation is crucial for various applications, but existing methods struggle with scalability and efficiency. DiM-3D tackles these limitations. Leverages the Mamba architecture's selective state space approach to maintain linear complexity with sequence length, enabling efficient handling of high-resolution voxel data. Achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes on the ShapeNet benchmark. Demonstrates superior results in 3D point cloud completion tasks, highlighting its capacity for conditional generation. Exhibits strong scalability, with performance improvements observed with increasing model size and the number of classes. Computational demands, while reduced, might still pose challenges in resource-constrained environments, particularly with extremely high-resolution data. Model's generalizability might be affected by the quality and diversity of the training data, potentially limiting its applicability in scenarios with limited or biased data. 3d shape generation, point cloud generation, diffusion models, mamba architecture, state space models
2406.05000 Report AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation Lianyu Pang, Jian Yin, Baoquan Zhao, Feize Wu, Fu Lee Wang, Qing Li, Xudong Mao Recent advances in text-to-image models have enabled high-quality personalized image synthesis of user-provided concepts with flexible textual control. In this work, we analyze the limitations of two primary techniques in text-to-image personalization: Textual Inversion and DreamBooth. When integrating the learned concept into new prompts, Textual Inversion tends to overfit the concept, while DreamBooth often overlooks it. We attribute these issues to the incorrect learning of the embedding alignment for the concept. We introduce AttnDreamBooth, a novel approach that addresses these issues by separately learning the embedding alignment, the attention map, and the subject identity in different training stages. We also introduce a cross-attention map regularization term to enhance the learning of the attention map. Our method demonstrates significant improvements in identity preservation and text alignment compared to the baseline methods. This paper proposes AttnDreamBooth, a novel text-to-image personalization approach that addresses limitations in embedding alignment found in Textual Inversion and DreamBooth. Balancing identity preservation and text alignment in personalized image synthesis remains a challenge, hindering the generation of high-quality personalized images with flexible textual control. AttnDreamBooth separates the learning of embedding alignment, attention map, and subject identity into three stages: 1) optimizing textual embedding for alignment, 2) fine-tuning cross-attention layers for attention map refinement, and 3) fine-tuning the entire U-Net for subject identity. It also introduces a cross-attention map regularization term for enhanced attention map learning. AttnDreamBooth demonstrates superior performance in identity preservation and text alignment compared to baseline methods. It enables text-aligned personalized image generation, even with complex prompts. User study shows a clear preference for AttnDreamBooth over baselines in terms of identity preservation and text alignment. The current implementation uses consistent training steps across different concepts, potentially limiting performance for certain concepts. The three-stage training method requires approximately 20 minutes on average to learn a concept. text-to-image personalization, dreambooth, textual inversion, attention map, embedding alignment
2406.04906 Report RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection Liting Huang, Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Shoujin Wang The recent advancements in generative AI models, which can create realistic and human-like content, are significantly transforming how people communicate, create, and work. While the appropriate use of generative AI models can benefit the society, their misuse poses significant threats to data reliability and authentication. However, due to a lack of aligned multimodal datasets, effective and robust methods for detecting machine-generated content are still in the early stages of development. In this paper, we introduce RU-AI, a new large-scale multimodal dataset designed for the robust and efficient detection of machine-generated content in text, image, and voice. Our dataset is constructed from three large publicly available datasets: Flickr8K, COCO, and Places205, by combining the original datasets and their corresponding machine-generated pairs. Additionally, experimental results show that our proposed unified model, which incorporates a multimodal embedding module with a multilayer perceptron network, can effectively determine the origin of the data (i.e., original data samples or machine-generated ones) from RU-AI. However, future work is still required to address the remaining challenges posed by RU-AI. The source code and dataset are available at https://github.com/ZhihaoZhang97/RU-AI. This paper introduces RU-AI, a large-scale multimodal dataset of aligned text, image, and voice samples for detecting machine-generated content, together with a unified detection model. Misuse of generative AI threatens data reliability and authentication, yet robust detection methods remain underdeveloped because aligned multimodal datasets are scarce. The dataset is built from Flickr8K, COCO, and Places205 by pairing the original samples with their machine-generated counterparts; a unified model combining a multimodal embedding module with a multilayer perceptron network is trained to classify each sample's origin. The proposed unified model can effectively determine whether samples in RU-AI are original or machine-generated. Future work is still required to address the remaining challenges posed by RU-AI. The source code and dataset are released at https://github.com/ZhihaoZhang97/RU-AI to support further research. machine-generated content detection, multimodal dataset, generative ai, data authentication, deepfake detection
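The unified detector described in the abstract, a multimodal embedding module feeding a multilayer perceptron, can be sketched as follows. The embedding dimensions, the frozen-encoder stubs, and the two-class head are illustrative assumptions rather than the released RU-AI model.

```python
# Hedged sketch: a multimodal-embedding + MLP detector that predicts whether a
# sample is original or machine-generated. Frozen encoders are stubbed with
# random features so the sketch runs standalone.
import torch
import torch.nn as nn

class RealVsGeneratedMLP(nn.Module):
    def __init__(self, text_dim=512, image_dim=512, audio_dim=512, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),          # classes: {original, machine-generated}
        )

    def forward(self, text_emb, image_emb, audio_emb):
        fused = torch.cat([text_emb, image_emb, audio_emb], dim=-1)
        return self.head(fused)

model = RealVsGeneratedMLP()
logits = model(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(logits, labels)
print(logits.shape, float(loss))
```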
2406.04888 Report Zero-Shot Video Editing through Adaptive Sliding Score Distillation Lianghan Zhu, Yanqi Bao, Jing Huo, Jing Wu, Yu-Kun Lai, Wenbin Li, Yang Gao The burgeoning field of text-based video generation (T2V) has reignited significant interest in the research of controllable video editing. Although pre-trained T2V-based editing models have achieved efficient editing capabilities, current works are still plagued by two major challenges. Firstly, the inherent limitations of T2V models lead to content inconsistencies and motion discontinuities between frames. Secondly, the notorious issue of over-editing significantly disrupts areas that are intended to remain unaltered. To address these challenges, our work aims to explore a robust video-based editing paradigm based on score distillation. Specifically, we propose an Adaptive Sliding Score Distillation strategy, which not only enhances the stability of T2V supervision but also incorporates both global and local video guidance to mitigate the impact of generation errors. Additionally, we modify the self-attention layers during the editing process to further preserve the key features of the original video. Extensive experiments demonstrate that these strategies enable us to effectively address the aforementioned challenges, achieving superior editing performance compared to existing state-of-the-art methods. This paper proposes ASSD, a novel score distillation-based video editing method, enhancing editing quality and addressing limitations in current text-to-video generation models. Existing text-based video editing methods suffer from content inconsistencies, motion discontinuities, and over-editing. This work aims to address these challenges using a robust score distillation-based paradigm. The paper introduces Adaptive Sliding Score Distillation (ASSD) for robust video editing. It uses a sliding window approach for smoothing gradient information and incorporates a weighted attention fusion mechanism to preserve details from the original video. Additionally, it leverages Stable Diffusion for joint guidance in updating the latent code. ASSD effectively reduces contaminations and preserves original video content. The weighted attention fusion mechanism further improves editing quality by preserving details. Joint guidance from Stable Diffusion enhances the accuracy of update gradients. The performance heavily relies on the capability of the text-to-video model used. The lack of sufficiently powerful open-source text-to-video models limits the method's potential. video editing, text-to-video generation, score distillation, diffusion models, adaptive sliding window
2406.04875 Report 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views Xiaobiao Du, Haiyang Sun, Shuyun Wang, Zhuojie Wu, Hongwei Sheng, Jiaying Ying, Ming Lu, Tianqing Zhu, Kun Zhan, Xin Yu 3D cars are commonly used in self-driving systems, virtual/augmented reality, and games. However, existing 3D car datasets are either synthetic or low-quality, presenting a significant gap toward the high-quality real-world 3D car datasets and limiting their applications in practical scenarios. In this paper, we propose the first large-scale 3D real car dataset, termed 3DRealCar, offering three distinctive features. (1) High-Volume: 2,500 cars are meticulously scanned by 3D scanners, obtaining car images and point clouds with real-world dimensions; (2) High-Quality: Each car is captured in an average of 200 dense, high-resolution 360-degree RGB-D views, enabling high-fidelity 3D reconstruction; (3) High-Diversity: The dataset contains various cars from over 100 brands, collected under three distinct lighting conditions, including reflective, standard, and dark. Additionally, we offer detailed car parsing maps for each instance to promote research in car parsing tasks. Moreover, we remove background point clouds and standardize the car orientation to a unified axis for the reconstruction only on cars without background and controllable rendering. We benchmark 3D reconstruction results with state-of-the-art methods across each lighting condition in 3DRealCar. Extensive experiments demonstrate that the standard lighting condition part of 3DRealCar can be used to produce a large number of high-quality 3D cars, improving various 2D and 3D tasks related to cars. Notably, our dataset brings insight into the fact that recent 3D reconstruction methods face challenges in reconstructing high-quality 3D cars under reflective and dark lighting conditions. Our dataset is available at https://xiaobiaodu.github.io/3drealcar/. This paper introduces 3DRealCar, the first large-scale dataset of 3D real cars, offering high volume (2,500 instances), high quality (dense, high-resolution 360-degree RGB-D views), and high diversity (100+ brands, 3 lighting conditions). Existing 3D car datasets are limited by being synthetic or low-quality, hindering real-world applications like autonomous driving simulations and realistic 3D modeling. Cars were scanned using 3D scanners on smartphones, capturing dense RGB-D images and point clouds. Data preprocessing included background removal, orientation rectification, and point cloud rescaling. The dataset was annotated with car brand, type, color, and parsing maps. 3DRealCar enables high-quality 3D car reconstruction, especially under standard lighting, as benchmarked with state-of-the-art methods. Existing methods struggle with reconstructing cars under reflective and dark lighting conditions, posing a new challenge for future research. 3DRealCar enhances the performance of 3D generation and novel view synthesis models by providing real-car priors, improving realism. 3DRealCar currently only includes car exterior views, limiting its use for interior modeling. Future work includes expanding the dataset with interior views and exploring methods for robust reconstruction under challenging lighting. 3d reconstruction, dataset, autonomous driving, car modeling, computer vision
2406.04746 Report PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction Eduard Poesina, Adriana Valentina Costache, Adrian-Gabriel Chifu, Josiane Mothe, Radu Tudor Ionescu Text-to-image generation has recently emerged as a viable alternative to text-to-image retrieval, due to the visually impressive results of generative diffusion models. Although query performance prediction is an active research topic in information retrieval, to the best of our knowledge, there is no prior study that analyzes the difficulty of queries (prompts) in text-to-image generation, based on human judgments. To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. In order to determine the difficulty of the same prompts in image retrieval, we also collect manual annotations that represent retrieval performance. We thus propose the first benchmark for joint text-to-image prompt and query performance prediction, comprising 10K queries. Our benchmark enables: (i) the comparative assessment of the difficulty of prompts/queries in image generation and image retrieval, and (ii) the evaluation of prompt/query performance predictors addressing both generation and retrieval. We present results with several pre-generation/retrieval and post-generation/retrieval performance predictors, thus providing competitive baselines for future research. Our benchmark and code is publicly available under the CC BY 4.0 license at https://github.com/Eduard6421/PQPP. This paper introduces PQPP, the first manually annotated benchmark for evaluating the difficulty of prompts in text-to-image generation and retrieval. This benchmark enables comparative analysis of prompt difficulty across generation and retrieval tasks, facilitating the development of better performance predictors for text-to-image models. Researchers collected over 1.5M human relevance judgments for 10K prompts/queries, covering both image generation (using Stable Diffusion and GLIDE) and retrieval (using CLIP and BLIP-2). Low correlation between generation and retrieval performance suggesting a need for task-specific predictors. Fine-tuned CLIP model achieves the highest correlation with human judgments for image generation. Fine-tuned BERT model provides strong baseline for both generation and retrieval, especially for retrieval precision. Subjectivity in human interpretation of prompts for image generation may introduce variability. Ground-truth image bank for retrieval relies on caption-based pre-filtering potentially missing relevant images. text-to-image generation, text-to-image retrieval, prompt performance prediction, query performance prediction, benchmark
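Evaluating a prompt/query performance predictor of the kind benchmarked here typically comes down to correlating predicted difficulty with the human-derived performance scores. A minimal sketch, using synthetic scores in place of the annotated PQPP prompts and standard Pearson/Kendall correlations:

```python
# Hedged sketch: correlating a predictor's per-prompt difficulty scores with
# human-derived performance, the usual way QPP baselines are evaluated.
# Synthetic scores replace the 10K annotated PQPP prompts here.
import numpy as np
from scipy.stats import kendalltau, pearsonr

rng = np.random.default_rng(0)
human_perf = rng.random(100)                               # e.g. per-prompt annotated performance
predicted = human_perf + 0.3 * rng.standard_normal(100)    # a noisy predictor's scores

print("Pearson r:", round(pearsonr(predicted, human_perf)[0], 3))
print("Kendall tau:", round(kendalltau(predicted, human_perf)[0], 3))
```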
2406.04675 Report OVMR: Open-Vocabulary Recognition with Multi-Modal References Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian The challenge of open-vocabulary recognition lies in the fact that the model has no clue about the new categories it is applied to. Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning, providing category names or textual descriptions to Vision-Language Models. Fine-tuning is time-consuming and degrades the generalization capability. Textual descriptions could be ambiguous and fail to depict visual details. This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images. Our method, named OVMR, adopts two innovative components to pursue a more robust category cues embedding. A multi-modal classifier is first generated by dynamically complementing textual descriptions with image exemplars. A preference-based refinement module is hence applied to fuse uni-modal and multi-modal classifiers, with the aim to alleviate issues of low-quality exemplar images or textual descriptions. The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet. Extensive experiments have demonstrated the promising performance of OVMR, e.g., it outperforms existing methods across various scenarios and setups. Codes are publicly available at https://github.com/Zehong-Ma/OVMR. This paper presents OVMR, a plug-and-play module that enhances the open-vocabulary recognition capabilities of Vision-Language Models (VLMs) by embedding multi-modal clues (textual descriptions and exemplar images) of novel classes. Open-vocabulary recognition is challenging because models have no prior knowledge of unseen categories. Existing methods suffer from limitations like inflexibility, time-consuming fine-tuning, ambiguity in textual descriptions, and varying quality of exemplar images. OVMR consists of two modules: 1) A multi-modal classifier generation module that extracts visual tokens from exemplars using a lightweight visual token generator and dynamically fuses them with textual descriptions using a language encoder. 2) A preference-based fusion module that evaluates the performance of uni-modal and multi-modal classifiers on exemplar images and dynamically fuses them based on their performance. OVMR achieves comparable performance to state-of-the-art prompt learning methods on 11 classification datasets without requiring fine-tuning. It outperforms existing few-shot adaptation methods, demonstrating significant improvements on complex datasets like ImageNet. In open-vocabulary detection, OVMR surpasses previous methods on the LVIS dataset, showing the effectiveness of multi-modal clue embedding. The preference-based fusion may have limitations when using very few exemplar images for evaluation. Future work could explore extending OVMR to other open-vocabulary recognition tasks beyond classification and detection. open-vocabulary recognition, vision-language models, multi-modal learning, few-shot learning, classifier fusion
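The preference-based fusion described above can be sketched as weighting a text-only classifier and a multi-modal classifier by how well each one scores the exemplar images. The accuracy-softmax weighting, the temperature, and the toy tensors below are illustrative choices, not necessarily OVMR's exact rule.

```python
# Hedged sketch: fusing uni-modal and multi-modal classifiers based on how well
# each scores the few exemplar images (a "preference" signal).
import torch
import torch.nn.functional as F

def preference_fuse(text_cls, mm_cls, exemplar_feats, exemplar_labels, feats, temp=0.1):
    """text_cls / mm_cls: (num_classes, dim) classifier weights; feats: (n, dim) test features."""
    def accuracy(cls_w):
        sims = F.normalize(exemplar_feats, dim=-1) @ F.normalize(cls_w, dim=-1).t()
        return (sims.argmax(-1) == exemplar_labels).float().mean()

    prefs = torch.stack([accuracy(text_cls), accuracy(mm_cls)])
    w = torch.softmax(prefs / temp, dim=0)                            # preference weights
    fused = w[0] * F.normalize(text_cls, dim=-1) + w[1] * F.normalize(mm_cls, dim=-1)
    return F.normalize(feats, dim=-1) @ F.normalize(fused, dim=-1).t()  # test logits

num_classes, dim = 5, 128
logits = preference_fuse(
    torch.randn(num_classes, dim), torch.randn(num_classes, dim),
    torch.randn(20, dim), torch.randint(0, num_classes, (20,)),
    torch.randn(4, dim),
)
print(logits.shape)   # (4, num_classes)
```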
2406.04662 Report Evaluating and Mitigating IP Infringement in Visual Generative AI Zhenting Wang, Chen Chen, Vikash Sehwag, Minzhou Pan, Lingjuan Lyu The popularity of visual generative AI models like DALL-E 3, Stable Diffusion XL, Stable Video Diffusion, and Sora has been increasing. Through extensive evaluation, we discovered that the state-of-the-art visual generative models can generate content that bears a striking resemblance to characters protected by intellectual property rights held by major entertainment companies (such as Sony, Marvel, and Nintendo), which raises potential legal concerns. This happens when the input prompt contains the character's name or even just descriptive details about their characteristics. To mitigate such IP infringement problems, we also propose a defense method against it. In detail, we develop a revised generation paradigm that can identify potentially infringing generated content and prevent IP infringement by utilizing guidance techniques during the diffusion process. It has the capability to recognize generated content that may be infringing on intellectual property rights, and mitigate such infringement by employing guidance methods throughout the diffusion process without retrain or fine-tune the pretrained models. Experiments on well-known character IPs like Spider-Man, Iron Man, and Superman demonstrate the effectiveness of the proposed defense method. Our data and code can be found at https://github.com/ZhentingWang/GAI_IP_Infringement. This paper presents a systematic evaluation of the risk of intellectual property (IP) infringement in state-of-the-art visual generative AI models, particularly focusing on their ability to generate images resembling copyrighted characters, even without explicitly mentioning their names. The authors also propose a mitigation method to address this problem. With the increasing adoption of visual generative AI models, their potential for IP infringement poses serious legal and ethical challenges. This work is important as it highlights the severity of these issues and proposes a method for mitigating them, contributing to the responsible development and deployment of these technologies. The authors construct a benchmark of popular copyrighted characters and use a large language model (GPT-4) to craft descriptive prompts that could trigger IP infringement without directly naming the characters. They evaluate seven popular text-to-image and text-to-video generation models for their IP infringement rates. For mitigation, they propose a method combining name blocking, large vision-language model (GPT-4V) detection of infringing content, and classifier-free guidance to steer the generation process away from infringing outputs. The evaluation reveals a high prevalence of IP infringement in both open-source and commercial visual generative AI models, with near 100% infringement rates when character names are explicitly mentioned in prompts. Even with descriptive prompts avoiding character names, the models still exhibit high infringement rates, highlighting the severity of the issue. The proposed mitigation method effectively reduces IP infringement rates while maintaining language-image alignment quality, demonstrating its potential for enabling more responsible content generation. The evaluation primarily focuses on a limited set of characters and visual generative models. Expanding the scope to encompass a wider range of IP-protected content and models would provide a more comprehensive understanding of the problem. 
The reliance on large language and vision-language models for mitigation introduces dependencies on the capabilities and potential biases of these models. Exploring alternative or complementary approaches for detecting and mitigating IP infringement could further enhance the robustness of the proposed solution. ai ethics, intellectual property, visual generative ai, diffusion models, content moderation
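The first stage of the defense described above, blocking prompts that name protected characters before any generation happens, can be sketched as a simple blocklist filter. The blocklist contents and the rewrite behaviour are illustrative; the full method additionally relies on GPT-4V detection of generated content and guidance during the diffusion process.

```python
# Hedged sketch: name blocking as a pre-generation filter. The blocklist and the
# "generic superhero" rewrite are placeholders for illustration only.
import re

PROTECTED_NAMES = ["spider-man", "iron man", "superman"]   # example IPs from the paper

def block_protected_names(prompt: str):
    """Returns (was_blocked, possibly_rewritten_prompt)."""
    lowered = prompt.lower()
    hits = [name for name in PROTECTED_NAMES if name in lowered]
    if not hits:
        return False, prompt
    cleaned = prompt
    for name in hits:
        cleaned = re.sub(re.escape(name), "a generic superhero", cleaned, flags=re.IGNORECASE)
    return True, cleaned

blocked, safe_prompt = block_protected_names("Spider-Man swinging between skyscrapers at night")
print(blocked, "->", safe_prompt)
```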
2406.04542 Report M&M VTO: Multi-Garment Virtual Try-On and Editing Luyang Zhu, Yingwei Li, Nan Liu, Hao Peng, Dawei Yang, Ira Kemelmacher-Shlizerman We present M&M VTO, a mix and match virtual try-on method that takes as input multiple garment images, text description for garment layout and an image of a person. An example input includes: an image of a shirt, an image of a pair of pants, "rolled sleeves, shirt tucked in", and an image of a person. The output is a visualization of how those garments (in the desired layout) would look like on the given person. Key contributions of our method are: 1) a single stage diffusion based model, with no super resolution cascading, that allows to mix and match multiple garments at 1024x512 resolution preserving and warping intricate garment details, 2) architecture design (VTO UNet Diffusion Transformer) to disentangle denoising from person specific features, allowing for a highly effective finetuning strategy for identity preservation (6MB model per individual vs 4GB achieved with, e.g., dreambooth finetuning); solving a common identity loss problem in current virtual try-on methods, 3) layout control for multiple garments via text inputs specifically finetuned over PaLI-3 for virtual try-on task. Experimental results indicate that M&M VTO achieves state-of-the-art performance both qualitatively and quantitatively, as well as opens up new opportunities for virtual try-on via language-guided and multi-garment try-on. This paper introduces M&M VTO, a single-stage diffusion-based virtual try-on method for mixing and matching multiple garments with layout control. M&M VTO addresses limitations in existing VTO methods, such as preserving intricate garment details, maintaining person identity, and handling multiple garments with layout variations. M&M VTO uses a single-stage diffusion model with progressive training to synthesize high-resolution images. It employs a VTO UNet Diffusion Transformer architecture to disentangle person features for efficient finetuning and leverages a finetuned PaLI-3 model for layout control. M&M VTO outperforms state-of-the-art methods in preserving garment details and layouts, both qualitatively and quantitatively. The proposed method allows for layout control using textual descriptions, enabling edits like tucking or rolling up garments. Efficient finetuning on person features in M&M VTO preserves individual identity without overfitting to specific clothing items. M&M VTO faces challenges with uncommon garment combinations and layout editing that requires inpainting unseen areas. The model does not explicitly incorporate size information for a perfect fit. virtual try-on, diffusion models, image synthesis, layout control, person identity preservation
2406.04343 Report Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, Andrea Vedaldi In this paper, we propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add additional layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets like NYU it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that use multiple views as input. Code, models, demo, and more results are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/. This paper introduces Flash3D, an efficient and generalizable method for reconstructing 3D scenes and synthesizing novel views from a single image using a feed-forward network and Gaussian Splatting. Current methods for monocular scene reconstruction are often computationally expensive, limited in generalization ability, or rely on iterative optimization. Flash3D addresses these limitations by leveraging the efficiency of Gaussian Splatting and the generalization capability of a pre-trained monocular depth estimation model. Flash3D extends a foundation model for monocular depth estimation by predicting multiple layers of 3D Gaussians for each pixel. The first layer captures visible surfaces guided by the depth estimate, while subsequent layers model occluded and truncated regions. This multi-Gaussian representation, coupled with image padding to capture out-of-view regions, enables the model to reconstruct complete scenes. Flash3D achieves state-of-the-art novel view synthesis accuracy on RealEstate10k, outperforming methods specifically designed for single-view scene reconstruction. The model demonstrates strong cross-domain generalization, achieving state-of-the-art accuracy on NYU and KITTI datasets without being trained on them. Flash3D exhibits superior performance in view extrapolation compared to existing two-view methods, indicating its ability to effectively model unseen areas. As a deterministic, regressive model, Flash3D may produce blurry renderings in regions with ambiguity, such as large baselines, occlusions, or backward camera motion. The non-negativity constraint on depth offsets can limit the model's ability to recover scene structure closer to the camera than the initial depth estimate, making it sensitive to failures in the pre-trained depth estimator. 3d scene reconstruction, novel view synthesis, monocular vision, gaussian splatting, deep learning
2406.04342 Report Learning 1D Causal Visual Representation with De-focus Attention Networks Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian, Xuan Luo, Gao Huang, Hongsheng Li, Yu Qiao, Jie Zhou, Jifeng Dai Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize 1D causal modeling. This distinction poses significant challenges in constructing unified multi-modal models. This paper explores the feasibility of representing images using 1D causal modeling. We identify an "over-focus" issue in existing 1D causal vision models, where attention overly concentrates on a small proportion of visual tokens. The issue of "over-focus" hinders the model's ability to extract diverse visual features and to receive effective gradients for optimization. To address this, we propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns. During training, large and scheduled drop path rates, and an auxiliary loss on globally pooled features for global understanding tasks are introduced. These two strategies encourage the model to attend to a broader range of tokens and enhance network optimization. Extensive experiments validate the efficacy of our approach, demonstrating that 1D causal visual representation can perform comparably to 2D non-causal representation in tasks such as global perception, dense prediction, and multi-modal understanding. Code is released at https://github.com/OpenGVLab/De-focus-Attention-Networks. This paper proposes De-focus Attention Networks to enhance 1D causal visual modeling by addressing the "over-focus" issue, where attention concentrates excessively on a few tokens, hindering diverse feature extraction and gradient flow. Bridging the performance gap between 1D causal and 2D non-causal vision models is crucial for constructing unified and effective multi-modal models. The authors introduce De-focus Attention with learnable bandpass filters, incorporating learnable exponential spatial decay and relative position embeddings to create diverse attention patterns. They also employ large scheduled drop path rates and an auxiliary loss on globally pooled features for global understanding tasks to enhance network optimization. De-focus Attention Networks achieve comparable or even superior performance to 2D non-causal ViTs on ImageNet classification, object detection, and image-text retrieval. The proposed method consistently improves performance across various architectures like ViT, Mamba, and RetNet. The effectiveness of learnable bandpass filters, large drop path rates, and the auxiliary loss is validated through ablation studies. The work primarily focuses on image-based tasks; further exploration is needed for other visual modalities. Future research can investigate the optimal integration of De-focus Attention Networks with existing multi-modal models. computer vision, causal modeling, vision transformers, state space models, multi-modal learning
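De-focus Attention as summarized above relies on learnable decay/bandpass patterns so that 1D causal attention does not collapse onto a few tokens. The sketch below is an illustrative, simplified variant (assuming plain softmax attention with a learnable per-head exponential spatial decay), not the paper's exact filter design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecayedCausalAttention(nn.Module):
    """Causal self-attention with a learnable per-head exponential spatial decay.
    A simplified stand-in for the 'de-focus' idea: different heads learn different
    decay rates, so attention cannot all concentrate on the same few tokens."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.h = num_heads
        self.dk = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One learnable decay rate per head (kept positive via softplus).
        self.log_decay = nn.Parameter(torch.linspace(-3.0, 0.0, num_heads))

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        logits = q @ k.transpose(-2, -1) / self.dk ** 0.5            # (b, h, n, n)
        idx = torch.arange(n)
        dist = (idx[:, None] - idx[None, :]).clamp(min=0).float()    # causal distance
        decay = F.softplus(self.log_decay).view(1, self.h, 1, 1)     # per-head rate
        logits = logits - decay * dist                               # farther => weaker
        logits = logits.masked_fill(idx[None, :] > idx[:, None], float("-inf"))
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

if __name__ == "__main__":
    x = torch.randn(2, 16, 64)   # 16 patch tokens in 1D causal order
    print(DecayedCausalAttention(64, 8)(x).shape)
```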
2406.04341 Report Interpreting the Second-Order Effects of Neurons in CLIP Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons' function in CLIP. Therefore, we present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. We find that these effects are highly selective: for each neuron, the effect is significant for <2% of the images. Moreover, each effect can be approximated by a single direction in the text-image space of CLIP. We describe neurons by decomposing these directions into sparse sets of text representations. The sets reveal polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars). Exploiting this neuron polysemy, we mass-produce "semantic" adversarial examples by generating images with concepts spuriously correlated to the incorrect class. Additionally, we use the second-order effects for zero-shot segmentation and attribute discovery in images. Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities. This paper presents an interpretability method for understanding the function of individual neurons in CLIP by describing them using text, focusing on their second-order effects (contributions flowing through subsequent attention heads to the output). Interpreting neurons in CLIP is crucial for understanding model limitations, enabling interventions, and potentially uncovering new capabilities. The authors introduce a 'second-order lens', analyzing the effect of a neuron's activation flowing through later attention heads to the output. They decompose these second-order effects into sparse sets of text representations, revealing the polysemantic nature of neurons. Neurons in later CLIP layers have more significant second-order effects. Each neuron's second-order effect is highly selective, significantly impacting only a small subset of images. Neurons exhibit polysemantic behavior, responding to multiple, often unrelated concepts. The method does not fully analyze the effects of neurons on attention map patterns (queries and keys). Mutual effects and dependencies between neurons within and across layers are not explored. interpretability, clip, neuron analysis, adversarial examples, zero-shot segmentation
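The "second-order lens" of 2406.04341 approximates each neuron's effect by a single direction and then explains that direction with a sparse set of text representations. The toy sketch below shows one generic way to obtain such a sparse decomposition, greedy matching pursuit over a random dictionary; the paper's actual decomposition procedure and text bank are not reproduced here.

```python
import torch

def sparse_text_decomposition(direction, text_bank, k=3):
    """Greedily pick k text embeddings whose span best explains `direction`
    (matching-pursuit style). `text_bank` is (num_texts, dim) with unit-norm rows."""
    residual = direction.clone()
    chosen = []
    for _ in range(k):
        scores = text_bank @ residual                   # similarity to the residual
        idx = int(scores.abs().argmax())
        chosen.append(idx)
        residual = residual - scores[idx] * text_bank[idx]   # remove explained part
    return chosen, residual.norm().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, num_texts = 64, 200
    bank = torch.nn.functional.normalize(torch.randn(num_texts, dim), dim=-1)
    # Toy "second-order direction": a mix of two dictionary entries plus noise.
    direction = 0.8 * bank[3] + 0.6 * bank[17] + 0.05 * torch.randn(dim)
    picks, res = sparse_text_decomposition(direction, bank, k=3)
    print("selected text indices:", picks, "residual norm: %.3f" % res)
```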
2406.04338 Report Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, Yueqi Duan In recent years, there has been rapid development in 3D generation models, opening up new possibilities for applications such as simulating the dynamic movements of 3D objects and customizing their behaviors. However, current 3D generative models tend to focus only on surface features such as color and shape, neglecting the inherent physical properties that govern the behavior of objects in the real world. To accurately simulate physics-aligned dynamics, it is essential to predict the physical properties of materials and incorporate them into the behavior prediction process. Nonetheless, predicting the diverse materials of real-world objects is still challenging due to the complex nature of their physical attributes. In this paper, we propose Physics3D, a novel method for learning various physical properties of 3D objects through a video diffusion model. Our approach involves designing a highly generalizable physical simulation system based on a viscoelastic material model, which enables us to simulate a wide range of materials with high-fidelity capabilities. Moreover, we distill the physical priors from a video diffusion model that contains more understanding of realistic object materials. Extensive experiments demonstrate the effectiveness of our method with both elastic and plastic materials. Physics3D shows great potential for bridging the gap between the physical world and virtual neural space, providing a better integration and application of realistic physical principles in virtual environments. Project page: https://liuff19.github.io/Physics3D. Presents Physics3D, a novel framework for learning various physical properties of 3D objects from video diffusion models, enabling the simulation of diverse materials with both elasticity and viscosity. Current 3D generative models often prioritize surface features over inherent physical properties, limiting their ability to realistically simulate object dynamics. Employs a viscoelastic Material Point Method (MPM) with elastoplastic and viscoelastic components, and leverages a video generation model (Stable Video Diffusion) to distill physical priors for optimization. Successfully simulates complex textured objects with realistic and physically plausible movements. Outperforms baselines (PhysDreamer, PhysGaussian, DreamGaussian4D) in terms of realism, damping, and motion consistency, as demonstrated by space-time slice visualizations and video quality metrics. User study confirms Physics3D generates significantly more preferred results regarding quality, realism, and fluency. Current method requires manual intervention to define movable objects and filling ranges in complex environments. Future work aims to automate these processes using large segmentation models and enhance the physics system modeling. 3d dynamic generation, physical simulation, viscoelastic material point method (mpm), video diffusion model, score distillation sampling (sds)
2406.04337 Report Coherent Zero-Shot Visual Instruction Generation Quynh Phung, Songwei Ge, Jia-Bin Huang Despite the advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable challenge. This paper introduces a simple, training-free framework to tackle the issues, capitalizing on the advancements in diffusion models and large language models (LLMs). Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing and maintain consistency and accuracy throughout the instruction sequence. We validate the effectiveness by testing multi-step instructions and comparing the text alignment and consistency with several baselines. Our experiments show that our approach can visualize coherent and visually pleasing instructions This paper introduces a training-free framework for generating coherent visual instructions from textual instructions, leveraging pre-trained text-to-image diffusion models and large language models (LLMs). Generating visual instructions is crucial for intuitive understanding and overcoming language barriers. Existing text-to-image methods struggle with maintaining consistency and accurately depicting state transitions across multiple steps. The framework employs a two-stage process: 1) In-context planning with LLMs to re-caption instructions into descriptive texts, capturing object states and relationships. 2) Adaptive feature-sharing for image generation, using local region constraints from segmentation models and global state similarity constraints from LLMs. Re-captioning instructions as descriptive text significantly improves coherence and accuracy compared to using raw instructions. Adaptive feature sharing with local and global constraints effectively balances object consistency and necessary variations across steps. The method generates high-quality visual instructions comparable to fine-tuned models, demonstrating the potential of training-free approaches. The generation quality is limited by the capabilities of current text-to-image models, sometimes failing to accurately depict specific objects or attributes. Future work can explore incorporating temporal reasoning and fine-grained control over object transformations. visual instruction generation, text-to-image synthesis, diffusion models, large language models, zero-shot learning
2406.04333 Report BitsFusion: 1.99 bits Weight Quantization of Diffusion Model Yang Sui, Yanyu Li, Anil Kag, Yerlan Idelbayev, Junli Cao, Ju Hu, Dhritiman Sagar, Bo Yuan, Sergey Tulyakov, Jian Ren Diffusion-based image generation models have achieved great success in recent years by showing the capability of synthesizing high-quality content. However, these models contain a huge number of parameters, resulting in a significantly large model size. Saving and transferring them is a major bottleneck for various applications, especially those running on resource-constrained devices. In this work, we develop a novel weight quantization method that quantizes the UNet from Stable Diffusion v1.5 to 1.99 bits, achieving a model with 7.9X smaller size while exhibiting even better generation quality than the original one. Our approach includes several novel techniques, such as assigning optimal bits to each layer, initializing the quantized model for better performance, and improving the training strategy to dramatically reduce quantization error. Furthermore, we extensively evaluate our quantized model across various benchmark datasets and through human evaluation to demonstrate its superior generation quality. BitsFusion, a novel weight quantization framework that compresses the weights of UNet from SD-v1.5 to 1.99 bits, achieving a 7.9x smaller model size while maintaining or even improving generation quality. Large-scale diffusion models are difficult to store and transfer due to their size, especially for resource-constrained devices. Quantization offers a solution by reducing model size without significant architectural changes. The method involves per-layer quantization error analysis using MSE and CLIP score to develop a mixed-precision strategy. It also introduces techniques like time embedding pre-computing, balanced integer initialization, and alternating optimization for scaling factors. Training employs a two-stage pipeline with distillation and noise prediction, incorporating quantization error-aware time step sampling. The 1.99-bit quantized model consistently outperforms the full-precision SD-v1.5 across various benchmark datasets and evaluation metrics (TIFA, GenEval, CLIP score). Human evaluation on PartiPrompts shows user preference for BitsFusion over SD-v1.5. BitsFusion outperforms other quantization methods like LSQ, Q-Diffusion, EfficientDM, and Apple-MBP in CLIP score. The compression of VAE and CLIP text encoder is not explored in this work. The weight quantization techniques could be extended to activation quantization. diffusion models, quantization, stable diffusion, model compression, image generation
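BitsFusion's recipe includes per-layer error analysis to assign bit-widths. As a rough illustration only, the sketch below quantizes each layer with uniform symmetric quantization and picks the smallest bit-width meeting an MSE budget; the thresholds, layers, and policy are hypothetical and far simpler than the paper's mixed-precision strategy, initialization, and two-stage training.

```python
import torch

def quantize_weights(w, bits):
    """Uniform symmetric quantization to `bits` bits (bits >= 2 in this sketch);
    returns the dequantized tensor and the quantization MSE."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_hat = q * scale
    return w_hat, torch.mean((w - w_hat) ** 2).item()

def mixed_precision_plan(layers, candidate_bits=(2, 3, 4), max_mse=2e-3):
    """Toy policy: give each layer the smallest bit-width whose MSE is below
    `max_mse`, falling back to the largest candidate; report the average width."""
    plan = {}
    for name, w in layers.items():
        plan[name] = candidate_bits[-1]
        for b in candidate_bits:
            if quantize_weights(w, b)[1] <= max_mse:
                plan[name] = b
                break
    return plan, sum(plan.values()) / len(plan)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Layers with different dynamic ranges tolerate different bit-widths.
    layers = {f"block{i}.weight": (0.1 * (i + 1)) * torch.randn(64, 64) for i in range(4)}
    plan, avg_bits = mixed_precision_plan(layers)
    print(plan, "average bit-width: %.2f" % avg_bits)
```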
2406.04332 Report Coarse-To-Fine Tensor Trains for Compact Visual Representations Sebastian Loeschcke, Dan Wang, Christian Leth-Espensen, Serge Belongie, Michael J. Kastoryano, Sagie Benaim The ability to learn compact, high-quality, and easy-to-optimize representations for visual data is paramount to many applications such as novel view synthesis and 3D reconstruction. Recent work has shown substantial success in using tensor networks to design such compact and high-quality representations. However, the ability to optimize tensor-based representations, and in particular, the highly compact tensor train representation, is still lacking. This has prevented practitioners from deploying the full potential of tensor networks for visual data. To this end, we propose 'Prolongation Upsampling Tensor Train (PuTT)', a novel method for learning tensor train representations in a coarse-to-fine manner. Our method involves the prolonging or `upsampling' of a learned tensor train representation, creating a sequence of 'coarse-to-fine' tensor trains that are incrementally refined. We evaluate our representation along three axes: (1). compression, (2). denoising capability, and (3). image completion capability. To assess these axes, we consider the tasks of image fitting, 3D fitting, and novel view synthesis, where our method shows an improved performance compared to state-of-the-art tensor-based methods. For full results see our project webpage: https://sebulo.github.io/PuTT_website/ PuTT is a coarse-to-fine tensor train representation that learns compact visual representations through incremental refinement, surpassing previous tensor-based methods in compression, denoising, and handling incomplete data. Existing tensor-based representations struggle with optimization, getting trapped in local minima and failing to utilize the full compression potential of tensor trains, especially with noisy or incomplete data. Starting with a low-resolution representation, PuTT iteratively upsamples learned tensor trains using a prolongation operator and TT-SVD for rank control, refining the representation in a coarse-to-fine manner. PuTT achieves better compression ratios and higher PSNR/SSIM scores than CP, Tucker, and VM decompositions on 2D and 3D data fitting. It excels in denoising, outperforming baselines across varying noise levels and exhibiting superior visual quality. PuTT effectively handles incomplete data, achieving high PSNR/SSIM even with 99% data missing. Current implementation of TensoRF’s “shrinkage” process is not compatible with QTT. PuTT is not specifically designed as a generative model and is not as effective for tasks like image inpainting over large areas. tensor networks, tensor train, quantized tensor train, coarse-to-fine learning, visual representation
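PuTT builds on the tensor train format. For readers unfamiliar with that representation, the sketch below implements the classical TT-SVD factorization (not the paper's coarse-to-fine prolongation training), showing how a small tensor is compressed into a chain of 3-way cores with capped ranks; the rank cap is what bounds the storage.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Classical TT-SVD: factor an n1 x ... x nd tensor into 3-way cores
    G_k of shape (r_{k-1}, n_k, r_k) with all ranks capped at `max_rank`."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the cores back into a full tensor (to check the error)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

if __name__ == "__main__":
    # A smooth 4-way tensor (e.g. a 16x16 image reshaped to 4x4x4x4) compresses well.
    x = np.fromfunction(lambda a, b, c, d: np.sin(0.3 * (a + b)) * np.cos(0.2 * (c + d)),
                        (4, 4, 4, 4))
    cores = tt_svd(x, max_rank=3)
    err = np.linalg.norm(x - tt_reconstruct(cores)) / np.linalg.norm(x)
    print("relative reconstruction error: %.2e" % err)
    print("core shapes:", [c.shape for c in cores])
```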
2406.04330 Report Parameter-Inverted Image Pyramid Networks Xizhou Zhu, Xue Yang, Zhaokai Wang, Hao Li, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai Image pyramids are commonly used in modern computer vision tasks to obtain multi-scale features for precise understanding of images. However, image pyramids process multiple resolutions of images using the same large-scale model, which requires significant computational cost. To overcome this issue, we propose a novel network architecture known as the Parameter-Inverted Image Pyramid Networks (PIIP). Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance. Specifically, the input to PIIP is a set of multi-scale images, where higher resolution images are processed by smaller networks. We further propose a feature interaction mechanism to allow features of different resolutions to complement each other and effectively integrate information from different spatial scales. Extensive experiments demonstrate that the PIIP achieves superior performance in tasks such as object detection, segmentation, and image classification, compared to traditional image pyramid methods and single-branch networks, while reducing computational cost. Notably, when applying our method on a large-scale vision foundation model InternViT-6B, we improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation. These results validate the effectiveness of the PIIP approach and provide a new technical direction for future vision computing tasks. Our code and models are available at https://github.com/OpenGVLab/PIIP. This paper introduces Parameter-Inverted Image Pyramid Networks (PIIP), a novel architecture that enhances multi-scale representation in vision backbones while improving computational efficiency. Traditional image pyramids, while effective, impose significant computational overhead by processing images at multiple resolutions with the same large-scale model. PIIP addresses this challenge by using a parameter-inverted design. PIIP employs a multi-branch structure with cross-branch interactions and branch merging. Smaller models handle higher-resolution images, while larger models process lower-resolution images. Feature interaction modules facilitate information exchange between branches. PIIP achieves superior performance compared to traditional image pyramids and single-branch networks in object detection, instance segmentation, semantic segmentation, and image classification tasks while reducing computational costs. When applied to the large-scale InternViT-6B model, PIIP improves performance by 1%-2% on detection and segmentation tasks while using only 40%-60% of the original computation. Extensive ablations provide design guidelines for PIIP, such as prioritizing resolution increase in the largest image branch and limiting the largest model size. Current experiments focus on adapting PIIP to existing pre-trained models; future work will explore from-scratch pre-training with PIIP. The interaction mechanism between branches can be further improved by incorporating more advanced attention mechanisms. image pyramid, multi-scale representation learning, vision transformer, computational efficiency, object detection, instance segmentation, semantic segmentation, image classification
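The core PIIP idea, small networks on high-resolution inputs and large networks on low-resolution inputs with cross-branch interaction, can be pictured with a toy two-branch module. The widths, the bilinear resizing, and the 1x1-conv fusion below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """A tiny conv backbone; `width` controls the parameter count."""
    def __init__(self, width):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, width, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class ParameterInvertedPyramid(nn.Module):
    """Small branch on the high-res image, large branch on the low-res image,
    then the two feature maps are resized to a common grid and fused."""
    def __init__(self, small_width=32, large_width=128, fused_dim=64):
        super().__init__()
        self.small = Branch(small_width)   # processes the high-resolution input
        self.large = Branch(large_width)   # processes the low-resolution input
        self.fuse = nn.Conv2d(small_width + large_width, fused_dim, 1)

    def forward(self, image):
        hi = image                                               # full resolution
        lo = F.interpolate(image, scale_factor=0.5, mode="bilinear",
                           align_corners=False)                  # half resolution
        f_hi = self.small(hi)
        f_lo = self.large(lo)
        f_lo = F.interpolate(f_lo, size=f_hi.shape[-2:], mode="bilinear",
                             align_corners=False)                # simple interaction
        return self.fuse(torch.cat([f_hi, f_lo], dim=1))

if __name__ == "__main__":
    feats = ParameterInvertedPyramid()(torch.randn(1, 3, 256, 256))
    print(feats.shape)   # fused multi-scale feature map
```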
2406.04325 Report ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating strategy. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reached SOTA performance on three advancing video benchmarks. To achieve this, taking aside the non-scalable costly human annotators, we find using GPT4V to caption video with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporal-confused results. We argue the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) Inter-frame precise temporal change understanding. 2) Intra-frame detailed content description. 3) Frame-number scalability for arbitrary-length videos. To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos with arbitrary resolution, aspect ratios, and length. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories, and the resulting captions encompass rich world knowledge, object attributes, camera movements, and crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos... This paper introduces ShareGPT4Video, a dataset of 40K video-caption pairs with detailed temporal descriptions generated using GPT4V, and ShareCaptioner-Video, a model fine-tuned on this dataset for efficient high-quality video captioning. Existing video caption datasets often lack detailed temporal descriptions, limiting the development of large video-language models (LVLMs) and text-to-video models (T2VMs). This work aims to address this gap by providing high-quality, temporally rich video captions. The authors develop a Differential Sliding-Window Captioning (DiffSW) strategy that leverages GPT4V to generate detailed descriptions of changes between consecutive keyframes. These differential captions are then summarized into a comprehensive video caption using GPT4. This strategy ensures temporal consistency and detailed content description. ShareGPT4Video, containing 40K high-quality video-caption pairs, significantly improves the performance of existing LVLMs like VideoLLaVA and LLaMA-VID on benchmarks like VideoBench, MVBench, and TempCompass. ShareCaptioner-Video, trained on ShareGPT4Video, enables the efficient generation of high-quality captions for a larger dataset of 4.8M videos, totaling 3000 hours. T2VMs trained on the detailed captions generated by ShareCaptioner-Video demonstrate improved control over semantic content and camera movement in video generation. 
The current pipeline does not incorporate audio information, limiting its applicability to conversational scenarios. The dataset relies on videos from existing sources and may contain human faces, requiring users to adhere to the original licenses. video captioning, large video-language models, text-to-video generation, multi-modal learning, gpt4v
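The Differential Sliding-Window Captioning strategy described above can be summarized as: caption the first keyframe fully, describe only the changes for each subsequent keyframe pair, then summarize everything into one video caption. A schematic sketch follows; `describe_frame`, `describe_change`, and `summarize` are hypothetical placeholders standing in for the GPT4V/GPT-4 calls.

```python
from typing import Callable, List

def differential_captions(keyframes: List[str],
                          describe_frame: Callable[[str], str],
                          describe_change: Callable[[str, str], str],
                          summarize: Callable[[List[str]], str]) -> str:
    """Caption the first keyframe in full, then slide a two-frame window and
    describe only what changed between consecutive keyframes; finally fold the
    differential notes into one video caption. All three callables are
    hypothetical stand-ins for VLM/LLM requests."""
    notes = [f"Frame 1: {describe_frame(keyframes[0])}"]
    for i in range(1, len(keyframes)):
        notes.append(f"Frames {i}->{i + 1}: "
                     f"{describe_change(keyframes[i - 1], keyframes[i])}")
    return summarize(notes)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model access.
    frames = ["kf_000.jpg", "kf_001.jpg", "kf_002.jpg"]
    print(differential_captions(
        frames,
        describe_frame=lambda f: f"detailed description of {f}",
        describe_change=lambda a, b: f"what changed from {a} to {b}",
        summarize=lambda notes: " | ".join(notes)))
```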
2406.04324 Report SF-V: Single Forward Video Generation Model Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained video diffusion models. We show that, through the adversarial training, the multi-step video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform a single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around 23x speedup compared with SVD and 6x speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. More visualization results are made publicly available at https://snap-research.github.io/SF-V. This paper presents the first single-step image-to-video generation model based on a fine-tuned pre-trained video diffusion model, significantly reducing computational cost while maintaining quality. Video diffusion models, while powerful, suffer from high computational costs due to multi-step denoising processes, hindering their wider deployment. The authors leverage adversarial training on the latent space of a pre-trained Stable Video Diffusion (SVD) model. They introduce a discriminator with spatial and temporal heads to enhance image quality and motion consistency, respectively. The model achieves comparable generation quality to SVD with 16 denoising steps, leading to a ~23x speedup. It outperforms existing few-step video generation methods like AnimateLCM in both quality and speed. Ablation studies demonstrate the importance of both spatial and temporal discriminator heads, as well as the impact of noise distribution for optimal performance. While single-step denoising is achieved, other components like the temporal VAE decoder still contribute to the overall runtime. Future work includes accelerating these components for a truly real-time video generation system. video generation, diffusion models, adversarial training, single-step generation, latent space
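SF-V's discriminator is described as having separate spatial and temporal heads. The toy module below shows that split, a per-frame realism head and a cross-frame motion head over a shared backbone; the backbone, widths, and tensor shapes are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class SpatioTemporalDiscriminator(nn.Module):
    """Toy video discriminator: a shared 2D conv backbone embeds each frame,
    a linear head scores frames individually (spatial), and a 1D conv over the
    time axis scores the whole clip (temporal)."""
    def __init__(self, channels=4, dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(channels, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # (B*T, dim)
        )
        self.spatial_head = nn.Linear(dim, 1)                 # realism per frame
        self.temporal_head = nn.Sequential(                   # motion consistency
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(dim, 1),
        )

    def forward(self, video_latents):
        b, t, c, h, w = video_latents.shape
        feats = self.backbone(video_latents.reshape(b * t, c, h, w))   # (B*T, dim)
        spatial_logits = self.spatial_head(feats).view(b, t)           # per frame
        temporal_logits = self.temporal_head(
            feats.view(b, t, -1).transpose(1, 2))                      # (B, 1)
        return spatial_logits, temporal_logits

if __name__ == "__main__":
    d = SpatioTemporalDiscriminator()
    spatial, temporal = d(torch.randn(2, 8, 4, 32, 32))   # 8-frame latent videos
    print(spatial.shape, temporal.shape)
```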
2406.04322 Report DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data Qihao Liu, Yi Zhang, Song Bai, Adam Kortylewski, Alan Yuille We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned `in-the-wild' 3D assets, mitigating the key challenge (i.e., data scarcity) in large-scale 3D generation. In particular, DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) A novel learning framework where noisy data are filtered and aligned automatically during the training process. Specifically, after an initial warm-up phase using a small set of clean data, an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation that is achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a prompt input, our model generates high-quality, high-resolution, realistic, and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects, for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion. The code and models are available for research purposes at: https://github.com/qihao067/direct3d. DIRECT-3D, a diffusion-based 3D generative model for high-quality 3D asset creation from text prompts, trained on noisy and unaligned 'in-the-wild' 3D assets. Addresses the challenge of data scarcity in large-scale 3D generation by utilizing readily available, albeit noisy, 'in-the-wild' 3D data, overcoming limitations of previous methods reliant on clean, aligned, and limited datasets. Employs a tri-plane diffusion model with two key innovations: 1) Iterative optimization during training for automatic data cleaning and alignment based on conditional density. 2) Disentanglement of object geometry and color features using separate conditional diffusion models optimized hierarchically. Achieves state-of-the-art performance in single-class generation, outperforming previous methods by a large margin on standard benchmarks. Exhibits superior performance in text-to-3D generation compared to previous work (Shap-E), demonstrating higher quality, detail, complexity, and realism as per user studies. Serves as an effective 3D geometry prior, significantly improving the consistency of 2D-lifting methods like DreamFusion and mitigating issues like the Janus problem. Limited compositionality due to the nature of 3D datasets and model design, struggling with novel object combinations. Potential lack of realistic texture details due to the limitations of current large-scale 3D datasets primarily containing synthetic data. text-to-3d generation, diffusion models, neural radiance fields (nerf), 3d geometry prior, "in-the-wild 3d data"
2406.04321 Report VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 190K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets will be available at https://github.com/ZeyueT/VidMuse/. This paper introduces V2M, a large-scale dataset for video-to-music generation, and proposes VidMuse, a novel method that generates music aligned with video content using a long-short-term modeling approach. Video-to-music generation is a challenging task with increasing demand due to the growth of social media platforms. Existing datasets are limited in size, diversity, or focus on specific musical forms like MIDI. The authors construct V2M by collecting and meticulously filtering a large corpus of video-music pairs from YouTube and IMDb. VidMuse utilizes a visual encoder, a long-short-term visual module to capture global and local video features, a music token decoder, and an audio codec decoder to generate music. VidMuse outperforms existing models in objective metrics for audio quality, diversity, and audio-visual alignment on the V2M benchmark. Subjective user studies confirm that VidMuse generates music that is better aligned with videos and exhibits superior audio quality and musicality compared to baseline methods. Ablation studies demonstrate the efficacy of the long-short-term modeling approach and justify the choice of hyperparameters in VidMuse. The current codec used in VidMuse limits the audio sampling rate and introduces reconstruction loss. Training large models like VidMuse requires substantial computational resources. video-to-music generation, music generation, multi-modal learning, deep learning, dataset
2406.04314 Report Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, Liang Zheng Recently, Direct Preference Optimization (DPO) has extended its success from aligning large language models (LLMs) to aligning text-to-image diffusion models with human preferences. Unlike most existing DPO methods that assume all diffusion steps share a consistent preference order with the final generated images, we argue that this assumption neglects step-specific denoising performance and that preference labels should be tailored to each step's contribution. To address this limitation, we propose Step-aware Preference Optimization (SPO), a novel post-training approach that independently evaluates and adjusts the denoising performance at each step, using a step-aware preference model and a step-wise resampler to ensure accurate step-aware supervision. Specifically, at each denoising step, we sample a pool of images, find a suitable win-lose pair, and, most importantly, randomly select a single image from the pool to initialize the next denoising step. This step-wise resampler process ensures the next win-lose image pair comes from the same image, making the win-lose comparison independent of the previous step. To assess the preferences at each step, we train a separate step-aware preference model that can be applied to both noisy and clean images. Our experiments with Stable Diffusion v1.5 and SDXL demonstrate that SPO significantly outperforms the latest Diffusion-DPO in aligning generated images with complex, detailed prompts and enhancing aesthetics, while also achieving more than 20x times faster in training efficiency. Code and model: https://rockeycoss.github.io/spo.github.io/ This paper introduces Step-aware Preference Optimization (SPO), a novel post-training approach for aligning text-to-image diffusion models with human preferences by independently evaluating and adjusting the denoising performance at each step. Existing Direct Preference Optimization (DPO) methods for diffusion models assume a consistent preference order across all diffusion steps, neglecting step-specific denoising performance and leading to misaligned supervision signals. SPO utilizes a step-aware preference model to assess the quality of denoised samples at each step and a step-wise resampler to ensure independent preference evaluation, removing trajectory-level dependency. SPO significantly outperforms state-of-the-art DPO methods, including Diffusion-DPO, D3PO, and DDPO, in aligning generated images with complex prompts and enhancing aesthetics, as evaluated by both AI feedback metrics and user studies. The step-wise resampler with random selection significantly improves performance, acting as effective trajectory augmentation. SPO achieves more than 20x faster training efficiency compared to Diffusion-DPO due to the use of more accurate step-aware preference labels. The step-aware preference model's performance degrades for noisy samples at very large timesteps. Future work includes exploring different step-aware preference model architectures and applying SPO to other diffusion-based generation tasks. diffusion models, text-to-image generation, preference learning, direct preference optimization, post-training
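The step-wise resampler in SPO can be pictured as a rollout loop: at each denoising step, sample a pool of candidates from the same parent latent, label a win/lose pair with a step-aware scorer, and continue from a randomly chosen pool member so consecutive comparisons stay independent. The sketch below mimics that loop with toy stand-ins for the denoiser and the preference model; it is not the paper's training code.

```python
import random
import torch

def spo_style_rollout(x_t, timesteps, denoise_step, step_score, pool_size=4):
    """Collect per-step (win, lose) pairs for preference training.
    `denoise_step(x, t)` and `step_score(x, t)` are toy stand-ins for one
    stochastic denoiser step and a step-aware preference model."""
    pairs = []
    for t in timesteps:
        # Sample a pool of candidate next-step latents from the same parent.
        pool = [denoise_step(x_t, t) for _ in range(pool_size)]
        scores = [step_score(x, t) for x in pool]
        win = pool[max(range(pool_size), key=scores.__getitem__)]
        lose = pool[min(range(pool_size), key=scores.__getitem__)]
        pairs.append((t, win, lose))
        # Key trick: continue from a *random* pool member, not the winner,
        # so the next comparison is independent of this step's preference.
        x_t = random.choice(pool)
    return pairs

if __name__ == "__main__":
    torch.manual_seed(0)
    random.seed(0)
    denoise_step = lambda x, t: x - 0.1 * x + 0.05 * torch.randn_like(x)
    step_score = lambda x, t: -float(x.pow(2).mean())   # toy "preference" score
    pairs = spo_style_rollout(torch.randn(1, 4, 8, 8), timesteps=range(10, 0, -1),
                              denoise_step=denoise_step, step_score=step_score)
    print(len(pairs), "win/lose pairs collected")
```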
2406.04312 Report ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, Zeynep Akata Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from "reward hacking" and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often compared to the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-α, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time. Code is available at https://github.com/ExplainableML/ReNO. Introduces ReNO, a novel approach that enhances Text-to-Image (T2I) models at inference by optimizing the initial noise based on human preference reward models. Existing T2I models struggle to accurately capture intricate details in complex prompts. While fine-tuning with reward objectives is promising, it suffers from 'reward hacking' and generalization issues. ReNO offers an efficient alternative by enhancing models at inference time. ReNO leverages distilled one-step T2I models to circumvent exploding/vanishing gradients during backpropagation. It optimizes the initial noise using a combination of reward models (HPSv2, PickScore, ImageReward, CLIP) for a limited number of iterations while regularizing the noise to prevent reward hacking. ReNO significantly improves performance on T2I-CompBench and GenEval, with gains of over 20% in some categories. User studies demonstrate ReNO-enhanced models are preferred nearly twice as often as SDXL and are on par with the proprietary SD3. ReNO outperforms competing multi-step models given the same computational budget, offering an efficient balance between performance and speed. Limitations in current reward models might hinder further performance improvements. ReNO increases the required VRAM during generation. text-to-image generation, noise optimization, reward models, one-step diffusion models, compositionality
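ReNO's core operation, optimizing the initial noise of a one-step generator against reward feedback, reduces to gradient ascent on the noise with a regularizer. The sketch below shows that loop with a toy generator and a toy reward; the regularization form and hyperparameters are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn

def optimize_noise(generator, reward_fn, noise, steps=50, lr=0.05, reg=0.01):
    """Gradient ascent on the initial noise of a one-step generator so that the
    generated sample scores higher under a differentiable reward, with a small
    penalty keeping the noise near a standard Gaussian."""
    noise = noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        img = generator(noise)
        loss = -reward_fn(img) + reg * noise.pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()

if __name__ == "__main__":
    torch.manual_seed(0)
    generator = nn.Sequential(nn.Conv2d(4, 3, 3, padding=1), nn.Tanh())  # toy one-step model
    reward_fn = lambda img: img.mean()          # toy stand-in for a preference reward
    z0 = torch.randn(1, 4, 32, 32)
    z_star = optimize_noise(generator, reward_fn, z0)
    print("reward before: %.3f  after: %.3f"
          % (reward_fn(generator(z0)).item(), reward_fn(generator(z_star)).item()))
```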
2406.04309 Report ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation Sergey Zakharov, Katherine Liu, Adrien Gaidon, Rares Ambrus The common trade-offs of state-of-the-art methods for multi-shape representation (a single model "packing" multiple objects) involve trading modeling accuracy against memory and storage. We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible and with low memory usage. Key to our approach is a recursive hierarchical formulation that exploits object self-similarity, leading to a highly compressed and efficient shape latent space. Thanks to the recursive formulation, our method supports spatial and global-to-local latent feature fusion without needing to initialize and maintain auxiliary data structures, while still allowing for continuous field queries to enable applications such as raytracing. In experiments on a set of diverse datasets, we provide compelling qualitative results and demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset. Proposes ReFiNe (Recursive Field Networks), a method to encode multiple shapes as neural fields into a single network, achieving high compression and reconstruction quality by recursively representing shapes and fusing features across different levels of detail. Addresses the limitations of current multi-shape representation techniques that compromise detail and accuracy for memory efficiency by enabling high-fidelity representation and compression of multiple shapes within a single network. Utilizes a recursive autoencoder to represent shapes as octrees, prunes unoccupied voxels, aggregates features spatially and hierarchically, and employs MLPs to decode features into SDF, SDF+RGB, or NeRF representations. Outperforms DeepSDF and Curriculum DeepSDF in reconstruction accuracy on Thingi32 and ShapeNet150 datasets while achieving comparable performance to ROAD with lower memory usage. Exhibits higher fidelity in reconstructing high-frequency details on the SRN Cars dataset compared to CodeNeRF and SRN. Demonstrates scalability by encoding the Google Scanned Objects dataset (1030 objects) and the RTMV dataset (40 scenes) with high compression rates and good reconstruction quality. Currently limited to representing bounded scenes. Future work includes extending to unbounded scenes and exploring 3D synthesis using diffusion models. neural fields, shape representation, compression, recursive networks, 3d reconstruction
2406.04303 Report Vision-LSTM: xLSTM as Generic Vision Backbone Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter Transformers are widely used as generic backbones in computer vision, despite initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaption of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as new generic backbone for computer vision architectures. The paper introduces Vision-LSTM (ViL), a novel backbone for computer vision tasks inspired by the success of xLSTM in language modeling. ViL adapts xLSTM's building blocks to vision by processing image patches in an alternating fashion, enabling efficient handling of non-sequential image data. ViL addresses the limitations of traditional Vision Transformers, particularly their quadratic computational complexity that makes them costly for high-resolution images. ViL's linear complexity makes it well-suited for tasks requiring high-resolution inputs, such as medical imaging and semantic segmentation. The paper explores various ViL block designs, focusing on multi-directional processing of patch tokens. The final architecture employs alternating mLSTM blocks, with odd blocks processing patches top-to-bottom and even blocks bottom-to-top. Experiments on ImageNet-1K compare different design choices and evaluate performance against existing architectures. ViL achieves competitive performance on ImageNet-1K classification, outperforming some heavily optimized ViT models, especially at smaller scales. ViL demonstrates robustness to different classification designs, indicating flexibility in adapting to various vision tasks. Despite lacking the inductive bias of convolutions, ViL exhibits competitive performance with CNN-based models like ConvNeXt. The paper acknowledges that hyperparameters for larger ViL models are not yet fully optimized due to the computational cost of training. The current implementation of ViL lacks custom CUDA kernels, which are expected to further improve its speed. computer vision, vision transformer, lstm, xlstm, image classification
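ViL's alternating scan order (odd blocks top-to-bottom, even blocks bottom-to-top over the patch-token sequence) is easy to prototype. The sketch below uses a plain nn.LSTM as a stand-in for the mLSTM blocks, since xLSTM blocks are not part of standard PyTorch; only the alternating-direction wiring is the point here.

```python
import torch
import torch.nn as nn

class AlternatingSequenceBackbone(nn.Module):
    """Stack of recurrent blocks over patch tokens: 1st, 3rd, ... blocks scan the
    sequence top-to-bottom, 2nd, 4th, ... blocks scan it bottom-to-top.
    nn.LSTM is only a stand-in for the mLSTM blocks used by ViL."""
    def __init__(self, dim=192, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.LSTM(dim, dim, batch_first=True) for _ in range(depth)])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, tokens):                            # tokens: (B, N, dim)
        x = tokens
        for i, (block, norm) in enumerate(zip(self.blocks, self.norms)):
            h = x if i % 2 == 0 else x.flip(dims=[1])     # even index = 2nd, 4th, ... block: reversed scan
            out, _ = block(norm(h))
            out = out if i % 2 == 0 else out.flip(dims=[1])
            x = x + out                                   # residual connection
        return x.mean(dim=1)                              # pooled representation

if __name__ == "__main__":
    patches = torch.randn(2, 196, 192)   # 14x14 patch tokens, flattened row-major
    print(AlternatingSequenceBackbone()(patches).shape)
```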
2406.04295 Report Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment Jiayi Guo, Junhao Zhao, Chunjiang Ge, Chaoqun Du, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. Recently, diffusion-driven TTA methods have demonstrated strong performance by using an unconditional diffusion model, which is also trained on the source domain to transform target data into synthetic data as a source domain projection. This allows the source model to make predictions without weight adaptation. In this paper, we argue that the domains of the source model and the synthetic data in diffusion-driven TTA methods are not aligned. To adapt the source model to the synthetic domain of the unconditional diffusion model, we introduce a Synthetic-Domain Alignment (SDA) framework to fine-tune the source model with synthetic data. Specifically, we first employ a conditional diffusion model to generate labeled samples, creating a synthetic dataset. Subsequently, we use the aforementioned unconditional diffusion model to add noise to and denoise each sample before fine-tuning. This process mitigates the potential domain gap between the conditional and unconditional models. Extensive experiments across various models and benchmarks demonstrate that SDA achieves superior domain alignment and consistently outperforms existing diffusion-driven TTA methods. Our code is available at https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment. This paper proposes Synthetic-Domain Alignment (SDA), a novel Test-Time Adaptation (TTA) framework aligning both source model and target data to a shared synthetic domain derived from a diffusion model. Existing TTA methods, whether adapting the model to the target domain or vice versa, struggle with real-world data limitations or domain gaps inherent to synthetic data. SDA aims to overcome these by finding a common ground for adaptation. SDA operates in two stages: 1) A labeled synthetic dataset is generated using a conditional diffusion model, then aligned to the target domain using an unconditional diffusion model. This dataset fine-tunes the source model. 2) Target data is projected into the synthetic domain using the same unconditional diffusion model, enabling the fine-tuned model to make predictions on now domain-aligned data. SDA consistently outperforms state-of-the-art diffusion-driven TTA methods on both ImageNet-C and ImageNet-W benchmarks across various model architectures. Visualization using Grad-CAM highlights SDA's superior domain alignment compared to methods relying solely on target data projection. Ablation studies confirm the importance of both synthetic data generation and alignment processes within SDA's framework. SDA, relying on diffusion models, inherits their current limitation of low test speed, requiring further research into faster sampling or distillation. The quality of synthetic data, crucial for SDA's effectiveness, is dependent on the capabilities of the generative diffusion models, an area under active development. test-time adaptation, diffusion models, domain alignment, synthetic data, robust image classification
2406.04277 Report VideoTetris: Towards Compositional Text-to-Video Generation Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing to enhance the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris VideoTetris, a novel diffusion-based framework enabling compositional text-to-video (T2V) generation by manipulating and composing attention maps of denoising networks spatially and temporally. Existing T2V models struggle with complex scenes involving multiple objects or dynamic changes in object numbers, especially in long video generation with compositional prompts. Introduces Spatio-Temporal Compositional Diffusion to precisely follow complex textual semantics. Employs Enhanced Video Data Preprocessing to enhance motion dynamics and prompt understanding. Proposes Reference Frame Attention to improve consistency in auto-regressive video generation. Achieves state-of-the-art quality in compositional video generation, accurately placing and maintaining multiple objects with distinct attributes. Generates high-quality long videos aligned with progressive compositional prompts, seamlessly integrating new characters and maintaining consistency. Outperforms existing methods in both qualitative and quantitative evaluations, including VBLIP-VQA, VUnidet, and CLIP-SIM. Current limitations in long video generation models impact the performance of long compositional videos. High computational cost and strong control information from ControlNet hinder object consistency and position control in transitions. text-to-video generation, diffusion models, compositional generation, long video generation, consistency regularization
2406.04264 Report MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs. MLVU, a new benchmark for evaluating long video understanding in Multimodal Large Language Models (MLLMs), is proposed, featuring long and diverse videos and a range of tasks. Evaluating the long-video understanding (LVU) performance of MLLMs is crucial yet challenging due to limitations in existing benchmarks, including insufficient video length, lack of diversity in video types and tasks, and inappropriateness for LVU evaluation. MLVU is constructed with diverse video lengths (3 min to 2+ hours) and genres (movies, surveillance, etc.) and includes 9 LVU-tailored tasks, categorized as holistic, single-detail, and multi-detail understanding, with both multi-choice and free-form generation formats. Long-video understanding remains challenging for existing MLLMs, with even the best model (GPT-4o) struggling with tasks demanding fine-grained understanding of long videos. A significant performance gap exists between open-source and proprietary models. Context length, image-understanding quality, and the choice of LLM backbone are identified as critical factors influencing LVU performance. MLVU could be extended to encompass tasks involving high-resolution videos or more specialized tasks like tracking and low-level processing. Potential copyright concerns with using copyrighted video material, despite efforts to minimize infringement. multimodal learning, long video understanding, benchmarking, large language models, video understanding
2406.04254 Report GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions Salvatore Esposito, Qingshan Xu, Kacper Kania, Charlie Hewitt, Octave Mariotti, Lohit Petikam, Julien Valentin, Arno Onken, Oisin Mac Aodha We introduce a new generative approach for synthesizing 3D geometry and images from single-view collections. Most existing approaches predict volumetric density to render multi-view consistent images. By employing volumetric rendering using neural radiance fields, they inherit a key limitation: the generated geometry is noisy and unconstrained, limiting the quality and utility of the output meshes. To address this issue, we propose GeoGen, a new SDF-based 3D generative model trained in an end-to-end manner. Initially, we reinterpret the volumetric density as a Signed Distance Function (SDF). This allows us to introduce useful priors to generate valid meshes. However, those priors prevent the generative model from learning details, limiting the applicability of the method to real-world scenarios. To alleviate that problem, we make the transformation learnable and constrain the rendered depth map to be consistent with the zero-level set of the SDF. Through the lens of adversarial training, we encourage the network to produce higher fidelity details on the output meshes. For evaluation, we introduce a synthetic dataset of human avatars captured from 360-degree camera angles, to overcome the challenges presented by real-world datasets, which often lack 3D consistency and do not cover all camera angles. Our experiments on multiple datasets show that GeoGen produces visually and quantitatively better geometry than the previous generative models based on neural radiance fields. This paper presents GeoGen, a new generative model for synthesizing 3D geometry and images from single-view image collections, addressing the limitations of existing neural radiance field-based methods that often produce noisy and unconstrained geometry. Generating high-quality 3D geometry from single-view images is crucial for various applications, including content creation, virtual reality, and animation, but existing methods struggle to produce accurate and detailed results. GeoGen utilizes a Signed Distance Function (SDF) network within a StyleGAN generative architecture, augmented with an SDF depth map consistency loss to improve geometric accuracy by aligning 3D points with the SDF's zero-level set. GeoGen generates visually and quantitatively better geometry than previous neural radiance field-based generative models, as demonstrated through experiments on FFHQ, ShapeNet Cars, and a new synthetic human head dataset. The proposed SDF depth map consistency loss effectively reduces geometric inaccuracies caused by volumetric integration, leading to more precise 3D reconstructions. A new synthetic human head dataset with 360-degree views is introduced, addressing the limitations of existing datasets like FFHQ and providing a valuable resource for training and evaluating 3D generative models. The reliance on posed images for training, necessitating pose estimation during preprocessing. The potential for increased computational load if the SDF consistency loss is extended to more points along each ray for further geometric refinement. generative models, 3d reconstruction, signed distance function, neural rendering, single-view reconstruction
2406.04251 Report Localized Gaussian Point Management Haosen Yang, Chenhao Zhang, Wenqing Wang, Marco Volino, Adrian Hilton, Li Zhang, Xiatian Zhu Point management is a critical component in optimizing 3D Gaussian Splatting (3DGS) models, as the point initiation (e.g., via structure from motion) is distributionally inappropriate. Typically, the Adaptive Density Control (ADC) algorithm is applied, leveraging view-averaged gradient magnitude thresholding for point densification, opacity thresholding for pruning, and regular all-points opacity reset. However, we reveal that this strategy is limited in tackling intricate/special image regions (e.g., transparent) as it is unable to identify all the 3D zones that require point densification, and lacking an appropriate mechanism to handle the ill-conditioned points with negative impacts (occlusion due to false high opacity). To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying those error-contributing zones in the highest demand for both point addition and geometry calibration. Zone identification is achieved by leveraging the underlying multiview geometry constraints, with the guidance of image rendering errors. We apply point densification in the identified zone, whilst resetting the opacity of those points residing in front of these regions so that a new opportunity is created to correct ill-conditioned points. Serving as a versatile plugin, LPM can be seamlessly integrated into existing 3D Gaussian Splatting models. Experimental evaluation across both static 3D and dynamic 4D scenes validate the efficacy of our LPM strategy in boosting a variety of existing 3DGS models both quantitatively and qualitatively. Notably, LPM improves both vanilla 3DGS and SpaceTimeGS to achieve state-of-the-art rendering quality while retaining real-time speeds, outperforming on challenging datasets such as Tanks & Temples and the Neural 3D Video Dataset. This paper introduces Localized Point Management (LPM), a novel point management approach for 3D Gaussian Splatting (3DGS) that improves scene representation and rendering quality. Existing point management techniques in 3DGS, like Adaptive Density Control (ADC), rely on global thresholds for point densification, which often overlook under-optimized points and lack a mechanism for handling ill-conditioned points leading to rendering errors. LPM leverages multiview geometry constraints and image rendering errors to identify error-contributing 3D zones. It then applies localized point manipulations, including point addition in under-populated areas and opacity reset for potentially ill-conditioned points to improve geometry. LPM achieves state-of-the-art results on challenging datasets like Tanks & Temples and DeepBlending, surpassing previous methods in rendering quality. On the Neural 3D Video Dataset, integrating LPM with SpaceTimeGS yields the best performance, effectively capturing subtle static and dynamic details. Ablation studies demonstrate the efficacy of individual components of LPM and its robustness to sparse training data. The current point densification method still relies on rules from 3DGS and may not be optimal, requiring further exploration. Future work could focus on extending LPM to address multi-resolution representations in 3DGS. 3d gaussian splatting, point management, novel view synthesis, multiview geometry, neural rendering
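As an illustration of error-guided zone identification, the sketch below flags points whose projections fall into high-error pixels in several views, mimicking the multiview consistency check; the thresholds, the precomputed error maps, and the projected coordinates are all assumed inputs, not the paper's actual pipeline.

```python
import numpy as np

def flag_error_zones(errors, proj_xy, err_thresh=0.1, min_views=2):
    """Flag points whose projections land in high-error pixels in several views.

    errors:   (V, H, W) per-view rendering error maps (e.g. per-pixel L1)
    proj_xy:  (V, P, 2) integer pixel coordinates of each of P points in each view
    Returns a boolean mask over the P points marking densification candidates.
    """
    V, H, W = errors.shape
    hits = np.zeros(proj_xy.shape[1], dtype=int)
    for v in range(V):
        x = np.clip(proj_xy[v, :, 0], 0, W - 1)
        y = np.clip(proj_xy[v, :, 1], 0, H - 1)
        hits += (errors[v, y, x] > err_thresh).astype(int)
    return hits >= min_views

# Toy usage with random error maps and projections.
rng = np.random.default_rng(0)
errors = rng.random((3, 64, 64)) * 0.2
proj = rng.integers(0, 64, size=(3, 1000, 2))
candidates = flag_error_zones(errors, proj)
print(int(candidates.sum()), "points flagged for densification / opacity reset")
```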
2406.04221 Report Matching Anything by Segmenting Anything Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, Fisher Yu The robust association of the same objects across video frames in complex scenes is crucial for many applications, especially Multiple Object Tracking (MOT). Current methods predominantly rely on labeled domain-specific video datasets, which limits the cross-domain generalization of learned similarity embeddings. We propose MASA, a novel method for robust instance association learning, capable of matching any objects within videos across diverse domains without tracking labels. Leveraging the rich object segmentation from the Segment Anything Model (SAM), MASA learns instance-level correspondence through exhaustive data transformations. We treat the SAM outputs as dense object region proposals and learn to match those regions from a vast image collection. We further design a universal MASA adapter which can work in tandem with foundational segmentation or detection models and enable them to track any detected objects. Those combinations present strong zero-shot tracking ability in complex domains. Extensive tests on multiple challenging MOT and MOTS benchmarks indicate that the proposed method, using only unlabeled static images, achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences, in zero-shot association. Project Page: https://matchinganything.github.io/ This paper introduces MASA, a novel method for learning generalizable instance association from unlabeled images, leveraging the Segment Anything Model (SAM) to enable zero-shot object tracking. Current object tracking methods rely heavily on labeled domain-specific video datasets, limiting their ability to generalize across domains. MASA addresses this by learning robust instance association from readily available unlabeled images, eliminating the need for costly video annotations. MASA leverages SAM's rich object segmentation to establish instance-level correspondence. By applying diverse data transformations to unlabeled images and their corresponding SAM outputs, MASA learns discriminative instance representations through contrastive learning. Additionally, a universal MASA adapter is proposed to enable existing open-world segmentation and detection models to track objects effectively. MASA achieves state-of-the-art zero-shot association performance on various MOT benchmarks, including TAO, BDD100K, and Youtube-VIS, surpassing methods trained with fully annotated in-domain video sequences. The proposed method exhibits strong cross-domain generalization, effectively tracking objects in diverse domains without requiring domain-specific training data. The introduction of the MASA adapter enables seamless integration with existing open-world segmentation and detection models, enhancing their capabilities for tracking any detected object. One limitation lies in addressing temporal inconsistencies in detection or segmentation results across video frames, leading to flickering effects in video visualization. Another limitation is the lack of a long-term memory system, making the model susceptible to failure in scenarios with severe occlusions. object tracking, zero-shot learning, instance association, segment anything model (sam), contrastive learning
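The core self-supervised signal can be sketched as a symmetric InfoNCE loss over region embeddings from two augmentations of the same image, where SAM-style proposals provide the matched pairs. The linear embedding head and random features below are stand-ins; this is an approximation of the idea, not MASA's code.

```python
import torch
import torch.nn.functional as F

def instance_info_nce(emb_view1, emb_view2, temperature=0.07):
    """Symmetric InfoNCE over region embeddings from two augmented views.

    emb_view1, emb_view2: (R, d) embeddings of the same R regions, where row i
    in both tensors comes from the same SAM-style region proposal.
    """
    z1 = F.normalize(emb_view1, dim=-1)
    z2 = F.normalize(emb_view2, dim=-1)
    logits = z1 @ z2.T / temperature          # (R, R) similarity matrix
    targets = torch.arange(z1.shape[0])       # matching regions lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage: a stand-in embedding head on random region features.
head = torch.nn.Linear(256, 128)
regions_view1 = torch.randn(32, 256)                         # 32 regions, view 1
regions_view2 = regions_view1 + 0.1 * torch.randn(32, 256)   # augmented view 2
loss = instance_info_nce(head(regions_view1), head(regions_view2))
loss.backward()
print(float(loss))
```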
2406.04103 Report Multistep Distillation of Diffusion Models via Moment Matching Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom We present a new method for making diffusion models faster to sample. The method distills many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. Our approach extends recently proposed one-step methods to the multi-step case, and provides a new perspective by interpreting these approaches in terms of moment matching. By using up to 8 sampling steps, we obtain distilled models that outperform not only their one-step versions but also their original many-step teacher models, obtaining new state-of-the-art results on the Imagenet dataset. We also show promising results on a large text-to-image model where we achieve fast generation of high resolution images directly in image space, without needing autoencoders or upsamplers. This paper presents Moment Matching Distillation, a new method to distill many-step diffusion models into faster few-step models. Diffusion models, while powerful generative models for various data modalities, suffer from slow sampling speed due to the iterative nature of the denoising process. This method addresses this limitation, making them more practical for real-world applications. The method matches conditional expectations of clean data given noisy data throughout the sampling process. It minimizes the L2 distance between moments of the teacher model and a distilled student model, either with an auxiliary denoising model in an alternating optimization scheme or by directly matching gradients of the teacher's loss in parameter space. Distilled models using 8 sampling steps achieve state-of-the-art results on ImageNet, even surpassing the original many-step teacher model. The method allows for fast generation of high-resolution images directly in image space for large text-to-image models, eliminating the need for autoencoders or upsamplers. The proposed distillation loss provides a clear metric to monitor the progress of the distillation process. While effective for 8+ sampling steps, the method's performance for 1-2 steps is not as competitive and needs improvement. The paper relies on automated image quality metrics and would benefit from human evaluations to complement the findings. diffusion models, model distillation, generative models, image generation, moment matching
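A heavily simplified, single-step caricature of the idea is sketched below: noise the student's samples, ask the (frozen) teacher for its conditional expectation of the clean data, and penalize the L2 mismatch. The tiny networks, the linear noising schedule, and the one-step setup are assumptions; the actual method works over multi-step sampling trajectories with alternating optimization.

```python
import torch
import torch.nn as nn

# Stand-in "denoiser": predicts the clean sample x0 from (x_t, t).
class TinyDenoiser(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def moment_matching_step(student_gen, teacher, batch=16, dim=32):
    """One caricatured distillation step: noise the student's samples and pull
    them toward the teacher's conditional expectation E[x0 | x_t]."""
    z = torch.randn(batch, dim)
    x = student_gen(z)                               # student's "clean" samples
    t = torch.rand(batch)
    noise = torch.randn_like(x)
    x_t = (1 - t[:, None]) * x + t[:, None] * noise  # simple noising schedule
    with torch.no_grad():
        x0_teacher = teacher(x_t, t)                 # teacher's expectation of x0
    return ((x - x0_teacher) ** 2).mean()            # match the moments in L2

teacher = TinyDenoiser()
student_gen = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
opt = torch.optim.Adam(student_gen.parameters(), lr=1e-3)
loss = moment_matching_step(student_gen, teacher)
loss.backward()
opt.step()
print(float(loss))
```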
2406.04101 Report How Far Can We Compress Instant-NGP-Based NeRF? Yihang Chen, Qianyi Wu, Mehrtash Harandi, Jianfei Cai In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable capabilities in representing 3D scenes. To expedite the rendering process, learnable explicit representations have been introduced for combination with implicit NeRF representation, which however results in a large storage space requirement. In this paper, we introduce the Context-based NeRF Compression (CNC) framework, which leverages highly efficient context models to provide a storage-friendly NeRF representation. Specifically, we excavate both level-wise and dimension-wise context dependencies to enable probability prediction for information entropy reduction. Additionally, we exploit hash collision and occupancy grids as strong prior knowledge for better context modeling. To the best of our knowledge, we are the first to construct and exploit context models for NeRF compression. We achieve a size reduction of 100x and 70x with improved fidelity against the baseline Instant-NGP on Synthetic-NeRF and Tanks and Temples datasets, respectively. Additionally, we attain 86.7% and 82.3% storage size reduction against the SOTA NeRF compression method BiRF. Our code is available here: https://github.com/YihangChen-ee/CNC. This paper proposes Context-based NeRF Compression (CNC), a novel framework using context models for compressing NeRF models with multi-resolution hash encoding (e.g., Instant-NGP). Explicit representations in NeRF, while enabling fast rendering, lead to large storage requirements. CNC addresses this by minimizing information uncertainty in explicit feature encoding through context modeling, enabling storage-efficient NeRF representations. CNC leverages level-wise and dimension-wise context models to estimate the probability distribution of feature embeddings for entropy reduction. It utilizes hash collision and occupancy grids from Instant-NGP to improve the accuracy of context modeling. CNC achieves over 100x and 70x size reduction on Synthetic-NeRF and Tanks and Temples datasets respectively, while improving fidelity compared to the Instant-NGP baseline. Compared to BiRF (SOTA NeRF compression), CNC achieves 86.7% and 82.3% size reduction on the two datasets. Ablation studies validate the importance of both level-wise and dimension-wise context models, the coarse-to-fine contextual order, and the hash fusion module for achieving optimal compression performance. A limitation of CNC is the increased training time compared to models without context models. Future work includes exploring faster implementations of context models and applying the CNC framework to compress dynamic or large-scale NeRFs. neural radiance field, nerf compression, context modeling, hash encoding, occupancy grid
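The entropy-modeling ingredient can be sketched as a context model that predicts a categorical distribution over quantized feature symbols, with the training rate measured as cross-entropy in bits (what an entropy coder would spend on average). The tiny MLP, the symbol alphabet, and the context features below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextModel(nn.Module):
    """Predicts a categorical distribution over quantized feature symbols
    from context (e.g. coarser-level features plus occupancy cues)."""
    def __init__(self, ctx_dim=16, num_symbols=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_symbols))

    def forward(self, context):
        return self.net(context)  # unnormalized logits per symbol

def estimated_bits(logits, symbols):
    """Cross-entropy in bits: the rate an entropy coder would need on average."""
    log_probs = F.log_softmax(logits, dim=-1)
    picked = log_probs.gather(-1, symbols[:, None]).squeeze(-1)
    return (-picked / torch.log(torch.tensor(2.0))).sum()

# Toy usage: 1,000 quantized embeddings, each with a 16-d context vector.
model = ContextModel()
context = torch.randn(1000, 16)
symbols = torch.randint(0, 64, (1000,))
rate = estimated_bits(model(context), symbols)
print(f"estimated size: {rate.item() / 8 / 1024:.2f} KiB")
```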
2406.04032 Report Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis Marianna Ohanyan, Hayk Manukyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi We present Zero-Painter, a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. Our method utilizes object masks and individual descriptions, coupled with a global text prompt, to generate images with high fidelity. Zero-Painter employs a two-stage process involving our novel Prompt-Adjusted Cross-Attention (PACA) and Region-Grouped Cross-Attention (ReGCA) blocks, ensuring precise alignment of generated objects with textual prompts and mask shapes. Our extensive experiments demonstrate that Zero-Painter surpasses current state-of-the-art methods in preserving textual details and adhering to mask shapes. Introduces Zero-Painter, a training-free framework for layout-conditional text-to-image synthesis, generating images from object masks, individual descriptions, and a global text prompt. Addresses challenges in crafting detailed prompts and limitations of traditional text-to-image models in handling intricate descriptions of multiple objects. Utilizes a two-stage process: Single Object Generation (SOG) with Prompt-Adjusted Cross-Attention (PACA) for generating individual objects, and Comprehensive Composition (CC) with Region-Grouped Cross-Attention (ReGCA) for seamless object integration based on global prompt and mask-prompt pairs. Zero-Painter surpasses state-of-the-art methods in preserving textual details and adhering to mask shapes. PACA effectively aligns generated objects with individual prompts and prevents generation outside masked areas. ReGCA ensures coherent background generation and maintains object integrity, even with missing object information in the global prompt. Zero-Painter faces limitations in handling overlapping masks, leading to less visually coherent outcomes. Future work will focus on addressing overlapping mask challenges and further enhancing the framework's efficiency. text-to-image synthesis, layout-conditional generation, cross-attention, stable diffusion, image inpainting
2406.03723 Report Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang, Pedro Miraldo, Suhas Lohit, Moitreya Chatterjee Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have enabled their near photo-realistic, free-viewpoint rendering. Although these methods have shown some potential in creating immersive experiences, two drawbacks limit their ubiquity: (i) a significant reduction in reconstruction quality when the computing budget is limited, and (ii) a lack of semantic understanding of the underlying scenes. To address these issues, we introduce Gear-NeRF, which leverages semantic information from powerful image segmentation models. Our approach presents a principled way for learning a spatio-temporal (4D) semantic embedding, based on which we introduce the concept of gears to allow for stratified modeling of dynamic regions of the scene based on the extent of their motion. Such differentiation allows us to adjust the spatio-temporal sampling resolution for each region in proportion to its motion scale, achieving more photo-realistic dynamic novel view synthesis. At the same time, almost for free, our approach enables free-viewpoint tracking of objects of interest - a functionality not yet achieved by existing NeRF-based methods. Empirical studies validate the effectiveness of our method, where we achieve state-of-the-art rendering and tracking performance on multiple challenging datasets. Gear-NeRF is a novel dynamic NeRF approach that leverages semantic information from image segmentation models for stratified modeling of 4D scenes, enabling motion-aware sampling for improved novel view synthesis and free-viewpoint object tracking. Existing dynamic NeRF methods often suffer from reduced reconstruction quality with limited resources and lack semantic understanding of scenes. Gear-NeRF utilizes a 4D semantic embedding to assign gear levels to scene regions based on motion scales, allowing for differentiated spatio-temporal sampling resolutions. Achieves state-of-the-art rendering quality on multiple challenging datasets, outperforming baselines in PSNR, SSIM, and LPIPS. Enables free-viewpoint object tracking with simple user prompts like clicks, achieving over 90% mIoU and accuracy on evaluated datasets. Demonstrates the effectiveness of motion-aware sampling and semantic embedding through ablation studies. Training and inference times are longer compared to some baselines due to the increased sampling density in high-motion regions. Future work includes exploring different gear assignment strategies and optimizing for faster training and inference. neural radiance fields, dynamic scene reconstruction, novel view synthesis, object tracking, semantic segmentation
2406.03697 Report Superpoint Gaussian Splatting for Real-Time High-Fidelity Dynamic Scene Reconstruction Diwen Wan, Ruijie Lu, Gang Zeng Rendering novel view images in dynamic scenes is a crucial yet challenging task. Current methods mainly utilize NeRF-based methods to represent the static scene and an additional time-variant MLP to model scene deformations, resulting in relatively low rendering quality as well as slow inference speed. To tackle these challenges, we propose a novel framework named Superpoint Gaussian Splatting (SP-GS). Specifically, our framework first employs explicit 3D Gaussians to reconstruct the scene and then clusters Gaussians with similar properties (e.g., rotation, translation, and location) into superpoints. Empowered by these superpoints, our method manages to extend 3D Gaussian splatting to dynamic scenes with only a slight increase in computational expense. Apart from achieving state-of-the-art visual quality and real-time rendering under high resolutions, the superpoint representation provides a stronger manipulation capability. Extensive experiments demonstrate the practicality and effectiveness of our approach on both synthetic and real-world datasets. Please see our project page at https://dnvtmf.github.io/SP_GS.github.io. Introduces Superpoint Gaussian Splatting (SP-GS), a novel approach for high-fidelity and real-time rendering in dynamic scenes that clusters similar 3D Gaussians into superpoints to reduce computational expense. Rendering novel views in dynamic scenes is crucial but challenging, with existing NeRF-based methods suffering from low rendering quality and slow inference speed. SP-GS reconstructs scenes with explicit 3D Gaussians and groups them into superpoints based on similar deformation properties. A deformation network predicts superpoint transformations, enabling efficient rendering. A property reconstruction loss enforces rigidity within superpoints. Achieves real-time rendering on dynamic scenes, up to 227 FPS at 800x800 resolution for synthetic datasets and 117 FPS at 536x960 for real datasets. Outperforms previous state-of-the-art methods in terms of visual quality and rendering speed on D-NeRF, HyperNeRF, and NeRF-DS datasets. Demonstrates strong extensibility, supporting applications like non-rigid motion prediction, model distillation, pose estimation, and scene editing. Real-world scene reconstruction relies on sparse point clouds, which can be challenging to obtain accurately, especially for dynamic scenes. Reliance on COLMAP for camera pose estimation in dynamic scenes can be limiting due to its design for static scenes. 3d reconstruction, novel view synthesis, dynamic scene, gaussian splatting, real-time rendering
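A minimal sketch of the superpoint idea, under the assumption that clustering is done with plain k-means on Gaussian positions (the paper also uses motion-related properties such as rotation and translation): group Gaussians into superpoints, then deform every Gaussian rigidly with its superpoint's predicted rotation and translation.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.transform import Rotation

# Toy Gaussian centers; a real model would also carry scales/rotations/opacity.
rng = np.random.default_rng(0)
centers = rng.normal(size=(5000, 3))

# 1) Group Gaussians into superpoints (here: k-means on position only).
labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(centers)

# 2) A deformation network would predict one rigid transform per superpoint
#    per timestep; here we fake 50 small rotations and translations.
rotations = Rotation.from_rotvec(0.05 * rng.normal(size=(50, 3))).as_matrix()
translations = 0.1 * rng.normal(size=(50, 3))

# 3) Apply each superpoint's transform (about its centroid) to its Gaussians.
deformed = np.empty_like(centers)
for s in range(50):
    idx = labels == s
    centroid = centers[idx].mean(axis=0)
    deformed[idx] = (centers[idx] - centroid) @ rotations[s].T + centroid + translations[s]

print(deformed.shape)  # (5000, 3): all Gaussians move rigidly with their superpoint
```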
2406.03586 Report CountCLIP -- [Re] Teaching CLIP to Count to Ten Harshvardhan Mestha, Tejas Agrawal, Karan Bania, Shreyas V, Yash Bhisikar Large vision-language models (VLMs) are shown to learn rich joint image-text representations enabling high performances in relevant downstream tasks. However, they fail to showcase their quantitative understanding of objects, and they lack good counting-aware representation. This paper conducts a reproducibility study of 'Teaching CLIP to Count to Ten' (Paiss et al., 2023), which presents a method to finetune a CLIP model (Radford et al., 2021) to improve zero-shot counting accuracy in an image while maintaining the performance for zero-shot classification by introducing a counting-contrastive loss term. We improve the model's performance on a smaller subset of their training data with lower computational resources. We verify these claims by reproducing their study with our own code. The implementation can be found at https://github.com/SforAiDl/CountCLIP. This paper presents a reproducibility study of 'Teaching CLIP to Count to Ten', focusing on improving the zero-shot counting accuracy of CLIP models while maintaining zero-shot classification performance. Count-aware VLMs are crucial for enhancing text-to-image and text-to-video models, enabling the generation of accurate content with the correct number of entities. The study fine-tunes a pre-trained CLIP model using a counting-contrastive loss term alongside the regular contrastive loss. Three novel schemes for balancing the loss function based on class frequencies are introduced: λ_norm, λ_modal, and λ_log. Additionally, the counting objective is modified to contrast against all possible incorrect counts (CountPlus). The study achieves comparable or better zero-shot counting accuracy than the original work, even with a 640 times smaller training dataset. Balancing the auxiliary loss weight using class frequencies proves effective for improving performance in scenarios with extreme class imbalance and limited data. Changing the counting objective to a multiclass classification loss, combined with balanced lambda, further enhances performance. While improving overall accuracy, class-balancing schemes might compromise the accuracy of data-rich classes. The models struggle to predict higher-numbered classes (7-10) due to limited training data for these classes. Future work should focus on gathering more diverse training data to address this imbalance. vision-language models, zero-shot counting, clip, countbench, class imbalance
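The counting-contrastive objective can be sketched as a cross-entropy over counterfactual captions that differ only in the number word, which also covers the "contrast against all incorrect counts" variant described above. The linear encoders below are stand-ins for the CLIP towers, and the caption features are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUMBER_WORDS = ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]

def counting_contrastive_loss(image_emb, caption_embs, correct_idx, temperature=0.07):
    """Cross-entropy over counterfactual captions that differ only in the count.

    image_emb:    (d,) embedding of one image
    caption_embs: (10, d) embeddings of the caption with each candidate number word
    correct_idx:  index of the true count (0 -> "one", ..., 9 -> "ten")
    """
    sims = F.normalize(caption_embs, dim=-1) @ F.normalize(image_emb, dim=-1)
    return F.cross_entropy(sims[None] / temperature,
                           torch.tensor([correct_idx]))

# Toy usage with stand-in encoders (a real setup would use the CLIP towers).
img_encoder = nn.Linear(512, 256)
txt_encoder = nn.Linear(300, 256)
image_emb = img_encoder(torch.randn(512))
caption_feats = torch.randn(len(NUMBER_WORDS), 300)  # "a photo of {n} dogs", n = 1..10
loss = counting_contrastive_loss(image_emb, txt_encoder(caption_feats), correct_idx=2)
loss.backward()
print(float(loss))
```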
2406.03520 Report VideoPhy: Evaluating Physical Commonsense for Video Generation Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts and styles. Due to their ability to synthesize realistic motions and render complex objects, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g. marbles will roll down when placed on a slanted surface). Specifically, we curate a list of 688 captions that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere from Google, Pika). Further, our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lack physical commonsense. Specifically, the best performing model, Pika, generates videos that adhere to the caption and physical laws for only 19.7% of the instances. VideoPhy thus highlights that the video generative models are far from accurately simulating the physical world. Finally, we also supplement the dataset with an auto-evaluator, VideoCon-Physics, to assess semantic adherence and physical commonsense at scale. This paper introduces VideoPhy, a benchmark dataset designed to evaluate the physical commonsense of text-to-video (T2V) generative models. Current T2V models are being explored as potential physical world simulators. However, their ability to adhere to real-world physics remains unclear, necessitating a dedicated benchmark like VideoPhy. The researchers curated 688 text prompts describing interactions between different states of matter (solid-solid, solid-fluid, fluid-fluid). They generated videos from nine different T2V models using these prompts and conducted human evaluations to assess semantic adherence and physical commonsense. An automatic evaluator, VideoCon-Physics, was also developed for scalable testing. Existing T2V models exhibit a significant lack of physical commonsense, with the best model (Pika) achieving only 19.7% accuracy in both semantic adherence and physical plausibility. Models struggle the most with captions depicting solid-solid interactions, indicating an area for improvement. VideoCon-Physics, a fine-tuned video-language model, outperforms baselines like GPT-4Vision and Gemini-Pro-Vision-1.5 in evaluating semantic adherence and physical commonsense. The study is limited by the scope of the VideoPhy dataset and the diversity of T2V models evaluated. Human evaluations, while insightful, are expensive and may not capture the nuances of diverse cultural perspectives on physics. text-to-video generation, physical commonsense, benchmarking, video understanding, generative ai
2406.03459 Report LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at (https://github.com/Atten4Vis/LW-DETR). This paper introduces LW-DETR, a lightweight detection transformer designed for real-time object detection that outperforms YOLO models. Real-time object detection is crucial for various applications, and this work explores the potential of transformers in this domain. LW-DETR employs a simple architecture consisting of a ViT encoder, a projector, and a shallow DETR decoder. It leverages multi-level feature aggregation, interleaved window and global attentions for efficiency, window-major feature map organization for faster inference, and effective training techniques such as improved loss and pretraining. LW-DETR surpasses previous state-of-the-art real-time detectors, including YOLO-NAS, YOLOv8, and RTMDet, on COCO and other benchmarks. Pretraining on Objects365 significantly boosts LW-DETR performance, demonstrating the benefit of large-scale pretraining for transformer-based detectors. The analysis highlights the impact of NMS post-processing on latency in non-end-to-end detectors and how tuning the score threshold can improve efficiency. The paper focuses solely on real-time detection, and further research is needed to explore its applicability to open-world detection and other vision tasks. Exploring more complex network architectures, similar to those used in YOLO-NAS, could potentially further enhance LW-DETR's performance. object detection, real-time, detection transformer, vision transformer (vit), pretraining
2406.03417 Report CoFie: Learning Compact Neural Surface Representations with Coordinate Fields Hanwen Jiang, Haitao Yang, Georgios Pavlakos, Qixing Huang This paper introduces CoFie, a novel local geometry-aware neural surface representation. CoFie is motivated by the theoretical analysis of local SDFs with quadratic approximation. We find that local shapes are highly compressive in an aligned coordinate frame defined by the normal and tangent directions of local shapes. Accordingly, we introduce Coordinate Field, which is a composition of coordinate frames of all local shapes. The Coordinate Field is optimizable and is used to transform the local shapes from the world coordinate frame to the aligned shape coordinate frame. It largely reduces the complexity of local shapes and benefits the learning of MLP-based implicit representations. Moreover, we introduce quadratic layers into the MLP to enhance expressiveness concerning local shape geometry. CoFie is a generalizable surface representation. It is trained on a curated set of 3D shapes and works on novel shape instances during testing. When using the same amount of parameters with prior works, CoFie reduces the shape error by 48% and 56% on novel instances of both training and unseen shape categories. Moreover, CoFie demonstrates comparable performance to prior works when using only 70% fewer parameters. This paper presents CoFie, a novel local geometry-aware neural surface representation that uses a Coordinate Field to transform local shapes into an aligned coordinate system, simplifying their representation and improving learning. Existing local-aware neural surface representations often lead to a significant increase in parameters. CoFie addresses this by reducing the complexity of representing local shapes through aligned coordinate frames. CoFie represents shapes hierarchically using voxels for coarse geometry and MLP-based neural SDFs for fine-grained details within each voxel. It introduces a learnable Coordinate Field to align local shapes and employs quadratic layers in the MLP to enhance the representation of local shape geometry. CoFie reduces shape error by 48% and 56% on novel instances of both training and unseen shape categories compared to prior arts. CoFie achieves comparable results to prior work while using 70% fewer parameters. CoFie, using a single shared MLP, demonstrates performance comparable to methods that overfit a specific model for each testing shape. CoFie's reliance on local shapes limits its applicability to shape completion tasks, unlike methods with global shape priors. The fixed cell resolution in CoFie can be problematic when a local cell intersects with thin structures. neural surface representation, coordinate field, local geometry, shape auto-decoding, implicit neural representations
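The coordinate-frame alignment at the heart of CoFie can be illustrated as follows: build an orthonormal frame from a cell's normal and tangent, then express query points relative to the cell centroid in that frame before evaluating the local SDF MLP. The toy planar patch below is only there to show that aligned local shapes become simpler; it is a sketch, not the paper's implementation.

```python
import numpy as np

def aligned_local_coords(points, centroid, normal, tangent):
    """Express world-space query points in a cell's aligned coordinate frame.

    The frame axes are the tangent, the bitangent (normal x tangent), and the
    normal of the local surface patch; the origin is the cell centroid.
    """
    n = normal / np.linalg.norm(normal)
    t = tangent - (tangent @ n) * n          # make the tangent orthogonal to n
    t = t / np.linalg.norm(t)
    b = np.cross(n, t)
    R = np.stack([t, b, n], axis=1)          # columns are the frame axes
    return (points - centroid) @ R           # = R^T (x - c) for each row x

# Toy usage: a slanted planar patch becomes (nearly) the z = 0 plane locally.
rng = np.random.default_rng(0)
normal = np.array([0.0, 1.0, 1.0])
tangent = np.array([1.0, 0.0, 0.0])
centroid = np.array([0.3, 0.2, -0.2])
uv = rng.normal(size=(100, 2))
points = centroid + uv[:, :1] * tangent + uv[:, 1:] * np.array([0.0, 1.0, -1.0]) / np.sqrt(2)
local = aligned_local_coords(points, centroid, normal, tangent)
print(np.abs(local[:, 2]).max())  # ~0: the patch is axis-aligned in the local frame
```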
2406.03303 Report Learning Visual Prompts for Guiding the Attention of Vision Transformers Razieh Rezaei, Masoud Jalili Sabet, Jindong Gu, Daniel Rueckert, Philip Torr, Ashkan Khakzar Visual prompting infuses visual information into the input image to adapt models toward specific predictions and tasks. Recently, manually crafted markers such as red circles are shown to guide the model to attend to a target region on the image. However, these markers only work on models trained with data containing those markers. Moreover, finding these prompts requires guesswork or prior knowledge of the domain on which the model is trained. This work circumvents manual design constraints by proposing to learn the visual prompts for guiding the attention of vision transformers. The learned visual prompt, added to any input image would redirect the attention of the pre-trained vision transformer to its spatial location on the image. Specifically, the prompt is learned in a self-supervised manner without requiring annotations and without fine-tuning the vision transformer. Our experiments demonstrate the effectiveness of the proposed optimization-based visual prompting strategy across various pre-trained vision encoders. This paper introduces a self-supervised method for learning visual prompts that guide the attention of pre-trained vision transformers without requiring manual design or fine-tuning. This is important because it allows for the adaptation of various vision transformers to specific tasks and predictions without relying on dataset biases or manual prompt engineering, which can be limiting. The method involves training a deep neural prior to generate a visual prompt (patch). This prompt is then applied to random locations on images, and the attention values of the vision transformer are used to calculate a self-supervised loss. This loss guides the optimization of the prompt to attract attention to its location. Learned prompts effectively guide attention in various vision transformers, including CLIP variants, DeiT, and DINO. Optimal prompts are not universal and vary across models and training paradigms. The method outperforms baselines in keypoint naming tasks, particularly when image context is crucial. The work primarily explores the prompt's effectiveness in keypoint naming tasks; further investigation into other vision tasks is needed. The impact of prompt size and shape on performance warrants more in-depth analysis. visual prompting, vision transformers, self-supervised learning, attention mechanisms, prompt optimization
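A rough sketch of the objective, assuming access to the ViT's CLS-token attention over patch tokens: paste the learnable prompt at a random location and minimize the negative attention mass on the tokens it covers. The random attention map below is a stand-in for a real forward pass through a frozen vision transformer, so no gradient step is shown.

```python
import torch

def attention_steering_loss(cls_attn, prompt_token_ids):
    """Negative attention mass that the CLS token puts on the prompt's tokens.

    cls_attn:         (num_patches,) CLS-token attention over patch tokens
                      (e.g. averaged over heads), summing to ~1.
    prompt_token_ids: indices of the patch tokens covered by the pasted prompt.
    """
    return -cls_attn[prompt_token_ids].sum()

def paste_prompt(image, prompt, top, left):
    """Overwrite a square region of the image with the learnable prompt patch."""
    out = image.clone()
    ph, pw = prompt.shape[-2:]
    out[..., top:top + ph, left:left + pw] = prompt
    return out

# Toy usage: a 14x14 token grid (patch size 16), prompt covering a 2x2 token block.
prompt = torch.randn(3, 32, 32, requires_grad=True)        # learnable patch
image = torch.rand(3, 224, 224)
prompted = paste_prompt(image, prompt, top=64, left=96)     # fed to the frozen ViT
cls_attn = torch.softmax(torch.randn(196), dim=0)           # stand-in attention map
token_ids = torch.tensor([4 * 14 + 6, 4 * 14 + 7, 5 * 14 + 6, 5 * 14 + 7])
loss = attention_steering_loss(cls_attn, token_ids)
print(float(loss))
```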
2406.03293 Report Text-to-Image Rectified Flow as Plug-and-Play Priors Xiaofeng Yang, Cheng Chen, Xulei Yang, Fayao Liu, Guosheng Lin Large-scale diffusion models have achieved remarkable performance in generative tasks. Beyond their initial training applications, these models have proven their ability to function as versatile plug-and-play priors. For instance, 2D diffusion models can serve as loss functions to optimize 3D implicit models. Rectified flow, a novel class of generative models, enforces a linear progression from the source to the target distribution and has demonstrated superior performance across various domains. Compared to diffusion-based methods, rectified flow approaches surpass in terms of generation quality and efficiency, requiring fewer inference steps. In this work, we present theoretical and experimental evidence demonstrating that rectified flow based methods offer similar functionalities to diffusion models - they can also serve as effective priors. Besides the generative capabilities of diffusion priors, motivated by the unique time-symmetry properties of rectified flow models, a variant of our method can additionally perform image inversion. Experimentally, our rectified flow-based priors outperform their diffusion counterparts - the SDS and VSD losses - in text-to-3D generation. Our method also displays competitive performance in image inversion and editing. This paper presents the first study on using pretrained rectified flow models as priors for image editing, inversion and 3D generation, similar to how diffusion models are used. Rectified flow models are gaining popularity for their superior generation quality and efficiency compared to diffusion models, but their potential as priors remained unexplored. The authors propose three methods: 1) RFDS: analogous to SDS loss in diffusion, 2) iRFDS: utilizes time-symmetry of rectified flow for image inversion, 3) RFDS-Rev: a two-stage method to improve RFDS generation quality. RFDS-Rev achieves state-of-the-art performance in text-to-3D generation benchmarks among 2D lifting methods, surpassing diffusion priors. iRFDS demonstrates competitive performance in image inversion and editing compared to diffusion-based methods. Rectified flow based priors show faster convergence speed in 3D generation than diffusion priors. The proposed methods inherit limitations of 2D models, such as difficulty in generating 3D objects with consistent camera poses. The priors might inherit biases present in the pretrained text-to-image models. rectified flow, diffusion model, generative prior, text-to-3d generation, image inversion
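An approximate sketch of an SDS-style loss built on a rectified-flow prior (RFDS in spirit, not the paper's exact formulation): form the linear interpolation x_t = (1 - t) x + t eps, compare the predicted velocity against the linear-flow target eps - x, and push that residual back onto the rendered image with the usual stop-gradient surrogate. The velocity network here is a random stand-in for a pretrained text-conditioned model, and the weighting and sign conventions are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in velocity network; a real one would be a pretrained text-to-image
# rectified-flow model conditioned on the prompt.
class TinyVelocityNet(nn.Module):
    def __init__(self, dim=3 * 16 * 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x_t, t):
        flat = torch.cat([x_t.flatten(1), t[:, None]], dim=-1)
        return self.net(flat).view_as(x_t)

def rfds_loss(velocity_net, rendered, t=None):
    """SDS-style loss for a rectified-flow prior on a differentiably rendered image."""
    b = rendered.shape[0]
    t = torch.rand(b) if t is None else t
    eps = torch.randn_like(rendered)
    x_t = (1 - t.view(-1, 1, 1, 1)) * rendered + t.view(-1, 1, 1, 1) * eps
    with torch.no_grad():
        v_pred = velocity_net(x_t, t)
    grad = v_pred - (eps - rendered.detach())     # mismatch with the linear-flow target
    # Standard trick: a surrogate loss whose gradient w.r.t. `rendered` equals `grad`.
    return (grad * rendered).sum() / b

# Toy usage: "rendered" stands in for the image produced by a 3D representation.
rendered = torch.rand(2, 3, 16, 16, requires_grad=True)
loss = rfds_loss(TinyVelocityNet(), rendered)
loss.backward()
print(rendered.grad.shape)  # torch.Size([2, 3, 16, 16])
```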
2406.03280 Report FusionBench: A Comprehensive Benchmark of Deep Model Fusion Anke Tang, Li Shen, Yong Luo, Han Hu, Bo Du, Dacheng Tao Deep model fusion is an emerging technique that unifies the predictions or parameters of several deep neural networks into a single model in a cost-effective and data-efficient manner. This enables the unified model to take advantage of the original models' strengths, potentially exceeding their performance. Although a variety of deep model fusion techniques have been introduced, their evaluations tend to be inconsistent and often inadequate to validate their effectiveness and robustness against distribution shifts. To address this issue, we introduce FusionBench, which is the first comprehensive benchmark dedicated to deep model fusion. FusionBench covers a wide range of tasks, including open-vocabulary image classification, text classification, and text-to-text generation. Each category includes up to eight tasks with corresponding task-specific models, featuring both full fine-tuning and LoRA fine-tuning, as well as models of different sizes, to ensure fair and balanced comparisons of various multi-task model fusion techniques across different tasks, model scales, and fine-tuning strategies. We implement and evaluate a broad spectrum of deep model fusion techniques. These techniques range from model ensemble methods, which combine the predictions to improve the overall performance, to model merging, which integrates different models into a single one, and model mixing methods, which upscale or recombine the components of the original models. FusionBench now contains 26 distinct tasks, 74 fine-tuned models, and 16 fusion techniques, and we are committed to consistently expanding the benchmark with more tasks, models, and fusion techniques. In addition, we offer a well-documented set of resources and guidelines to aid researchers in understanding and replicating the benchmark results. Homepage https://github.com/tanganke/fusion_bench This paper introduces FusionBench, the first comprehensive benchmark dedicated to evaluating deep model fusion techniques across a variety of tasks and model architectures. Standardized benchmarks for evaluating deep model fusion are lacking, making it challenging to verify the effectiveness and robustness of these techniques. FusionBench addresses this issue and provides insights into best practices and future research directions. FusionBench adopts a modular and extensible platform comprising three core modules: Algorithm Module, Model Pool Module, and Task Pool Module. The benchmark covers a wide range of tasks, including image classification, scene understanding, text classification, and text-to-text generation, using various deep learning models like CLIP, ResNet-50, GPT-2, and Flan-T5. Multi-task model fusion algorithms generally outperform pre-trained models, demonstrating the effectiveness of knowledge transfer. Layer-wise AdaMerging and Weight-Ensembling MoE achieve superior overall performance among the multi-task model fusion methods. Adaptive model fusion methods may be prone to overfitting on certain tasks when the test data distribution is corrupted, indicating the need for further regularization to improve generalization and robustness. FusionBench currently primarily focuses on evaluating deep model fusion for multi-task learning. Future work includes extending the benchmark by incorporating additional datasets and applications, such as human preference alignment, multi-modal fusion, and reinforcement learning tasks. deep model fusion, benchmarking, multi-task learning, model ensemble, model merging, model mixing
2406.03215 Report Searching Priors Makes Text-to-Video Synthesis Better Haoran Cheng, Liang Peng, Linxuan Xia, Yuepeng Hu, Hengjia Li, Qinglin Lu, Xiaofei He, Boxi Wu Significant advancements in video diffusion models have brought substantial progress to the field of text-to-video (T2V) synthesis. However, existing T2V synthesis model struggle to accurately generate complex motion dynamics, leading to a reduction in video realism. One possible solution is to collect massive data and train the model on it, but this would be extremely expensive. To alleviate this problem, in this paper, we reformulate the typical T2V generation process as a search-based generation pipeline. Instead of scaling up the model training, we employ existing videos as the motion prior database. Specifically, we divide T2V generation process into two steps: (i) For a given prompt input, we search existing text-video datasets to find videos with text labels that closely match the prompt motions. We propose a tailored search algorithm that emphasizes object motion features. (ii) Retrieved videos are processed and distilled into motion priors to fine-tune a pre-trained base T2V model, followed by generating desired videos using input prompt. By utilizing the priors gleaned from the searched videos, we enhance the realism of the generated videos' motion. All operations can be finished on a single NVIDIA RTX 4090 GPU. We validate our method against state-of-the-art T2V models across diverse prompt inputs. The code will be public. This paper introduces a novel search-based text-to-video (T2V) generation pipeline that leverages existing video data to improve the realism of generated videos, particularly in terms of motion dynamics. Current T2V models often struggle to generate realistic and complex motion sequences. This work aims to address this limitation by utilizing the abundance of real-world motion information available in existing video datasets. The proposed method involves two main steps: 1) **Video Retrieval:** Given a text prompt, semantically similar videos are retrieved from a dataset. 2) **Tuning and Synthesis:** Keyframes are extracted from the retrieved videos, distilled into motion priors, and used to fine-tune a pre-trained T2V model for generating the final video. The method generates videos with more realistic and temporally coherent motion compared to existing T2V models. User studies confirm that the generated videos are perceived as more realistic and better aligned with the input text prompts. Ablation studies highlight the importance of both the video retrieval and motion distillation components for achieving high-quality results. The method's reliance on text-based video retrieval can be limiting due to semantic ambiguity and the complex relationship between motion and appearance. The keyframe extraction process may sometimes miss broader dynamic context, focusing solely on detected objects. text-to-video synthesis, video diffusion models, motion dynamics, video retrieval, motion distillation
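The retrieval stage reduces to embedding the prompt and the dataset captions and ranking by cosine similarity; a real system would use a text (or motion-aware) encoder, whereas the embeddings below are random placeholders. This is a sketch of the general step, not the paper's tailored search algorithm.

```python
import numpy as np

def top_k_videos(prompt_emb, caption_embs, k=5):
    """Rank dataset captions by cosine similarity to the prompt embedding."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = c @ p
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy usage: stand-in embeddings for one prompt and 10,000 dataset captions.
rng = np.random.default_rng(0)
prompt_emb = rng.normal(size=384)
caption_embs = rng.normal(size=(10_000, 384))
idx, scores = top_k_videos(prompt_emb, caption_embs, k=5)
print(idx, scores.round(3))
# The retrieved clips would then be distilled into motion priors used to
# fine-tune the base text-to-video model before the final generation pass.
```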
2406.03184 Report Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion Hao Wen, Zehuan Huang, Yaohui Wang, Xinyuan Chen, Yu Qiao, Lu Sheng Existing single image-to-3D creation methods typically involve a two-stage process, first generating multi-view images, and then using these images for 3D reconstruction. However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results. We introduce a unified 3D generation framework, named Ouroboros3D, which integrates diffusion-based multi-view image generation and 3D reconstruction into a recursive diffusion process. In our framework, these two modules are jointly trained through a self-conditioning mechanism, allowing them to adapt to each other's characteristics for robust inference. During the multi-view denoising process, the multi-view diffusion model uses the 3D-aware maps rendered by the reconstruction module at the previous timestep as additional conditions. The recursive diffusion framework with 3D-aware feedback unites the entire process and improves geometric consistency. Experiments show that our framework outperforms separation of these two stages and existing methods that combine them at the inference phase. Project page: https://costwen.github.io/Ouroboros3D/ Ouroboros3D, a unified image-to-3D creation framework that integrates multi-view image generation and 3D reconstruction into a recursive diffusion process. Existing two-stage methods for single image-to-3D creation suffer from data bias during inference, which affects the quality of the reconstructed 3D models. This paper aims to address this issue by proposing a unified framework. Ouroboros3D jointly trains a multi-view diffusion model and a feed-forward reconstruction model through a self-conditioning mechanism. During multi-view denoising, the diffusion model utilizes 3D-aware maps (e.g., color and coordinate maps) rendered from the reconstructed 3D model at the previous timestep as additional conditions. Ouroboros3D outperforms existing two-stage methods and methods that combine stages during inference in terms of multi-view consistency and 3D reconstruction quality. The proposed framework effectively mitigates data bias by enabling the two stages to adapt to each other's characteristics. Experiments demonstrate superior geometric consistency and detail in the generated 3D models. The current implementation utilizes 3D Gaussian Splatting as the 3D representation, which might limit its applicability in certain domains. Future work includes exploring mesh-based 3D representations and extending the framework to handle 3D scenes. 3d reconstruction, diffusion models, multi-view synthesis, self-conditioning, image-to-3d generation
2406.03175 Report Dynamic 3D Gaussian Fields for Urban Areas Tobias Fischer, Jonas Kulhanek, Samuel Rota Bulò, Lorenzo Porzi, Marc Pollefeys, Peter Kontschieder We present an efficient neural 3D scene representation for novel-view synthesis (NVS) in large-scale, dynamic urban areas. Existing works are not well suited for applications like mixed-reality or closed-loop simulation due to their limited visual quality and non-interactive rendering speeds. Recently, rasterization-based approaches have achieved high-quality NVS at impressive speeds. However, these methods are limited to small-scale, homogeneous data, i.e. they cannot handle severe appearance and geometry variations due to weather, season, and lighting and do not scale to larger, dynamic areas with thousands of images. We propose 4DGF, a neural scene representation that scales to large-scale dynamic urban areas, handles heterogeneous input data, and substantially improves rendering speeds. We use 3D Gaussians as an efficient geometry scaffold while relying on neural fields as a compact and flexible appearance model. We integrate scene dynamics via a scene graph at global scale while modeling articulated motions on a local level via deformations. This decomposed approach enables flexible scene composition suitable for real-world applications. In experiments, we surpass the state-of-the-art by over 3 dB in PSNR and more than 200 times in rendering speed. This paper introduces a novel neural scene representation method for large-scale, dynamic urban areas, enabling efficient and high-quality novel-view synthesis. Existing methods struggle to achieve both high visual quality and fast rendering speeds in complex urban environments, limiting their use in applications like mixed-reality and simulation. The method leverages 3D Gaussian primitives for geometry, neural fields for compact and flexible appearance modeling, and a scene graph to handle scene dynamics and transient geometry variations. The method outperforms state-of-the-art approaches by over 3dB in PSNR and is more than 200x faster in rendering. It effectively reconstructs large-scale urban areas from heterogeneous data sources with varying weather, lighting, and seasons. The approach successfully models non-rigid object motion, such as pedestrians and cyclists, via a deformation head in the scene graph. The method currently does not model image distortions caused by the physical image formation process, such as rolling shutter or motion blur. The assumption of a pinhole camera model might be suboptimal for certain capturing settings, such as equirectangular cameras. novel view synthesis, 3d scene representation, neural fields, 3d gaussian splatting, dynamic scenes
2406.03070 Report A-Bench: Are LMMs Masters at Evaluating AI-generated Images? Zicheng Zhang, Haoning Wu, Chunyi Li, Yingjie Zhou, Wei Sun, Xiongkuo Min, Zijian Chen, Xiaohong Liu, Weisi Lin, Guangtao Zhai How to accurately and efficiently assess AI-generated images (AIGIs) remains a critical challenge for generative models. Given the high costs and extensive time commitments required for user studies, many researchers have turned towards employing large multi-modal models (LMMs) as AIGI evaluators, the precision and validity of which are still questionable. Furthermore, traditional benchmarks often utilize mostly natural-captured content rather than AIGIs to test the abilities of LMMs, leading to a noticeable gap for AIGIs. Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: 1) Emphasizing both high-level semantic understanding and low-level visual quality perception to address the intricate demands of AIGIs. 2) Various generative models are utilized for AIGI creation, and various LMMs are employed for evaluation, which ensures a comprehensive validation scope. Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs. We hope that A-Bench will significantly enhance the evaluation process and promote the generation quality for AIGIs. The benchmark is available at https://github.com/Q-Future/A-Bench. This paper introduces A-Bench, a diagnostic benchmark designed to evaluate the ability of large multi-modal models (LMMs) to assess AI-generated images (AIGIs). Accurate and efficient evaluation of AIGIs is crucial, but existing methods using small expert models or traditional benchmarks have limitations. LMMs are increasingly used for evaluation, but their reliability remains questionable. A-Bench focuses on high-level semantic understanding (A-Bench-P1) and low-level quality perception (A-Bench-P2). It includes 2,864 AIGIs from 16 T2I models, paired with question-answers annotated by human experts, and tests 18 LMMs. LMMs outperform random guessing but lag significantly behind human performance in evaluating AIGIs. LMMs excel at basic semantic understanding but struggle with complex prompts and nuanced quality assessment, particularly in identifying generative distortions. Proprietary LMMs generally outperform open-source LMMs, but both fall short of human-level evaluation. The choice and number of generative models and LMMs used in A-Bench might limit the generalizability of the results. The rapid evolution of AI might necessitate frequent updates to A-Bench to maintain its relevance. ai-generated images, image evaluation, large multi-modal models, benchmarking, semantic understanding
2406.03035 Report Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control Jingyun Xue, Hongfa Wang, Qi Tian, Yue Ma, Andong Wang, Zhiyuan Zhao, Shaobo Min, Wenzhe Zhao, Kaihao Zhang, Heung-Yeung Shum, Wei Liu, Mengyang Liu, Wenhan Luo Pose-controllable character video generation is in high demand with extensive applications for fields such as automatic advertising and content creation on social media platforms. While existing character image animation methods using pose sequences and reference images have shown promising performance, they tend to struggle with incoherent animation in complex scenarios, such as multiple character animation and body occlusion. Additionally, current methods request large-scale high-quality videos with stable backgrounds and temporal consistency as training datasets, otherwise, their performance will greatly deteriorate. These two issues hinder the practical utilization of character image animation tools. In this paper, we propose a practical and robust framework Follow-Your-Pose v2, which can be trained on noisy open-sourced videos readily available on the internet. Multi-condition guiders are designed to address the challenges of background stability, body occlusion in multi-character generation, and consistency of character appearance. Moreover, to fill the gap of fair evaluation of multi-character pose animation, we propose a new benchmark comprising approximately 4,000 frames. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a margin of over 35\% across 2 datasets and on 7 metrics. Meanwhile, qualitative assessments reveal a significant improvement in the quality of generated video, particularly in scenarios involving complex backgrounds and body occlusion of multi-character, suggesting the superiority of our approach. This paper presents Follow-Your-Pose v2 (FYPv2), a practical and robust framework for character image animation trained on noisy open-sourced videos, addressing limitations of existing methods in handling complex scenarios like multiple characters and body occlusion. Pose-controllable character video generation is crucial for various applications, including automatic advertising and content creation. Existing methods struggle with incoherent animation in complex scenes and require high-quality training data, limiting their practicality. FYPv2 employs multi-condition guided generation: optical flow guider for background stability, depth guider for addressing body occlusion in multi-character generation, and reference pose guider for appearance consistency. It's trained on a large-scale noisy dataset from the internet. Additionally, a new benchmark with approximately 4,000 frames is proposed for evaluating multi-character pose animation. FYPv2 outperforms state-of-the-art methods by over 35% across 2 datasets and on 7 metrics. It demonstrates significant improvement in generating temporally consistent and realistic animations, especially in complex backgrounds and multi-character scenes with body occlusion. The proposed multi-character benchmark provides a valuable resource for evaluating character animation models. The model's performance might be affected by extreme pose variations or complex actions not well-represented in the training data. Future work could explore incorporating more sophisticated temporal modeling techniques for smoother and more natural animations. character image animation, pose control, video generation, latent diffusion model, multi-character animation
2406.02968 Report Adversarial Generation of Hierarchical Gaussians for 3D Generative Model Sangeek Hyun, Jae-Pil Heo Most advances in 3D Generative Adversarial Networks (3D GANs) largely depend on ray casting-based volume rendering, which incurs demanding rendering costs. One promising alternative is rasterization-based 3D Gaussian Splatting (3D-GS), providing a much faster rendering speed and explicit 3D representation. In this paper, we exploit Gaussian as a 3D representation for 3D GANs by leveraging its efficient and explicit characteristics. However, in an adversarial framework, we observe that a na\"ive generator architecture suffers from training instability and lacks the capability to adjust the scale of Gaussians. This leads to model divergence and visual artifacts due to the absence of proper guidance for initialized positions of Gaussians and densification to manage their scales adaptively. To address these issues, we introduce a generator architecture with a hierarchical multi-scale Gaussian representation that effectively regularizes the position and scale of generated Gaussians. Specifically, we design a hierarchy of Gaussians where finer-level Gaussians are parameterized by their coarser-level counterparts; the position of finer-level Gaussians would be located near their coarser-level counterparts, and the scale would monotonically decrease as the level becomes finer, modeling both coarse and fine details of the 3D scene. Experimental results demonstrate that ours achieves a significantly faster rendering speed (x100) compared to state-of-the-art 3D consistent GANs with comparable 3D generation capability. Project page: https://hse1032.github.io/gsgan. This paper introduces the use of 3D Gaussian representation with rasterization for efficient 3D GANs, proposing a hierarchical structure that regularizes the positions and scales of Gaussians to improve training stability and generation quality. Existing 3D GANs rely heavily on computationally expensive ray casting-based volume rendering. This paper leverages the efficiency of rasterization-based 3D Gaussian Splatting (3D-GS) to accelerate the rendering process significantly. The authors propose a hierarchical 3D Gaussian representation for the generator in 3D GANs. This hierarchy encourages coarse-to-fine 3D scene modeling by linking the position and scale parameters of Gaussians at adjacent levels. The generator architecture, based on transformer blocks, implements this hierarchy, ensuring stable training and detailed scene generation. Additionally, anchor Gaussians are introduced to further enhance the regularization process. The proposed method achieves significantly faster rendering speeds (over 100 times faster than state-of-the-art methods) while maintaining comparable generation quality. Experiments on FFHQ and AFHQ-Cat datasets demonstrate the effectiveness of the proposed method in generating realistic and multi-view consistent images. The hierarchical Gaussian representation stabilizes the training process, especially during the early stages, compared to naive 3D Gaussian implementations in GANs. The number of Gaussians used is fixed and not adapted based on scene complexity, potentially limiting representation capacity for diverse scenes. The scale hierarchy relies on hyperparameters that might need adjustment based on the dataset and resolution. generative adversarial networks (gans), 3d gaussian splatting, rasterization, hierarchical representation, efficient rendering
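A minimal sketch of the coarse-to-fine constraint described above: finer-level Gaussians are parameterized relative to their coarser-level parents so that positions stay near the parent and scales shrink monotonically. The function name, tensor shapes, tanh/sigmoid bounding, and branching factor are illustrative assumptions, not the paper's exact formulation.

    import torch

    def spawn_finer_level(parent_pos, parent_scale, raw_offset, raw_ratio):
        """Hierarchical Gaussian sketch (hypothetical helper): parent_pos/parent_scale
        are (N, 3); raw_offset/raw_ratio are (N, K, 3) generator outputs for K children
        per parent. Children stay near their parents and have strictly smaller scales."""
        # bounded offset keeps each child inside its parent's extent
        child_pos = parent_pos[:, None, :] + torch.tanh(raw_offset) * parent_scale[:, None, :]
        # sigmoid ratio in (0, 1) enforces monotonically decreasing scale per level
        child_scale = parent_scale[:, None, :] * torch.sigmoid(raw_ratio)
        return child_pos.reshape(-1, 3), child_scale.reshape(-1, 3)
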
2406.02965 Report Understanding the Impact of Negative Prompts: When and How Do They Take Effect? Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, Cho-Jui Hsieh The concept of negative prompts, emerging from conditional generation models like Stable Diffusion, allows users to specify what to exclude from the generated images. Despite the widespread use of negative prompts, their intrinsic mechanisms remain largely unexplored. This paper presents the first comprehensive study to uncover how and when negative prompts take effect. Our extensive empirical analysis identifies two primary behaviors of negative prompts. Delayed Effect: The impact of negative prompts is observed after positive prompts render corresponding content. Deletion Through Neutralization: Negative prompts delete concepts from the generated image through a mutual cancellation effect in latent space with positive prompts. These insights reveal significant potential real-world applications; for example, we demonstrate that negative prompts can facilitate object inpainting with minimal alterations to the background via a simple adaptive algorithm. We believe our findings will offer valuable insights for the community in capitalizing on the potential of negative prompts. This paper presents the first comprehensive study uncovering the mechanisms of negative prompts in conditional image generation models, particularly their delayed effect and how they delete concepts through neutralization in latent space. Despite the popularity of negative prompts for controlling image generation, their intrinsic mechanisms remain largely unexplored, hindering the full utilization of their potential. The authors conduct extensive empirical analysis, visualizing cross-attention maps across diffusion steps and analyzing estimated noises, to understand when and how negative prompts take effect. Negative prompts exhibit a delayed effect, influencing generation only after positive prompts render corresponding content. Negative prompts delete objects by neutralizing positive signals in latent space through a subtractive process. Introducing negative prompts too early can lead to the paradoxical generation of the undesired object ("Reverse Activation") due to the interplay of data distribution guidance and prompt guidance. The study primarily focuses on noun and adjective-based negative prompts, leaving other parts of speech unexplored. Future work can explore incorporating negative prompts during model training as a form of data augmentation. negative prompts, diffusion models, image generation, controllable inpainting, reverse activation
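The "deletion through neutralization" behavior maps onto the standard classifier-free-guidance update in which the negative-prompt prediction replaces the unconditional branch and is subtracted from the positive prediction. Below is a minimal sketch of a delayed schedule; the step threshold and guidance scale are assumed values for illustration, not the paper's adaptive algorithm.

    def guided_noise(eps_pos, eps_neg, eps_uncond, step, scale=7.5, neg_start=10):
        """eps_*: denoiser noise predictions for the positive prompt, the negative
        prompt, and the empty prompt at the current diffusion step (same shapes)."""
        if step < neg_start:
            # early steps: plain classifier-free guidance against the empty prompt,
            # since introducing the negative prompt too early can trigger Reverse Activation
            return eps_uncond + scale * (eps_pos - eps_uncond)
        # later steps: subtract the negative branch, neutralizing the unwanted concept
        # in latent space once the positive prompt has rendered it
        return eps_neg + scale * (eps_pos - eps_neg)
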
2406.02923 Report Rethinking Spiking Neural Networks as State Space Models Malyaban Bal, Abhronil Sengupta Spiking neural networks (SNNs) are posited as a biologically plausible alternative to conventional neural architectures, with their core computational framework resting on the extensively studied leaky integrate-and-fire (LIF) neuron design. The stateful nature of LIF neurons has spurred ongoing discussions about the ability of SNNs to process sequential data, akin to recurrent neural networks (RNNs). Despite this, there remains a significant gap in the exploration of current SNNs within the realm of long-range dependency tasks. In this study, to extend the analysis of neuronal dynamics beyond simplistic LIF mechanism, we present a novel class of stochastic spiking neuronal model grounded in state space models. We expand beyond the scalar hidden state representation of LIF neurons, which traditionally comprises only the membrane potential, by proposing an n-dimensional hidden state. Additionally, we enable fine-tuned formulation of neuronal dynamics across each layer by introducing learnable parameters, as opposed to the fixed dynamics in LIF neurons. We also develop a robust framework for scaling these neuronal models to deep SNN-based architectures, ensuring efficient parallel training while also adeptly addressing the challenge of non-differentiability of stochastic spiking operation during the backward phase. Our models attain state-of-the-art performance among SNN models across diverse long-range dependency tasks, encompassing the Long Range Arena benchmark, permuted sequential MNIST, and the Speech Command dataset. Moreover, we provide an analysis of the energy efficiency advantages, emphasizing the sparse activity pattern intrinsic to this spiking model. This paper proposes Stochastic Spiking Structured State Space Models (S6), a novel class of neuronal models inspired by biological neurons and based on state space models, to improve spiking neural networks' (SNNs) ability to process long-range dependencies in sequential data. Current SNNs, primarily based on the leaky integrate-and-fire (LIF) neuron model, struggle with long-range dependencies due to their simplified dynamics and limited hidden state representation. This limits their application in tasks like natural language processing and time-series analysis where long-term dependencies are crucial. The authors replace the scalar hidden state of LIF neurons with an n-dimensional hidden state, enabling richer temporal information encoding. They use a stochastic spiking mechanism instead of the deterministic one in LIF models. They formulate the neuronal dynamics as a convolution operation to enable parallel training and inference, enhancing scalability and energy efficiency. S6-based SNNs achieve state-of-the-art performance among SNN models on various long-range dependency tasks, including the Long Range Arena benchmark, permuted sequential MNIST, and the Speech Command dataset. The model outperforms traditional non-spiking transformer-based architectures on these tasks, demonstrating its capability to handle long sequences effectively. Analysis shows the model offers significant energy efficiency gains due to the sparse spiking activity inherent in the S6 model. The model's performance was primarily evaluated on classification-based long-range dependency tasks. Future work can explore its application to generative tasks. 
To fully realize the energy and power efficiency benefits, future steps could involve deploying the model on edge devices and neuromorphic chips like Intel Loihi 2. spiking neural networks, state space models, long-range dependencies, sequence modeling, neuromorphic computing
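To make the contrast with a scalar-state LIF neuron concrete, here is a single-step sketch of a state-space neuron with an n-dimensional hidden state and stochastic spiking. The discretized update, scalar readout, and Bernoulli firing rule are illustrative assumptions; the paper additionally unrolls such dynamics as a convolution for parallel training.

    import torch

    def s6_step(h, x, A, B, C, threshold=1.0):
        """h: (n,) hidden state, x: scalar input tensor, A: (n, n), B: (n,), C: (n,).
        Learnable A, B, C replace the fixed leak/reset dynamics of a LIF neuron."""
        h = A @ h + B * x                        # n-dimensional linear state update
        membrane = C @ h                         # scalar membrane-like readout
        p_spike = torch.sigmoid(membrane - threshold)
        spike = torch.bernoulli(p_spike)         # stochastic spike (surrogate gradient in training)
        return h, spike
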
2406.02918 Report U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation Chenxin Li, Xinyu Liu, Wuyang Li, Cheng Wang, Hengyu Liu, Yixuan Yuan U-Net has become a cornerstone in various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, the networks are still limited to linearly modeling patterns as well as the deficient interpretability. To address these challenges, our intuition is inspired by the impressive results of the Kolmogorov-Arnold Networks (KANs) in terms of accuracy and interpretability, which reshape the neural network learning via the stack of non-linear learnable activation functions derived from the Kolmogorov-Anold representation theorem. Specifically, in this paper, we explore the untapped potential of KANs in improving backbones for vision tasks. We investigate, modify and re-design the established U-Net pipeline by integrating the dedicated KAN layers on the tokenized intermediate representation, termed U-KAN. Rigorous medical image segmentation benchmarks verify the superiority of U-KAN by higher accuracy even with less computation cost. We further delved into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, demonstrating its applicability in generating task-oriented model architectures. These endeavours unveil valuable insights and sheds light on the prospect that with U-KAN, you can make strong backbone for medical image segmentation and generation. Project page: https://yes-ukan.github.io/ This paper proposes U-KAN, a novel framework integrating Kolmogorov-Arnold Networks (KANs) into the U-Net architecture, aiming to improve accuracy, efficiency, and interpretability in vision tasks, particularly medical image segmentation. Existing U-Net variations, while advanced, face limitations in linearly modeling complex patterns and lack interpretability, hindering their reliability and explainability in critical applications like medical imaging. U-KAN employs a two-phase encoder-decoder structure. It utilizes convolutional blocks for initial feature extraction and introduces tokenized KAN blocks at higher-level representations to capture complex patterns. Additionally, it leverages skip connections for detailed feature fusion. U-KAN outperforms state-of-the-art segmentation models, including U-Net++, Att-UNet, and U-Mamba, on BUSI, GlaS, and CVC-ClinicDB datasets, achieving higher IoU and F1 scores. The method demonstrates superior efficiency with fewer parameters and comparable or lower Gflops than most compared methods, except for U-NeXt. As a diffusion model backbone, Diffusion U-KAN exhibits superior generative capabilities compared to conventional U-Net-based diffusion models, achieving better FID and IS scores on the tested medical datasets. The paper primarily focuses on medical image analysis, exploring segmentation and generation tasks. Further research is needed to validate its effectiveness in broader vision applications. The impact of different KAN layer configurations and their interplay with other architectural components warrants further investigation to unlock the full potential of U-KAN. u-net, kolmogorov-arnold networks, medical image segmentation, image generation, diffusion models
2406.02917 Report A comprehensive and FAIR comparison between MLP and KAN representations for differential equations and operator networks Khemraj Shukla, Juan Diego Toscano, Zhicheng Wang, Zongren Zou, George Em Karniadakis Kolmogorov-Arnold Networks (KANs) were recently introduced as an alternative representation model to MLP. Herein, we employ KANs to construct physics-informed machine learning models (PIKANs) and deep operator models (DeepOKANs) for solving differential equations for forward and inverse problems. In particular, we compare them with physics-informed neural networks (PINNs) and deep operator networks (DeepONets), which are based on the standard MLP representation. We find that although the original KANs based on the B-splines parameterization lack accuracy and efficiency, modified versions based on low-order orthogonal polynomials have comparable performance to PINNs and DeepONet although they still lack robustness as they may diverge for different random seeds or higher order orthogonal polynomials. We visualize their corresponding loss landscapes and analyze their learning dynamics using information bottleneck theory. Our study follows the FAIR principles so that other researchers can use our benchmarks to further advance this emerging topic. This work systematically compares Kolmogorov-Arnold Networks (KANs) to Multilayer Perceptrons (MLPs) for solving differential equations and operator learning problems, focusing on their accuracy, efficiency, and learning dynamics. Despite the popularity of MLPs in scientific machine learning, they have limitations in interpretability and efficiency. KANs offer a potentially more interpretable and accurate alternative, making their systematic evaluation crucial. The authors benchmark various KAN architectures against MLP-based models (PINNs, DeepONets) on several problems: function approximation, Hamiltonian dynamics, Helmholtz equation, Navier-Stokes equation, Allen-Cahn equation, Burgers' equation, and Darcy flow. They analyze accuracy, training time, and use the Information Bottleneck theory to understand learning dynamics. Modified Chebyshev KANs (cPIKANs) show comparable accuracy to PINNs, sometimes outperforming them, but with increased training time. cPIKANs are more robust to noise than DeepONets in operator learning tasks but require more computational resources. Both PINNs and cPIKANs exhibit similar learning dynamics through fitting, diffusion, and total diffusion stages, as revealed by the Information Bottleneck analysis. Training cPIKANs is computationally more expensive than PINNs, especially for high-dimensional problems. cPIKANs exhibit sensitivity to initialization and choice of polynomial order, sometimes leading to instability. scientific machine learning, kolmogorov-arnold networks, physics-informed neural networks, operator learning, information bottleneck
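The "low-order orthogonal polynomial" variant (the Chebyshev parameterization behind cPIKANs) can be sketched as a layer whose input-output edges apply learnable Chebyshev expansions to tanh-squashed inputs. The class name, initialization, and normalization details are assumptions for illustration.

    import torch

    class ChebyKANLayer(torch.nn.Module):
        """Sketch of a Chebyshev-parameterized KAN layer (details are assumptions)."""
        def __init__(self, dim_in, dim_out, degree=3):
            super().__init__()
            self.degree = degree
            self.coeffs = torch.nn.Parameter(torch.randn(dim_in, dim_out, degree + 1) * 0.1)

        def forward(self, x):                       # x: (batch, dim_in)
            x = torch.tanh(x)                       # squash inputs into [-1, 1]
            T = [torch.ones_like(x), x]             # Chebyshev recurrence: T0, T1
            for k in range(2, self.degree + 1):
                T.append(2 * x * T[-1] - T[-2])     # T_k = 2x T_{k-1} - T_{k-2}
            T = torch.stack(T, dim=-1)              # (batch, dim_in, degree+1)
            return torch.einsum('bik,iok->bo', T, self.coeffs)
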
2406.02915 Report Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models Jinhao Li, Haopeng Li, Sarah Erfani, Lei Feng, James Bailey, Feng Liu It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions generated by a large language model can significantly enhance zero-shot performance. However, in this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image, and then we theoretically validate this finding. Thus, we present a method called weighted visual-text cross alignment (WCA). This method begins with a localized visual prompting technique, designed to identify local visual areas within the query image. The local visual areas are then cross-aligned with the finer descriptions by creating a similarity matrix using the pre-trained VLM. To determine how well a query image aligns with each category, we develop a score function based on the weighted similarities in this matrix. Extensive experiments demonstrate that our method significantly improves zero-shot performance across various datasets, achieving results that are even comparable to few-shot learning methods. This paper proposes Weighted Visual-Text Cross Alignment (WCA), a method that improves zero-shot visual classification by aligning fine-grained text descriptions with local visual areas of an image using localized visual prompting. Aligning whole images with fine-grained descriptions can be suboptimal, as such descriptions often better match specific image regions. WCA addresses this limitation by focusing on local alignment, leading to improved performance. WCA first segments an image into local patches using localized visual prompting. Then, it cross-aligns these patches with fine-grained text descriptions generated by a large language model for each category, creating a similarity matrix. Finally, a weighted aggregation scheme, considering the relevance of both patches and descriptions, determines the final image-category alignment score. WCA significantly outperforms existing zero-shot methods on various benchmarks, including ImageNet, CUB, and Oxford Pets. The method shows particularly strong improvements on tasks where standard CLIP models struggle, indicating its effectiveness in handling complex visual recognition scenarios. WCA even achieves performance comparable to few-shot learning methods, highlighting its potential for learning with limited labeled data. WCA might be less effective for tasks requiring holistic image understanding rather than object-centric recognition. The method's performance could be hindered when images contain multiple objects of varying sizes, as patch weights might not always accurately capture the importance of smaller objects. visual-text cross alignment, zero-shot classification, vision-language models, visual prompting, large language models
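A simplified sketch of the weighted cross-alignment score: local-crop embeddings and per-class description embeddings form a similarity matrix that is aggregated with softmax weights over both axes. The temperature and the exact weighting rule are simplifying assumptions, not the paper's precise score function.

    import torch
    import torch.nn.functional as F

    def wca_score(patch_emb, desc_emb, tau=0.01):
        """patch_emb: (P, d) embeddings of localized visual crops; desc_emb: (M, d)
        embeddings of the finer text descriptions for one candidate class."""
        patch_emb = F.normalize(patch_emb, dim=-1)
        desc_emb = F.normalize(desc_emb, dim=-1)
        sim = patch_emb @ desc_emb.T                              # (P, M) similarity matrix
        w_patch = torch.softmax(sim.mean(dim=1) / tau, dim=0)     # relevance of each patch
        w_desc = torch.softmax(sim.mean(dim=0) / tau, dim=0)      # relevance of each description
        return (w_patch[:, None] * sim * w_desc[None, :]).sum()   # scalar class score
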
2406.02881 Report Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Adapter Peng Xing, Ning Wang, Jianbo Ouyang, Zechao Li The remarkable advancement in text-to-image generation models significantly boosts the research in ID customization generation. However, existing personalization methods cannot simultaneously satisfy high fidelity and high-efficiency requirements. Their main bottleneck lies in the prompt image encoder, which produces weak alignment signals with the text-to-image model and significantly increased model size. Towards this end, we propose a lightweight Inv-Adapter, which first extracts diffusion-domain representations of ID images utilizing a pre-trained text-to-image model via DDIM image inversion, without additional image encoder. Benefiting from the high alignment of the extracted ID prompt features and the intermediate features of the text-to-image model, we then embed them efficiently into the base text-to-image model by carefully designing a lightweight attention adapter. We conduct extensive experiments to assess ID fidelity, generation loyalty, speed, and training parameters, all of which show that the proposed Inv-Adapter is highly competitive in ID customization generation and model scale. This paper proposes Inv-Adapter, a lightweight method for high-fidelity ID customization in text-to-image generation, utilizing DDIM image inversion to extract diffusion-domain representations of ID images and embedding them efficiently via a lightweight attention adapter. Existing personalization methods struggle to achieve both high fidelity and high efficiency in ID customization generation due to weak alignment signals and increased model size from prompt image encoders. Inv-Adapter extracts diffusion features from pre-trained text-to-image models via DDIM inversion and injects them into both self and cross attention layers using a lightweight Embedded Attention Adapter. Inv-Adapter achieves state-of-the-art performance in generating faithful, detailed, and high-fidelity images while maintaining high efficiency. It effectively preserves ID information while aligning with textual prompts, demonstrated by quantitative metrics (CLIP-I, DINO, FACE-SIM) and qualitative results. The lightweight design results in smaller training parameters and faster generation speed compared to other methods. The current training dataset lacks diversity in face poses, limiting the model's ability to generalize to different viewpoints. Image inversion introduces a speed bottleneck, which could be addressed in future work with model acceleration techniques like LCM. id customization generation, text-to-image generation, image inversion, attention adapter, diffusion models
2406.02820 Report ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models Kiymet Akdemir, Pinar Yanardag Text-to-image diffusion models have recently taken center stage as pivotal tools in promoting visual creativity across an array of domains such as comic book artistry, children's literature, game development, and web design. These models harness the power of artificial intelligence to convert textual descriptions into vivid images, thereby enabling artists and creators to bring their imaginative concepts to life with unprecedented ease. However, one of the significant hurdles that persist is the challenge of maintaining consistency in character generation across diverse contexts. Variations in textual prompts, even if minor, can yield vastly different visual outputs, posing a considerable problem in projects that require a uniform representation of characters throughout. In this paper, we introduce a novel framework designed to produce consistent character representations from a single text prompt across diverse settings. Through both quantitative and qualitative analyses, we demonstrate that our framework outperforms existing methods in generating characters with consistent visual identities, underscoring its potential to transform creative industries. By addressing the critical challenge of character consistency, we not only enhance the practical utility of these models but also broaden the horizons for artistic and creative expression. Introduces ORACLE, a novel framework that leverages mutual information to ensure consistent character generation across diverse settings from a single text prompt in text-to-image diffusion models. Addresses the critical challenge of maintaining visual consistency in character generation across different contexts, which is crucial for storytelling, brand identity, and character recognition in various creative applications. 1. Generates a grid of candidate character images from a text prompt using a pre-trained diffusion model. 2. Identifies and removes inconsistent images from the candidate set using mutual information-based filtering. 3. Trains a personalized model (e.g., LoRA) on the refined image set to generate consistent characters in various contexts. ORACLE outperforms existing methods in generating characters with consistent visual identities across diverse settings, as demonstrated through qualitative and quantitative comparisons. User study confirms that ORACLE produces characters that are both consistent and relevant to the given text prompts. The framework is highly versatile and applicable for various creative tasks like story illustration, object generation, and 3D character modeling. Despite consistent input images, the underlying diffusion model may still introduce minor inconsistencies in details like clothing. The current implementation requires manual cropping of the generated character grid, which can be automated in future work. text-to-image synthesis, diffusion models, character consistency, mutual information, personalization
2406.02720 Report 3D-HGS: 3D Half-Gaussian Splatting Haolin Li, Jinyang Liu, Mario Sznaier, Octavia Camps Photo-realistic 3D Reconstruction is a fundamental problem in 3D computer vision. This domain has seen considerable advancements owing to the advent of recent neural rendering techniques. These techniques predominantly aim to focus on learning volumetric representations of 3D scenes and refining these representations via loss functions derived from rendering. Among these, 3D Gaussian Splatting (3D-GS) has emerged as a significant method, surpassing Neural Radiance Fields (NeRFs). 3D-GS uses parameterized 3D Gaussians for modeling both spatial locations and color information, combined with a tile-based fast rendering technique. Despite its superior rendering performance and speed, the use of 3D Gaussian kernels has inherent limitations in accurately representing discontinuous functions, notably at edges and corners for shape discontinuities, and across varying textures for color discontinuities. To address this problem, we propose to employ 3D Half-Gaussian (3D-HGS) kernels, which can be used as a plug-and-play kernel. Our experiments demonstrate their capability to improve the performance of current 3D-GS related methods and achieve state-of-the-art rendering performance on various datasets without compromising rendering speed. This paper introduces 3D Half-Gaussian Splatting (3D-HGS), a novel plug-and-play reconstruction kernel designed to enhance the accuracy of 3D scene reconstruction in neural rendering. The key innovation lies in splitting the traditional 3D Gaussian kernel into two halves, each with learnable opacity, enabling better representation of discontinuities in shape and color often found at edges, corners, and texture-rich areas. Accurately reconstructing 3D scenes with photorealism is crucial for various applications such as VR, media production, and autonomous driving. While existing methods like 3D Gaussian Splatting (3D-GS) have achieved impressive speed and quality, they struggle with discontinuities. This work addresses this limitation, aiming for state-of-the-art performance without sacrificing rendering speed. The method starts with a 3D scene representation obtained through Structure from Motion. Instead of 3D Gaussians, 3D Half-Gaussians, defined by a splitting plane and individual opacities for each half, are used as reconstruction kernels. These are projected onto the image plane and blended to synthesize novel views. The parameters of these kernels, including the splitting plane normal and opacities, are optimized by minimizing a loss function comparing rendered images to ground truth. 3D-HGS, when implemented within existing 3D-GS frameworks, demonstrates state-of-the-art rendering performance on datasets like Mip-NeRF360, Tanks & Temples, and Deep Blending. The method excels at capturing fine-grained details, high-frequency textures, complex lighting, and shadow areas, surpassing previous state-of-the-art methods in both quantitative metrics (PSNR, SSIM, LPIPS) and qualitative visual comparisons. Ablation studies confirm the effectiveness of the 3D Half-Gaussian kernel compared to other kernel choices and highlight the impact of training strategies, including the learning rate for the normal of the splitting plane. Despite improvements in novel view synthesis, 3D-HGS still faces challenges with geometry reconstruction in featureless areas, requiring further research. 
The ethical implications of generating highly realistic 3D scenes, including potential misuse for disinformation and privacy violations, are acknowledged, highlighting the need for responsible development and deployment of such technology. 3d reconstruction, neural rendering, gaussian splatting, novel view synthesis, discontinuity modeling
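The half-Gaussian kernel can be pictured as an ordinary anisotropic Gaussian falloff whose opacity switches across a learnable splitting plane. The parameterization below is an illustrative sketch, not the paper's rasterizer implementation.

    import torch

    def half_gaussian(x, mu, inv_cov, normal, alpha_pos, alpha_neg):
        """x: (N, 3) query points, mu: (3,) center, inv_cov: (3, 3) inverse covariance,
        normal: (3,) splitting-plane normal, alpha_pos/alpha_neg: scalar opacity tensors."""
        d = x - mu
        g = torch.exp(-0.5 * torch.einsum('ni,ij,nj->n', d, inv_cov, d))  # Gaussian falloff
        side = (d @ normal) >= 0                       # which half of the splitting plane?
        alpha = torch.where(side, alpha_pos, alpha_neg)
        return alpha * g                               # per-point contribution
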
2406.02549 Report Dreamguider: Improved Training free Diffusion-based Conditional Generation Nithin Gopalakrishnan Nair, Vishal M Patel Diffusion models have emerged as a formidable tool for training-free conditional generation.However, a key hurdle in inference-time guidance techniques is the need for compute-heavy backpropagation through the diffusion network for estimating the guidance direction. Moreover, these techniques often require handcrafted parameter tuning on a case-by-case basis. Although some recent works have introduced minimal compute methods for linear inverse problems, a generic lightweight guidance solution to both linear and non-linear guidance problems is still missing. To this end, we propose Dreamguider, a method that enables inference-time guidance without compute-heavy backpropagation through the diffusion network. The key idea is to regulate the gradient flow through a time-varying factor. Moreover, we propose an empirical guidance scale that works for a wide variety of tasks, hence removing the need for handcrafted parameter tuning. We further introduce an effective lightweight augmentation strategy that significantly boosts the performance during inference-time guidance. We present experiments using Dreamguider on multiple tasks across multiple datasets and models to show the effectiveness of the proposed modules. To facilitate further research, we will make the code public after the review process. This paper introduces Dreamguider, a method for inference-time guidance in diffusion models that avoids computationally expensive backpropagation through the network, enabling zero-shot conditional generation. Existing inference-time guidance techniques for diffusion models often require heavy computations and case-by-case parameter tuning, limiting their practicality. Dreamguider addresses these limitations with a lightweight and generic approach. Dreamguider regulates the gradient flow during inference using a time-varying factor and employs a gradient-dependent scaling factor for automatic parameter tuning. It also introduces DiffuseAugment, a differentiable augmentation strategy, to enhance sampling quality. Dreamguider achieves superior performance on linear inverse problems (e.g., super-resolution, colorization) compared to DPS and MGD. For non-linear tasks (e.g., sketch-to-face, ID guidance), Dreamguider outperforms existing methods like Freedom and MGD in terms of image quality and sampling speed. The proposed empirical scaling factor and DiffuseAugment effectively enhance the performance of zero-shot conditional generation. Direct application to latent diffusion models for linear inverse problems is limited due to VAE reconstruction errors. While the empirical scaling factor demonstrates effectiveness, a comprehensive mathematical analysis for optimal parameter estimation is left for future work. diffusion models, inference-time guidance, zero-shot learning, conditional generation, image restoration
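A rough sketch of backpropagation-free guidance in the spirit described above: the guidance loss gradient is taken with respect to the clean estimate only (never through the denoiser), then rescaled by a time-varying, gradient-norm-dependent factor. The specific factor and function names below are assumptions, not the paper's empirical scale.

    import torch

    def light_guidance_step(x_t, x0_hat, loss_fn, sigma_t, base=1.0):
        """x0_hat: clean estimate produced by the denoiser (treated as a leaf tensor);
        loss_fn scores the guidance condition on x0_hat. No gradient flows through
        the diffusion network itself."""
        x0_hat = x0_hat.detach().requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(x0_hat), x0_hat)[0]
        scale = base * sigma_t / (grad.flatten().norm() + 1e-8)   # time-varying, self-normalizing
        return x_t - scale * grad                                 # nudge the noisy sample
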
2406.02548 Report Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that a better performance of matching text prompts to 3D masks can be achieved in a faster fashion with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup compared to the best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D. Proposes Open-YOLO 3D, a fast and accurate open-vocabulary 3D instance segmentation method using 2D object detection from multi-view RGB images. Existing open-vocabulary 3D instance segmentation methods are computationally expensive and slow, hindering real-world applications requiring fast and accurate predictions. Generates class-agnostic 3D masks and associates them with text prompts using a 2D open-vocabulary object detector to create low-granularity label maps for each frame, then uses these maps to predict labels for the 3D masks. Achieves state-of-the-art performance on ScanNet200 and Replica datasets. Up to 16x faster than existing methods. Demonstrates the effectiveness of 2D object detection for open-vocabulary 3D instance segmentation. Relies solely on a 3D proposal network, potentially missing small objects. Could benefit from incorporating fast 2D instance segmentation for enhanced 3D proposal generation. 3d instance segmentation, open-vocabulary, 2d object detection, multi-view, zero-shot learning
2406.02547 Report Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou Training models with longer in-context lengths is a significant challenge for multimodal model due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to increase in-context text length in multi-modality large language models (MLLMs) efficiently. We present Visualized In-Context Text Processing (VisInContext), which processes long in-context text using visual tokens. This technique significantly reduces GPU memory usage and floating point operations (FLOPs) for both training and inferenceing stage. For instance, our method expands the pre-training in-context text length from 256 to 2048 tokens with nearly same FLOPs for a 56 billion parameter MOE model. Experimental results demonstrate that model trained with VisInContext delivers superior performance on common downstream benchmarks for in-context few-shot evaluation. Additionally, VisInContext is complementary to existing methods for increasing in-context text length and enhances document understanding capabilities, showing great potential in document QA tasks and sequential document retrieval. This paper proposes VisInContext, a novel method to increase the in-context text length of Multimodal Large Language Models (MLLMs) by rendering text as images, thereby reducing computational cost. Training MLLMs with long in-context lengths is crucial for complex tasks like document understanding but is hindered by high GPU memory and computational costs. VisInContext converts long text into images and processes them using a visual encoder alongside regular images. It introduces Token Masking and Text-Centric Contrastive Learning (TCCL) to ensure the model effectively learns text semantics from these rendered images. VisInContext significantly improves performance on multimodal downstream tasks by increasing the effective in-context text length. It achieves comparable performance to raw text inputs when using rendered text images for text-only in-context examples. VisInContext significantly improves the model's document understanding abilities on tasks like DocVQA and OCRVQA. Currently, VisInContext uses a fixed image size even for short texts, leading to potential inefficiencies. Future work will explore dynamically adjusting image sizes to optimize token usage. multimodal learning, large language models, document understanding, in-context learning, computational efficiency
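The core move, turning long in-context text into visual tokens, amounts to rendering the text onto a canvas that the vision encoder then patchifies. A minimal sketch with Pillow follows; the canvas size, font, and wrapping width are arbitrary choices, not the paper's settings.

    from PIL import Image, ImageDraw, ImageFont
    import textwrap

    def render_text_to_image(text, size=(448, 448), chars_per_line=48):
        """Render long in-context text onto a fixed-size canvas so it can be consumed
        as visual tokens by the image encoder (illustrative sketch)."""
        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()
        wrapped = "\n".join(textwrap.wrap(text, width=chars_per_line))
        draw.multiline_text((8, 8), wrapped, fill="black", font=font)
        return img
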
2406.02541 Report Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting Inkyu Shin, Qihang Yu, Xiaohui Shen, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen Recent advancements in zero-shot video diffusion models have shown promise for text-driven video editing, but challenges remain in achieving high temporal consistency. To address this, we introduce Video-3DGS, a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors. Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos. In the first stage, Video-3DGS employs an improved version of COLMAP, referred to as MC-COLMAP, which processes original videos using a Masked and Clipped approach. For each video clip, MC-COLMAP generates the point clouds for dynamic foreground objects and complex backgrounds. These point clouds are utilized to initialize two sets of 3D Gaussians (Frg-3DGS and Bkg-3DGS) aiming to represent foreground and background views. Both foreground and background views are then merged with a 2D learnable parameter map to reconstruct full views. In the second stage, we leverage the reconstruction ability developed in the first stage to impose the temporal constraints on the video diffusion model. To demonstrate the efficacy of Video-3DGS on both stages, we conduct extensive experiments across two related tasks: Video Reconstruction and Video Editing. Video-3DGS trained with 3k iterations significantly improves video reconstruction quality (+3 PSNR, +7 PSNR increase) and training efficiency (x1.9, x4.5 times faster) over NeRF-based and 3DGS-based state-of-art methods on DAVIS dataset, respectively. Moreover, it enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos. This paper introduces Video-3DGS, a two-stage 3D Gaussian Splatting based framework that reconstructs and refines dynamic monocular video scenes, leading to significant improvements in both video reconstruction and editing. Existing zero-shot video diffusion models face challenges in achieving high temporal consistency due to their limited understanding of individual video scenes. Video-3DGS aims to address this limitation by leveraging the per-scene representation power of 3DGS. In the first stage, Video-3DGS utilizes an improved COLMAP (MC-COLMAP) to generate foreground and background point clouds, which are used to initialize and optimize two sets of 3D Gaussians. These 3D Gaussians, along with a 2D learnable parameter map, enable high-fidelity video reconstruction. In the second stage, the pre-optimized Video-3DGS serves as a plug-and-play refiner for zero-shot video editors, enhancing temporal consistency by fine-tuning color and opacity parameters while maintaining structural fidelity. Video-3DGS significantly outperforms NeRF-based and 3DGS-based state-of-the-art methods in video reconstruction quality and training efficiency on the DAVIS dataset. It consistently enhances temporal consistency and overall editing quality across three off-the-shelf video editors (Text2Video-Zero, TokenFlow, and RAVE) on 58 challenging monocular videos. User studies confirm a strong preference for Video-3DGS-refined edits over baseline outputs, highlighting its effectiveness in improving video editing quality. Video-3DGS faces challenges when foreground objects exhibit extremely large motion, and it may struggle with edits requiring significant changes to object shapes. 
Future work includes exploring the potential of Video-3DGS as a fundamental framework for 4D novel view synthesis. video editing, 3d gaussian splatting, temporal consistency, zero-shot learning, video reconstruction
2406.02539 Report Parrot: Multilingual Visual Instruction Tuning Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs' inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available. This paper introduces Parrot, a novel method to enhance multilingual capabilities in Multimodal Large Language Models (MLLMs) by leveraging textual guidance to drive visual token alignment at the language level, addressing the issue of English-centric bias in training data. Multilingual capability is crucial for MLLMs to cater to diverse linguistic groups and ensure equitable access to AI benefits across different regions and languages. Parrot utilizes a Mixture-of-Experts (MoE) module to convert English-biased visual features into language-specific embeddings based on input language, enabling the model to better understand and generate responses in various languages. Parrot achieves state-of-the-art performance on both MMBench and MMMB multilingual benchmarks, surpassing existing methods in most languages. The model shows competitive performance across a broad range of multimodal tasks, indicating its effectiveness beyond multilingual capabilities. Parrot achieves significant improvements with significantly less multilingual training data compared to other models, demonstrating its efficiency in low-resource scenarios. MLLMs, including Parrot, may still face challenges in accurately understanding complex language-specific contexts and may exhibit hallucinations. The current implementation of Parrot relies on CLIP for visual processing, limiting its ability to process high-resolution images effectively. 
multimodal large language models, multilingual alignment, mixture-of-experts, textual guidance, visual token alignment
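A schematic of the text-guided expert routing described above: visual tokens cross-attend to the text embeddings, the attended result drives an MoE router, and the selected experts re-project the visual tokens into language-specific ones. The shapes, residual connection, and soft (rather than top-k) routing are simplifying assumptions.

    import torch

    def language_specific_tokens(vis, txt, w_q, w_k, router, experts):
        """vis: (N, d) visual tokens, txt: (M, d) text embeddings, w_q/w_k: (d, d),
        router: nn.Linear(d, E), experts: list of E nn.Linear(d, d). Sketch only."""
        attn = torch.softmax((vis @ w_q) @ (txt @ w_k).T / vis.shape[-1] ** 0.5, dim=-1)
        ctx = attn @ txt                                  # text context per visual token
        gates = torch.softmax(router(ctx), dim=-1)        # (N, E) soft expert weights
        out = sum(gates[:, i:i + 1] * e(vis) for i, e in enumerate(experts))
        return vis + out                                  # language-specific visual tokens
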
2406.02535 Report Enhancing 2D Representation Learning with a 3D Prior Mehmet Aygün, Prithviraj Dhar, Zhicheng Yan, Oisin Mac Aodha, Rakesh Ranjan Learning robust and effective representations of visual data is a fundamental task in computer vision. Traditionally, this is achieved by training models with labeled data which can be expensive to obtain. Self-supervised learning attempts to circumvent the requirement for labeled data by learning representations from raw unlabeled visual data alone. However, unlike humans who obtain rich 3D information from their binocular vision and through motion, the majority of current self-supervised methods are tasked with learning from monocular 2D image collections. This is noteworthy as it has been demonstrated that shape-centric visual processing is more robust compared to texture-biased automated methods. Inspired by this, we propose a new approach for strengthening existing self-supervised methods by explicitly enforcing a strong 3D structural prior directly into the model during training. Through experiments, across a range of datasets, we demonstrate that our 3D aware representations are more robust compared to conventional self-supervised baselines. This paper introduces a novel method to enhance the robustness of self-supervised learning (SSL) by explicitly incorporating 3D structural information during training. Current SSL methods primarily focus on 2D image collections, leading to representations that may over-rely on texture and exhibit limited robustness. This work draws inspiration from the human visual system's use of 3D cues for robust understanding. The method leverages a proxy 3D reconstruction task. A pre-trained SSL backbone extracts image representations, which are then used to generate 3D triplane features. Volume rendering reconstructs the input image and its depth, using pseudo-depth obtained from pre-trained monocular depth models. A distillation loss from the frozen SSL backbone prevents forgetting of previously learned features. The 3D-aware representations demonstrate improved robustness on benchmarks like ImageNet-Rendition, ImageNet-Sketch, and PUG, outperforming baselines that lack 3D priors. The method does not compromise performance on other downstream tasks, showing comparable or improved results on ImageNet classification, iNat21 fine-grained classification, and NYU-DepthV2 depth estimation. Analysis confirms an increased shape bias in the learned representations, supporting the hypothesis that incorporating 3D knowledge encourages more robust and shape-centric feature learning. The method relies on pseudo-depth maps during training, which could introduce limitations depending on the accuracy and generalization of the pre-trained depth estimation model. Future work could explore incorporating semantic information during training or investigating alternative 3D representations beyond triplanes. self-supervised learning, 3d reconstruction, robustness, shape bias, representation learning
2406.02528 Report Scalable MatMul-free Language Modeling Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at https://github.com/ridgerchu/matmulfreellm. This paper introduces the first scalable MatMul-free language model (MatMul-free LM) that eliminates matrix multiplication operations by utilizing additive operations in dense layers and element-wise Hadamard products for self-attention-like functions. Matrix multiplication (MatMul) is a computationally expensive operation that dominates the cost of large language models (LLMs), particularly as models scale to larger sizes. This work aims to address this challenge by developing a more efficient architecture. The paper proposes a novel architecture called MatMul-free LM, which replaces MatMul operations with ternary accumulations in dense layers and employs a MatMul-free token mixer based on a modified Gated Recurrent Unit (GRU). The model is trained using a surrogate gradient method and a large learning rate. The MatMul-free LM achieves performance on par with state-of-the-art Transformers while using significantly less memory during inference. The scaling law analysis reveals that the performance gap between the MatMul-free LM and full-precision Transformers narrows as model size increases. A custom FPGA implementation demonstrates the hardware efficiency of the MatMul-free LM, achieving brain-like efficiency at billion-parameter scales. The MatMul-free LM has not been tested on extremely large-scale models (e.g., 100B+ parameters) due to computational constraints. Further research is needed to explore the potential of MatMul-free architectures for other natural language processing tasks beyond language modeling. language modeling, matrix multiplication, ternary networks, fpga acceleration, efficient deep learning
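The "ternary accumulation" in the dense layers can be sketched as BitNet-style quantization of weights to {-1, 0, +1}, under which the matrix product reduces to additions and subtractions. The scaling rule below is an approximation, and the straight-through estimator used during training is omitted.

    import torch

    def ternary_linear(x, w, eps=1e-5):
        """x: (batch, d_in), w: (d_in, d_out) full-precision shadow weights.
        Quantize weights to {-1, 0, +1}; the product then needs no multiplications
        (the matmul here stands in for a hardware add/subtract accumulation)."""
        scale = w.abs().mean()
        w_ternary = torch.clamp(torch.round(w / (scale + eps)), -1, 1)
        return (x @ w_ternary) * scale
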
2406.02511 Report V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu, Xiao Han, Wei Yang In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent. A common approach involves leveraging generative models to enhance adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose, depth map, etc.) can vary in strength. Among these, weaker conditions often struggle to be effective due to interference from stronger conditions, posing a challenge in balancing these conditions. In our work on portrait video generation, we identified audio signals as particularly weak, often overshadowed by stronger signals such as facial pose and reference image. However, direct training with weak signals often leads to difficulties in convergence. To address this, we propose V-Express, a simple method that balances different control signals through the progressive training and the conditional dropout operation. Our method gradually enables effective control by weak conditions, thereby achieving generation capabilities that simultaneously take into account the facial pose, reference image, and audio. The experimental results demonstrate that our method can effectively generate portrait videos controlled by audio. Furthermore, a potential solution is provided for the simultaneous and effective use of conditions of varying strengths. Presents V-Express, a novel method for generating high-quality portrait videos with synchronized audio, balancing control signals of varying strengths through progressive training and conditional dropout operations. Addresses the challenge in portrait video generation where weaker control signals, like audio, are often overshadowed by stronger ones (e.g., pose, reference image), limiting control and synchronization. Utilizes a Latent Diffusion Model (LDM) with ReferenceNet, V-Kps Guider, and Audio Projection to handle control inputs. Progressive training gradually incorporates control, while conditional dropout prevents shortcut learning from dominant signals. Effectively generates high-quality portrait videos synchronized with audio input. Maintains consistency in facial identity and pose guided by reference images and V-Kps. Demonstrates a balanced approach to integrating multiple control signals with varying strengths. Limited multilingual support due to the English-centric Wav2Vec2 audio encoder. Slow generation speed due to the autoregressive diffusion process for multi-frame generation. portrait video generation, audio-driven video generation, latent diffusion model, control signal balancing, conditional dropout
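The conditional dropout idea, randomly suppressing the stronger conditions so the weak audio signal is still learned, can be sketched as follows; the dropout probabilities and the zeroing strategy are illustrative assumptions, not the paper's schedule.

    import torch

    def drop_conditions(ref_feat, pose_feat, audio_feat, p_strong=0.3, training=True):
        """Randomly drop the stronger conditions (reference image, pose) during
        training so that gradients are forced to use the weaker audio condition."""
        if training:
            if torch.rand(()) < p_strong:
                ref_feat = torch.zeros_like(ref_feat)
            if torch.rand(()) < p_strong:
                pose_feat = torch.zeros_like(pose_feat)
        return ref_feat, pose_feat, audio_feat
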
2406.02509 Report CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, Arash Vahdat Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Pl\"ucker coordinates. To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block that enforces epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models while effectively generating plausible object motion. Project page: https://ir1d.github.io/CamCo/ This paper presents CamCo, a novel image-to-video generation framework that enables fine-grained camera control and ensures 3D consistency in generated videos. Controlling camera motion is crucial for cinematic expression and practical applications of generated videos, but existing video generation models often lack this capability. CamCo leverages Plücker coordinates for accurate camera pose representation and integrates an epipolar constraint attention module to enforce geometric consistency across frames. The model is trained on a dataset augmented with dynamic videos and their camera pose annotations. CamCo significantly improves 3D consistency and camera control accuracy compared to previous state-of-the-art methods. The model demonstrates superior visual quality, as evidenced by FID and FVD metrics. CamCo effectively generates plausible object motion in addition to camera ego-motion. The model currently cannot generate complex camera intrinsic changes (e.g., dolly zoom). The output video length and resolution are limited, potentially restricting its application in large-scale scenes. video generation, camera control, 3d consistency, diffusion models, epipolar constraint
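The Plücker parameterization assigns every pixel ray a 6-D coordinate (direction, origin × direction), which is the per-pixel camera-pose signal the generator is conditioned on. A minimal sketch (how origins and directions are derived from extrinsics/intrinsics is left out):

    import torch
    import torch.nn.functional as F

    def plucker_embedding(origins, directions):
        """origins, directions: (..., 3) per-pixel ray origins and directions.
        Returns the (..., 6) Plücker coordinates (d, o x d)."""
        d = F.normalize(directions, dim=-1)
        moment = torch.cross(origins, d, dim=-1)     # o x d
        return torch.cat([d, moment], dim=-1)
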
2406.02507 Report Guiding a Diffusion Model with a Bad Version of Itself Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality. This paper introduces "autoguidance," a novel method for enhancing image quality in diffusion models by guiding generation using a smaller, less-trained version of the model itself, rather than an unconditional model like in classifier-free guidance (CFG). This is important because it provides disentangled control over image quality and variation, addressing limitations of CFG, which entangles these aspects and can lead to over-simplified image compositions. The method leverages the observation that score matching in diffusion models leads to over-emphasis of low-probability regions. By using a weaker model trained on the same task and data distribution, autoguidance identifies and reduces errors in the stronger model's predictions, leading to improved sample quality without sacrificing variation. Autoguidance achieves significant FID and DINO improvements on ImageNet-512 and ImageNet-64, setting new records for these datasets. It allows for independent control of image quality and variation, enabling the generation of diverse and high-fidelity images. The method can be applied to both conditional and unconditional diffusion models, substantially improving the quality of unconditional generation, which is typically poor. One limitation is the need for early snapshots of smaller models for optimal guidance, which might not be readily available for all large-scale generators. Future work could explore formalizing the conditions under which autoguidance is beneficial and developing better guidelines for selecting the best guiding model. diffusion models, image generation, classifier-free guidance, autoguidance, image quality
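The guidance rule itself is a one-liner: extrapolate the main model's denoising output away from that of a smaller, less-trained version of the same model (w = 1 recovers the main model; larger w strengthens guidance). Function names below are placeholders.

    def autoguided_denoise(D_main, D_weak, x, sigma, label, w=2.0):
        """D_main / D_weak: denoisers with identical conditioning; D_weak is a smaller
        or earlier-training-snapshot version of D_main."""
        d_hi = D_main(x, sigma, label)
        d_lo = D_weak(x, sigma, label)
        # same extrapolation form as classifier-free guidance, but the "bad version
        # of itself" replaces the unconditional model
        return d_lo + w * (d_hi - d_lo)
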
2406.02495 Report GenS: Generalizable Neural Surface Reconstruction from Multi-View Images Rui Peng, Xiaodong Gu, Luyang Tang, Shihe Shen, Fanqi Yu, Ronggang Wang Combining the signed distance function (SDF) and differentiable volume rendering has emerged as a powerful paradigm for surface reconstruction from multi-view images without 3D supervision. However, current methods are impeded by requiring long-time per-scene optimizations and cannot generalize to new scenes. In this paper, we present GenS, an end-to-end generalizable neural surface reconstruction model. Unlike coordinate-based methods that train a separate network for each scene, we construct a generalized multi-scale volume to directly encode all scenes. Compared with existing solutions, our representation is more powerful, which can recover high-frequency details while maintaining global smoothness. Meanwhile, we introduce a multi-scale feature-metric consistency to impose the multi-view consistency in a more discriminative multi-scale feature space, which is robust to the failures of the photometric consistency. And the learnable feature can be self-enhanced to continuously improve the matching accuracy and mitigate aggregation ambiguity. Furthermore, we design a view contrast loss to force the model to be robust to those regions covered by few viewpoints through distilling the geometric prior from dense input to sparse input. Extensive experiments on popular benchmarks show that our model can generalize well to new scenes and outperform existing state-of-the-art methods even those employing ground-truth depth supervision. Code is available at https://github.com/prstrive/GenS. This paper presents GenS, an end-to-end generalizable neural surface reconstruction model that efficiently reconstructs detailed 3D structures from multi-view images without requiring expensive per-scene optimization. Current neural surface reconstruction methods suffer from lengthy per-scene optimization and lack of generalization to new scenes, limiting their applicability. GenS leverages a generalized multi-scale volume to represent scenes efficiently. It introduces multi-scale feature-metric consistency for robust multi-view matching and a view contrast loss to improve reconstruction accuracy for sparsely viewed regions. GenS outperforms state-of-the-art generalizable methods and even some per-scene optimization methods on DTU dataset. The model demonstrates strong generalization ability on BlendedMVS dataset. Ablation studies confirm the effectiveness of each proposed component. The model struggles with scenes containing large camera motion. Future work will focus on handling challenging scenarios with improved aggregation features. neural surface reconstruction, generalizable model, multi-view consistency, multi-scale volume, view contrast loss
2406.02485 Report Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation Jiajun Wang, Morteza Ghahremani, Yitong Li, Björn Ommer, Christian Wachinger Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the LAION-Human dataset, marking around 13% improvement over the established technique ControlNet. The project link and code is available at https://github.com/ai-med/StablePose. Stable-Pose, a novel adapter model for controllable text-to-image (T2I) diffusion models, improves pose control in human image synthesis by employing a coarse-to-fine attention masking strategy within a vision transformer (ViT). Current T2I models struggle with accurate pose guidance, particularly in complex poses (side or rear views). Stable-Pose addresses this by effectively aligning pose representation during image synthesis. Stable-Pose integrates a trainable ViT unit into pre-trained T2I models like Stable Diffusion. It utilizes a coarse-to-fine masking approach in the self-attention mechanism to focus on pose-related regions and a pose-mask guided loss for enhanced pose fidelity. Stable-Pose achieves superior pose accuracy (AP and CAP) compared to state-of-the-art methods on five datasets. The model exhibits robust performance in challenging scenarios like side/back poses and multiple figures. Stable-Pose maintains comparable image quality (FID and KID) and text-image alignment (CLIP score) to other methods. Stable-Pose's inference time is slightly longer due to the ViT's self-attention mechanism. The model's performance with conditions other than pose (e.g., edge maps) has yet to be evaluated. text-to-image generation, diffusion models, pose control, vision transformer, attention mechanism
2406.02461 Report RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting Qi Wang, Ruijie Lu, Xudong Xu, Jingbo Wang, Michael Yu Wang, Bo Dai, Gang Zeng, Dan Xu The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes. In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which is regarded as the coarse reference to ensure the global texture consistency. In the fine stage, based on the panoramic image and perspective depth maps, RoomTex will refine and texture every single object in the room iteratively along a series of selected camera views, until this object is completely painted. Moreover, we propose to maintain superior alignment between RGB and depth spaces via subtle edge detection methods. Extensive experiments show our method is capable of generating high-quality and diverse room textures, and more importantly, supporting interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input. Our project page is available at https://qwang666.github.io/RoomTex/. Proposes RoomTex, a coarse-to-fine 3D scene texturing framework, for generating high-fidelity and style-consistent textures for untextured compositional scene meshes. Automating scene texturing is important for various industries (gaming, filming, AR/VR) but challenging due to style inconsistency and occlusions between objects in a scene. Uses a coarse stage to generate a style-consistent room panorama from a panoramic depth map and text prompt. Then, a fine stage refines the panorama and iteratively textures each object from different viewpoints using depth-guided inpainting and an edge detection module for RGB-depth alignment. Generates high-quality, diverse, and style-consistent room textures on par with those in professional datasets. Supports interactive fine-grained texture control, enabling users to edit specific areas using sketches or text descriptions. Enables flexible scene editing by leveraging the compositional nature of the input mesh, allowing for adding, removing, or modifying individual objects. Iterative inpainting may not capture all object views in one run, leading to potential texture inconsistencies. Fine-grained details on generated objects, especially those with complex topology, can be challenging. scene texturing, scene generation, texture synthesis, diffusion models, 3d scene understanding
2406.02407 Report WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections Yuze Wang, Junyi Wang, Yue Qi Novel View Synthesis (NVS) from unconstrained photo collections is challenging in computer graphics. Recently, 3D Gaussian Splatting (3DGS) has shown promise for photorealistic and real-time NVS of static scenes. Building on 3DGS, we propose an efficient point-based differentiable rendering framework for scene reconstruction from photo collections. Our key innovation is a residual-based spherical harmonic coefficients transfer module that adapts 3DGS to varying lighting conditions and photometric post-processing. This lightweight module can be pre-computed and ensures efficient gradient propagation from rendered images to 3D Gaussian attributes. Additionally, we observe that the appearance encoder and the transient mask predictor, the two most critical parts of NVS from unconstrained photo collections, can be mutually beneficial. We introduce a plug-and-play lightweight spatial attention module to simultaneously predict transient occluders and latent appearance representation for each image. After training and preprocessing, our method aligns with the standard 3DGS format and rendering pipeline, facilitating seamlessly integration into various 3DGS applications. Extensive experiments on diverse datasets show our approach outperforms existing approaches on the rendering quality of novel view and appearance synthesis with high converge and rendering speed. WE-GS, an efficient point-based differentiable rendering framework, reconstructs scenes from unconstrained photo collections, effectively handling appearance variations and transient occluders. Existing methods struggle to balance rendering quality, speed, and storage efficiency when dealing with real-world photo collections containing varying lighting and moving objects. The framework introduces: (1) a residual-based Spherical Harmonic coefficient transfer module for efficient appearance modeling under varying lighting, and (2) a lightweight spatial attention module to simultaneously predict transient masks and latent appearance representations for each image. Achieves state-of-the-art novel view and appearance synthesis quality on PhotoTourism and NeRF-OSR datasets. Significantly reduces storage requirements (over 2x compared to 3DGS) while maintaining real-time rendering speed. Demonstrates superior efficiency with fast training times (over 17x faster than NeRF-based methods). The performance of WE-GS can be affected by the quality of the initial 3D Gaussian estimates from SfM. Further exploration of the trade-off between rendering quality and efficiency is possible. novel view synthesis, unconstrained photo collection, appearance modeling, real-time rendering, 3d gaussian splatting
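A minimal PyTorch sketch of the residual-based spherical-harmonic transfer idea: a small MLP maps a per-image appearance latent (together with the shared SH coefficients) to a residual added onto those coefficients. Module name, layer sizes, and the exact inputs are assumptions, not the released implementation.

import torch
import torch.nn as nn

class ResidualSHTransfer(nn.Module):
    """Predict a per-image residual on the shared SH color coefficients of each Gaussian."""
    def __init__(self, sh_dim=48, embed_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sh_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, sh_dim))

    def forward(self, base_sh, image_embedding):
        # base_sh: (N, sh_dim) shared coefficients; image_embedding: (embed_dim,) latent appearance
        e = image_embedding.expand(base_sh.shape[0], -1)
        return base_sh + self.mlp(torch.cat([base_sh, e], dim=-1))  # residual keeps the shared SH intact

# toy usage: 1000 Gaussians, one per-image appearance latent
module = ResidualSHTransfer()
out = module(torch.randn(1000, 48), torch.randn(32))
print(out.shape)  # torch.Size([1000, 48])

Since the module only adds a residual to standard SH coefficients, the corrected coefficients can be baked back into the usual 3DGS format, which is one way to read the pre-computation claim in the abstract.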
2406.02395 Report GrootVL: Tree Topology is All You Need in State Space Model Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan The state space models, employing recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of sequences, it still falls short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree topology based on spatial relationships and input features. Then, feature propagation is performed based on this graph, thereby breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost. This paper proposes GrootVL, a novel framework employing an input-aware tree topology for feature propagation in state-space models to enhance long-range dependency modeling for both visual and language tasks. Existing state-space models, while efficient, struggle to capture long-range dependencies. Fixed scanning strategies used for adapting to vision tasks fail to preserve 2D structural information, limiting their effectiveness. GrootVL utilizes a tree-scanning algorithm to dynamically generate a tree topology based on input features, enabling more effective long-range interactions. It employs a linear complexity dynamic programming algorithm for efficient propagation. GrootVL significantly outperforms existing structured state-space models on image classification, object detection, and segmentation tasks. GrootV, the visual sub-network, achieves competitive performance with CNN and Transformer-based approaches on ImageNet, MSCOCO, and ADE20K benchmarks. GrootL, the language sub-network, consistently improves language representation for pre-trained large language models with minor training cost, as demonstrated on various language understanding benchmarks. The tree structure in GrootVL requires specific hardware optimization. Future work could explore the generalization of the tree topology to other applications beyond vision and language tasks. state-space models, long-range dependencies, tree topology, dynamic programming, multi-modal learning
2406.02347 Report Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation Clement Chadebec, Onur Tasar, Eyal Benaroche, Benjamin Aubin In this paper, we propose an efficient, fast, and versatile distillation method to accelerate the generation of pre-trained diffusion models: Flash Diffusion. The method reaches state-of-the-art performances in terms of FID and CLIP-Score for few steps image generation on the COCO2014 and COCO2017 datasets, while requiring only several GPU hours of training and fewer trainable parameters than existing methods. In addition to its efficiency, the versatility of the method is also exposed across several tasks such as text-to-image, inpainting, face-swapping, super-resolution and using different backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-$\alpha$), as well as adapters. In all cases, the method allowed to reduce drastically the number of sampling steps while maintaining very high-quality image generation. The official implementation is available at https://github.com/gojasper/flash-diffusion. This paper introduces Flash Diffusion, a novel distillation method designed to accelerate the image generation process of pre-trained diffusion models. Diffusion models, while powerful, suffer from slow generation speeds due to the iterative nature of their sampling process. Flash Diffusion addresses this by significantly reducing the number of sampling steps required, making them more practical for real-time applications. The method trains a student model to predict the output of a multi-step teacher model in a single step. It uses a combination of a distillation loss, an adversarial loss to enhance sample quality, and a distribution matching loss to ensure the student model's output closely resembles the teacher's learned data distribution. Flash Diffusion achieves state-of-the-art FID and CLIP scores for few-step image generation on COCO2014 and COCO2017 datasets. The method demonstrates versatility by effectively performing across various tasks such as text-to-image, inpainting, super-resolution, and face-swapping. It exhibits strong compatibility with different diffusion model architectures like UNet and DiT, as well as with adapters. Further reduction in the number of NFEs is desirable to push the boundaries of real-time generation. Exploring the application of direct preference optimization techniques on the student model could potentially lead to further enhancements in sample quality. diffusion models, distillation, image generation, fast sampling, generative models
2406.02230 Report I4VGen: Image as Stepping Stone for Text-to-Video Generation Xiefan Guo, Jinlin Liu, Miaomiao Cui, Di Huang Text-to-video generation has lagged behind text-to-image synthesis in quality and diversity due to the complexity of spatio-temporal modeling and limited video-text datasets. This paper presents I4VGen, a training-free and plug-and-play video diffusion inference framework, which enhances text-to-video generation by leveraging robust image techniques. Specifically, following text-to-image-to-video, I4VGen decomposes the text-to-video generation into two stages: anchor image synthesis and anchor image-guided video synthesis. Correspondingly, a well-designed generation-selection pipeline is employed to achieve visually-realistic and semantically-faithful anchor image, and an innovative Noise-Invariant Video Score Distillation Sampling is incorporated to animate the image to a dynamic video, followed by a video regeneration process to refine the video. This inference strategy effectively mitigates the prevalent issue of non-zero terminal signal-to-noise ratio. Extensive evaluations show that I4VGen not only produces videos with higher visual realism and textual fidelity but also integrates seamlessly into existing image-to-video diffusion models, thereby improving overall video quality. Introduces I4VGen, a training-free and plug-and-play inference framework for text-to-video generation that leverages image synthesis techniques to improve video quality and text-alignment. Text-to-video generation lags behind text-to-image generation due to complex spatio-temporal modeling and limited video-text datasets. I4VGen aims to bridge this gap by leveraging robust image generation techniques without additional training. I4VGen decomposes the process into two stages: 1) Anchor image synthesis: Generates multiple candidate images from the text prompt and selects the best one using a reward-based mechanism. 2) Anchor image-guided video synthesis: Animates the static anchor image using Noise-Invariant Video Score Distillation Sampling (NI-VSDS) and refines it through a video regeneration process. Significantly improves the visual realism and textual fidelity of generated videos. Outperforms existing text-to-video generation methods in benchmark evaluations (VBench) across various aspects like temporal consistency, frame quality, and text alignment. Demonstrates versatility by seamlessly integrating with existing image-to-video diffusion models and enabling user-provided image animation. Inference time is longer than baseline models, though shorter than some methods like FreeInit. Direct integration with FreeInit doesn't yield significant improvements. text-to-video generation, diffusion models, image-guided synthesis, score distillation sampling, video quality enhancement
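The anchor-image "generation-selection" stage reduces to sample-then-rank. The sketch below shows that loop with hypothetical generate_image and reward_fn callables (standing in for a text-to-image sampler and an image-reward scorer); it is not the authors' code.

def select_anchor_image(prompt, generate_image, reward_fn, num_candidates=8):
    """Generate several candidate anchor images and keep the highest-scoring one."""
    candidates = [generate_image(prompt, seed=s) for s in range(num_candidates)]
    scores = [reward_fn(prompt, img) for img in candidates]
    best = max(range(num_candidates), key=lambda k: scores[k])
    return candidates[best], scores[best]

# toy usage with stand-in callables
anchor, score = select_anchor_image(
    "a corgi surfing a wave",
    generate_image=lambda p, seed: f"image_{seed}",   # placeholder for a T2I sampler
    reward_fn=lambda p, img: len(img),                # placeholder for an image-reward model
)
print(anchor, score)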
2406.02058 Report OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Jian Zhang This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding. Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing. These methods struggle with 3D point-level tasks due to weak feature expressiveness and inaccurate 2D-3D feature associations. To ensure robust feature presentation and 3D point-level understanding, we first employ SAM masks without cross-frame associations to train instance features with 3D consistency. These features exhibit both intra-object consistency and inter-object distinction. Then, we propose a two-stage codebook to discretize these features from coarse to fine levels. At the coarse level, we consider the positional information of 3D points to achieve location-based clustering, which is then refined at the fine level. Finally, we introduce an instance-level 3D-2D feature association method that links 3D points to 2D masks, which are further associated with 2D CLIP features. Extensive experiments, including open vocabulary-based 3D object selection, 3D point cloud understanding, click-based 3D object selection, and ablation studies, demonstrate the effectiveness of our proposed method. Project page: https://3d-aigc.github.io/OpenGaussian This paper introduces OpenGaussian, a 3DGS-based method for 3D point-level open vocabulary understanding by associating high-dimensional CLIP features with 3D Gaussian points. Existing 3DGS-based open vocabulary methods struggle with 3D point-level tasks due to weak feature expressiveness and inaccurate 2D-3D feature associations, hindering applications requiring 3D point-level understanding, like robotics. The method involves: 1) Training 3D point-level instance features with intra-mask smoothing and inter-mask contrastive loss using SAM masks, 2) Discretizing these features using a two-level coarse-to-fine codebook, and 3) Proposing a training-free instance-level 2D-3D association method based on IoU and feature distance to associate CLIP features with 3D instances. OpenGaussian outperforms existing methods in open-vocabulary 3D object selection and point cloud understanding tasks. The method enables accurate click-based 3D object selection without requiring SAM feature supervision. The two-level codebook and instance-level 2D-3D feature association are shown to be crucial for achieving high performance. The method relies on the accuracy of pre-trained SAM masks for instance feature learning. The computational cost of rendering and processing numerous Gaussians remains a challenge for large-scale scenes. 3d gaussian splatting, open vocabulary understanding, 3d point cloud segmentation, instance feature learning, 2d-3d feature association
2406.02021 Report MetaMixer Is All You Need Seokju Yun, Dongheon Lee, Youngmin Ro Transformer, composed of self-attention and Feed-Forward Network, has revolutionized the landscape of network design across various vision tasks. FFN is a versatile operator seamlessly integrated into nearly all AI models to effectively harness rich representations. Recent works also show that FFN functions like key-value memories. Thus, akin to the query-key-value mechanism within self-attention, FFN can be viewed as a memory network, where the input serves as query and the two projection weights operate as keys and values, respectively. We hypothesize that the importance lies in query-key-value framework itself rather than in self-attention. To verify this, we propose converting self-attention into a more FFN-like efficient token mixer with only convolutions while retaining query-key-value framework, namely FFNification. Specifically, FFNification replaces query-key and attention coefficient-value interactions with large kernel convolutions and adopts GELU activation function instead of softmax. The derived token mixer, FFNified attention, serves as key-value memories for detecting locally distributed spatial patterns, and operates in the opposite dimension to the ConvNeXt block within each corresponding sub-operation of the query-key-value framework. Building upon the above two modules, we present a family of Fast-Forward Networks. Our FFNet achieves remarkable performance improvements over previous state-of-the-art methods across a wide range of tasks. The strong and general performance of our proposed method validates our hypothesis and leads us to introduce MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework. We show that using only simple operations like convolution and GELU in the MetaMixer can achieve superior performance. The paper introduces MetaMixer, a general mixer architecture based on the query-key-value framework, and proposes FFNification, a process that adapts self-attention to be more efficient by incorporating design elements from Feed-Forward Networks (FFNs). This work aims to shift the focus from specific modules like self-attention to a more general understanding of mixer design, emphasizing the importance of the query-key-value framework in achieving high performance across various tasks. The authors analyze FFNs in vision models, demonstrating their function as key-value memories. They then propose FFNification, which replaces expensive operations in self-attention with more efficient alternatives like depthwise convolution and GELU activation. They further validate the efficacy of the MetaMixer framework by introducing Fast-Forward Network (FFNet), a family of models built using FFNified attention and ConvNeXt blocks, and evaluate its performance on diverse tasks including image classification, object detection, semantic segmentation, super-resolution, 3D semantic segmentation, and time series forecasting. FFNet models achieve state-of-the-art performance across a wide range of tasks, demonstrating a superior performance-speed trade-off compared to both transformer-based and convolution-based methods. The use of large-kernel depthwise convolution in FFNified attention enables efficient context aggregation and broader Effective Receptive Fields (ERFs), leading to improved performance. FFNet exhibits strong robustness, outperforming existing models on benchmark datasets designed to test generalization capabilities. The effectiveness of MetaMixer-based convolutional mixers in large-scale datasets and generative modeling remains unproven. While the proposed method offers a novel perspective on mixer design, it leverages recent advancements, and future work should explore its full potential in a broader range of scenarios. metamixer, ffnification, query-key-value framework, convolutional mixer, deep learning
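A rough PyTorch reading of the FFNified token mixer described above: the query-key interaction becomes a large-kernel depthwise convolution, softmax becomes GELU, and the coefficient-value interaction becomes another large-kernel convolution, so the block keeps the query-key-value template while using only convolutions. Layer names, kernel size, and the residual wiring are illustrative, not the released FFNet code.

import torch
import torch.nn as nn

class FFNifiedAttention(nn.Module):
    """Query-key-value style token mixer built only from convolutions and GELU."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        self.query_key = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)   # query-key interaction as large-kernel depthwise conv
        self.act = nn.GELU()                                                          # replaces softmax
        self.coeff_value = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)  # coefficient-value interaction as another large-kernel conv

    def forward(self, x):
        return x + self.coeff_value(self.act(self.query_key(x)))                      # residual token mixing

# toy usage on a 32x32 feature map with 64 channels
mixer = FFNifiedAttention(64)
print(mixer(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])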
2406.01970 Report The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, Minhao Cheng Diffusion models have achieved remarkable success in text-to-image generation tasks; however, the role of initial noise has been rarely explored. In this study, we identify specific regions within the initial noise image, termed trigger patches, that play a key role for object generation in the resulting images. Notably, these patches are ``universal'' and can be generalized across various positions, seeds, and prompts. To be specific, extracting these patches from one noise and injecting them into another noise leads to object generation in targeted areas. We identify these patches by analyzing the dispersion of object bounding boxes across generated images, leading to the development of a posterior analysis technique. Furthermore, we create a dataset consisting of Gaussian noises labeled with bounding boxes corresponding to the objects appearing in the generated images and train a detector that identifies these patches from the initial noise. To explain the formation of these patches, we reveal that they are outliers in Gaussian noise, and follow distinct distributions through two-sample tests. Finally, we find the misalignment between prompts and the trigger patch patterns can result in unsuccessful image generations. The study proposes a reject-sampling strategy to obtain optimal noise, aiming to improve prompt adherence and positional diversity in image generation. This paper discovers and leverages "trigger patches" – specific regions in the initial noise of diffusion models that strongly influence object location in generated images. This work provides a new understanding of how diffusion models work and offers a way to improve control over image generation, addressing limitations in adhering to prompt instructions. The authors first use a posterior analysis method, calculating "trigger entropy" to quantify object position consistency across images generated from the same noise. Then, they train a detector directly on noise to identify trigger patches, achieving promising results. They further investigate the nature of trigger patches, hypothesizing and verifying that they are outliers in the Gaussian noise distribution. Trigger patches exist: Specific patches in the initial noise consistently lead to object generation at their corresponding locations across different prompts. Trigger patches can be detected directly from noise: A trained detector shows promising performance in identifying these patches without running the diffusion process. Trigger patches are outliers: They deviate significantly from the standard Gaussian distribution of initial noise. The paper hasn't fully explored the case of multiple trigger patches within a single noise. The dataset used for analysis is limited to five object classes and 25 prompts. diffusion models, image generation, object detection, trigger patches, positional bias
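The "trigger patches are outliers" claim can be probed with a simple statistic on the initial noise: slide a window over the latent and flag windows whose mean is far from what i.i.d. N(0,1) noise predicts. The sketch below is a generic outlier scan in that spirit (window size and threshold are arbitrary), not the paper's trained detector or its two-sample tests.

import numpy as np

def scan_noise_for_outlier_patches(noise, patch=8, z_thresh=3.0):
    """Flag patches whose mean deviates from the N(0,1) expectation.

    For a patch of p*p i.i.d. standard-normal values, the patch mean has
    standard deviation 1/p, so |mean| * p is an approximate z-score."""
    H, W = noise.shape
    hits = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            block = noise[y:y + patch, x:x + patch]
            z = abs(block.mean()) * patch
            if z > z_thresh:
                hits.append((y, x, float(z)))
    return hits

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
noise[8:16, 8:16] += 0.8                         # plant a synthetic "trigger-like" bump
print(scan_noise_for_outlier_patches(noise))     # the planted patch should show up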
2406.01956 Report Enhance Image-to-Image Generation with LLaVA Prompt and Negative Prompt Zhicheng Ding, Panfeng Li, Qikai Yang, Siyang Li This paper presents a novel approach to enhance image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). We propose a framework where LLaVA analyzes input images and generates textual descriptions, hereinafter LLaVA-generated prompts. These prompts, along with the original image, are fed into the image-to-image generation pipeline. This enriched representation guides the generation process towards outputs that exhibit a stronger resemblance to the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in promoting image similarity. We observe a significant improvement in the visual coherence between the generated and input images compared to traditional methods. Future work will explore fine-tuning LLaVA prompts for increased control over the creative process. By providing more specific details within the prompts, we aim to achieve a delicate balance between faithfulness to the original image and artistic expression in the generated outputs. This paper proposes a novel framework that enhances image-to-image generation by incorporating LLaVA-generated prompts into Stable Diffusion, resulting in outputs that exhibit a stronger resemblance to the input image. Relying solely on input images for generation can lead to deviations from user intent. This framework addresses these limitations by leveraging LLaVA's image understanding to create more accurate and detailed prompts, enhancing control and fidelity in image generation. The input image is analyzed by LLaVA to generate textual descriptions (prompts). These prompts, along with the original image, are fed into Stable Diffusion to guide the generation process toward outputs that closely resemble the input. LLaVA-generated prompts significantly improve visual coherence between generated and input images compared to traditional methods. Quantitative image similarity metrics (RMSE, PSNR, FSIM, SSIM, UIQ, SRE) confirm that LLaVA-generated prompts lead to the generation of more similar images. Extensive experiments across various scenarios consistently demonstrate the effectiveness of the proposed approach in enhancing image similarity. Limitations in LLaVA's negative prompt generation accuracy require further investigation. Future work will explore fine-tuning LLaVA prompts to achieve a balance between faithfulness to the original image and artistic expression. image-to-image generation, llava, stable diffusion, multimodal prompt generation, image similarity
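Wiring a caption model into an image-to-image pipeline is straightforward with off-the-shelf tooling. The sketch below uses the diffusers StableDiffusionImg2ImgPipeline and a hypothetical caption_with_llava() helper standing in for a LLaVA checkpoint; in the paper both the prompt and the negative prompt come from LLaVA, whereas here the negative prompt, model ID, and strength are illustrative choices rather than the paper's settings.

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

def caption_with_llava(image):
    """Hypothetical helper: in practice, run a LLaVA checkpoint on the image and
    return its textual description; a fixed string stands in here."""
    return "a photo of the input scene, detailed, natural lighting"

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

init_image = Image.open("input.jpg").convert("RGB").resize((512, 512))
prompt = caption_with_llava(init_image)             # LLaVA-generated prompt describing the input
negative_prompt = "blurry, low quality, distorted"  # illustrative negative prompt

out = pipe(prompt=prompt, negative_prompt=negative_prompt,
           image=init_image, strength=0.6, guidance_scale=7.5).images[0]
out.save("output.jpg")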
2406.01954 Report Plug-and-Play Diffusion Distillation Yi-Ting Hsiao, Siavash Khodadadeh, Kevin Duarte, Wei-An Lin, Hui Qu, Mingi Kwon, Ratheesh Kalarot Diffusion models have shown tremendous results in image generation. However, due to the iterative nature of the diffusion process and its reliance on classifier-free guidance, inference times are slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen. We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, and only requires 1\% trainable parameters of the base model. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without the need for additional training: this "plug-and-play" functionality drastically improves inference computation while maintaining the visual fidelity of generated images. Empirically, we show that our approach is able to produce visually appealing results and achieve a comparable FID score to the teacher with as few as 8 to 16 steps. This paper introduces a novel distillation approach for guided diffusion models, using an external lightweight guide model trained alongside a frozen text-to-image model, effectively reducing inference computation without compromising image quality. Diffusion models, while powerful in image generation, suffer from slow inference times due to their iterative process and reliance on classifier-free guidance. This work addresses this limitation by significantly reducing computational cost and preserving the advantages of the base model. The method involves training a lightweight guide model that takes guidance values, time and text embeddings, and latent image representations as input. This model injects feature maps into the decoder of the original diffusion model to guide image generation. Two guide model architectures are explored: one based on ControlNet and a simplified 'tiny' version. The method is further enhanced by incorporating sampling steps distillation, progressively reducing the steps required for high-quality image generation. The proposed approach reduces inference computation for classifier-free guided latent-space diffusion models by almost half, using only 1% of the base model's trainable parameters. The guide model, once trained, can be applied to various fine-tuned, domain-specific versions of the base diffusion model without requiring additional training, enabling a 'plug-and-play' functionality. Visualizations of the guide model's feature map injections provide insights into how classifier-free guidance influences image generation at different timesteps. Unlike classifier-free guidance, the proposed approach may be less efficient when running in batches due to the parallel execution of the U-Net and the guide module. Future work could explore the application of this distillation method to pixel-based diffusion models. diffusion models, distillation, image generation, classifier-free guidance, inference time reduction
2406.01900 Report Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen We present Follow-Your-Emoji, a diffusion-based framework for portrait animation, which animates a reference portrait with target landmark sequences. The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity. To address these challenges, Follow-Your-Emoji equipped the powerful Stable Diffusion model with two well-designed technologies. Specifically, we first adopt a new explicit motion signal, namely expression-aware landmark, to guide the animation process. We discover this landmark can not only ensure the accurate motion alignment between the reference portrait and target motion during inference but also increase the ability to portray exaggerated expressions (i.e., large pupil movements) and avoid identity leakage. Then, we propose a facial fine-grained loss to improve the model's ability of subtle expression perception and reference portrait appearance reconstruction by using both expression and facial masks. Accordingly, our method demonstrates significant performance in controlling the expression of freestyle portraits, including real humans, cartoons, sculptures, and even animals. By leveraging a simple and effective progressive generation strategy, we extend our model to stable long-term animation, thus increasing its potential application value. To address the lack of a benchmark for this field, we introduce EmojiBench, a comprehensive benchmark comprising diverse portrait images, driving videos, and landmarks. We show extensive evaluations on EmojiBench to verify the superiority of Follow-Your-Emoji. Follow-Your-Emoji, a diffusion-based framework for portrait animation, enables animating diverse reference portraits (e.g., humans, cartoons, sculptures, animals) using target landmark sequences while preserving identity and achieving high fidelity. Existing methods struggle to maintain identity and generate high-quality animations, particularly for uncommon portrait styles and subtle expressions. The framework utilizes: (1) Expression-aware landmarks for accurate motion alignment and exaggerated expression portrayal; (2) Facial fine-grained loss to enhance facial appearance and expression generation; (3) Progressive generation strategy for long-term animation stability. Follow-Your-Emoji effectively animates portraits in diverse styles with accurate expression transfer and identity preservation. The proposed expression-aware landmarks and facial fine-grained loss improve animation quality, especially for subtle expressions. Quantitative and qualitative evaluations on EmojiBench demonstrate the superiority of Follow-Your-Emoji over existing methods. The reliance on MediaPipe for landmark detection can be limiting for certain portrait styles. Future work includes exploring alternative landmark detection methods and further improving long-term animation coherence. portrait animation, diffusion models, expression-aware landmarks, facial fine-grained loss, emojibench
2406.01733 Report Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang Diffusion Transformers have recently demonstrated unprecedented generative capabilities for various tasks. The encouraging results, however, come with the cost of slow inference, since each denoising step requires inference on a transformer model with a large scale of parameters. In this study, we make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through introducing a caching mechanism, can be readily removed even without updating the model parameters. In the case of U-ViT-H/2, for example, we may remove up to 93.68% of the computation in the cache steps (46.84% for all steps), with less than 0.01 drop in FID. To achieve this, we introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Specifically, by leveraging the identical structure of layers in transformers and the sequential nature of diffusion, we explore redundant computations between timesteps by treating each layer as the fundamental unit for caching. To address the challenge of the exponential search space in deep models for identifying layers to cache and remove, we propose a novel differentiable optimization objective. An input-invariant yet timestep-variant router is then optimized, which can finally produce a static computation graph. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed. This paper introduces Learning-to-Cache (L2C), a novel caching mechanism to accelerate inference for diffusion transformers, exploiting layer redundancy across different timesteps. Diffusion transformers excel in generative tasks but suffer from slow inference due to their large-scale parameter inference at each denoising step. L2C aims to accelerate this process without compromising image quality. L2C leverages the identical layer structure in transformers and the sequential nature of diffusion to identify redundant computations. It employs a differentiable optimization objective to learn an input-invariant but timestep-variant router, enabling a static computation graph for efficient layer caching. L2C significantly outperforms samplers with fewer steps (DDIM, DPM-Solver) and prior cache-based methods at the same inference speed. Experiments on DiT and U-ViT show that a large proportion of layers (up to 93.68% for U-ViT-H/2) can be cached with negligible FID degradation (<0.01). The learned caching patterns reveal distinct sparsity for DiT and U-ViT, suggesting architectural variations influence layer redundancy in diffusion transformers. The effectiveness of L2C is dependent on the trained diffusion model architecture, limiting its generalizability. The current L2C implementation is capped at 2x speedup due to the two-step inference scheme, requiring further development for higher acceleration. diffusion models, transformers, inference acceleration, caching, generative models
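The mechanism itself is a per-(timestep, layer) reuse decision. A minimal sketch of that caching loop, assuming a precomputed boolean router table and a list of residual transformer blocks; this shows the general idea, not the learned router or its differentiable training objective.

import torch
import torch.nn as nn

def denoise_with_layer_cache(blocks, router, x, num_steps):
    """router[t][l] == True means: at step t, skip block l and reuse its cached update.

    Caching the residual (block output minus input) lets a skipped block still
    contribute its previous-step update to the current features."""
    cached_delta = [None] * len(blocks)
    for t in range(num_steps):
        for l, block in enumerate(blocks):
            if router[t][l] and cached_delta[l] is not None:
                x = x + cached_delta[l]          # reuse last step's computation
            else:
                out = block(x)
                cached_delta[l] = out - x        # refresh the cache
                x = out
    return x

# toy usage: 4 tiny residual blocks, cache every block on odd steps
blocks = nn.ModuleList([nn.Sequential(nn.Linear(16, 16), nn.GELU()) for _ in range(4)])
wrapped = [lambda x, b=b: x + b(x) for b in blocks]
router = [[t % 2 == 1 for _ in range(4)] for t in range(6)]
print(denoise_with_layer_cache(wrapped, router, torch.randn(2, 16), 6).shape)

The paper's contribution is learning which (timestep, layer) entries of such a router can be switched on with negligible FID cost; the loop above only shows where the saved computation comes from.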
2406.01595 Report MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos. This paper introduces MultiPly, a novel framework for reconstructing detailed 3D human models of multiple people from in-the-wild monocular videos. Reconstructing multiple interacting individuals in 3D from monocular videos is crucial for applications like AR/VR and 4D social activity replay but remains a challenging task due to occlusions, complex dynamics, and depth ambiguities. MultiPly utilizes a layered neural representation for the scene, combining individual human and background models. It leverages layer-wise differentiable volume rendering for learning and a hybrid instance segmentation approach combining self-supervised 3D and promptable 2D segmentation (using SAM). A confidence-guided optimization strategy alternates between optimizing pose and shape/appearance based on per-frame confidence. MultiPly outperforms state-of-the-art methods in multi-person 3D reconstruction from monocular video, showing significant improvements in metrics like V-IoU and Chamfer distance. The proposed framework achieves superior novel view synthesis results compared to existing methods, generating sharper images with fewer artifacts. MultiPly demonstrates robust instance segmentation capabilities, surpassing baseline methods in accuracy, particularly in scenes with close human interaction. The model's complexity scales linearly with the number of people, limiting its efficiency for crowded scenes. The current method does not explicitly model hands, presenting an opportunity for future work by integrating expressive hand models like SMPL-X. 3d human reconstruction, multi-person, monocular video, neural implicit representation, instance segmentation
2406.01594 Report DiffUHaul: A Training-Free Method for Object Dragging in Images Omri Avrahami, Rinon Gal, Gal Chechik, Ohad Fried, Dani Lischinski, Arash Vahdat, Weili Nie Text-to-image diffusion models have proven effective for solving many image editing tasks. However, the seemingly straightforward task of seamlessly relocating objects within a scene remains surprisingly challenging. Existing methods addressing this problem often struggle to function reliably in real-world scenarios due to lacking spatial reasoning. In this work, we propose a training-free method, dubbed DiffUHaul, that harnesses the spatial understanding of a localized text-to-image model, for the object dragging task. Blindly manipulating layout inputs of the localized model tends to cause low editing performance due to the intrinsic entanglement of object representation in the model. To this end, we first apply attention masking in each denoising step to make the generation more disentangled across different objects and adopt the self-attention sharing mechanism to preserve the high-level object appearance. Furthermore, we propose a new diffusion anchoring technique: in the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance; in the later denoising steps, we pass the localized features from the source images to the interpolated images to retain fine-grained object details. To adapt DiffUHaul to real-image editing, we apply a DDPM self-attention bucketing that can better reconstruct real images with the localized model. Finally, we introduce an automated evaluation pipeline for this task and showcase the efficacy of our method. Our results are reinforced through a user preference study. Proposes DiffUHaul, a training-free method for dragging objects in images using the spatial understanding of a localized text-to-image model (BlobGEN). Addresses the challenge of seamlessly relocating objects in images, a task that remains difficult for existing image editing techniques. Utilizes BlobGEN's spatial understanding and introduces: (1) Gated self-attention masking to improve disentanglement, (2) Soft anchoring mechanism for fusing source object appearance with target location, (3) DDPM self-attention bucketing for real image editing. Achieves superior object dragging performance compared to baselines, quantitatively and qualitatively. Demonstrates robustness in avoiding object traces, a common issue in other methods. Preferred by human evaluators in a user study for its effectiveness and realism. Limitations in handling object rotation, resizing, and collisions. Future work includes addressing these limitations and exploring applications in other creative tasks. object dragging, image editing, diffusion models, localized text-to-image generation, attention mechanisms
2406.01593 Report Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting Shaojie Ma, Yawei Luo, Yi Yang 3D reconstruction and simulation, while interrelated, have distinct objectives: reconstruction demands a flexible 3D representation adaptable to diverse scenes, whereas simulation requires a structured representation to model motion principles effectively. This paper introduces the Mesh-adsorbed Gaussian Splatting (MaGS) method to resolve such a dilemma. MaGS constrains 3D Gaussians to hover on the mesh surface, creating a mutual-adsorbed mesh-Gaussian 3D representation that combines the rendering flexibility of 3D Gaussians with the spatial coherence of meshes. Leveraging this representation, we introduce a learnable Relative Deformation Field (RDF) to model the relative displacement between the mesh and 3D Gaussians, extending traditional mesh-driven deformation paradigms that only rely on ARAP prior, thus capturing the motion of each 3D Gaussian more precisely. By joint optimizing meshes, 3D Gaussians, and RDF, MaGS achieves both high rendering accuracy and realistic deformation. Extensive experiments on the D-NeRF and NeRF-DS datasets demonstrate that MaGS can generate competitive results in both reconstruction and simulation. Proposes MaGS, a novel method that combines 3D Gaussian Splatting with mesh representations for unified 3D reconstruction and simulation of dynamic objects from monocular videos. Addresses the challenge of simultaneously achieving flexible 3D reconstruction and physically plausible simulations, which existing methods struggle to achieve within a single framework. Utilizes a two-stage approach: 1) extracts a static mesh and estimates deformation field from 3D Gaussians, 2) introduces mesh-adsorbed Gaussians and a learnable Relative Deformation Field (RDF) to model fine-grained motions while preserving spatial coherence. Achieves state-of-the-art results on D-NeRF and NeRF-DS datasets, demonstrating superior rendering quality and accuracy compared to existing methods. Enables realistic and user-interactive simulations like dragging by directly manipulating the mesh and propagating deformations to the adsorbed Gaussians. Ablation studies highlight the contribution of mesh-adsorbed Gaussians and RDF in improving reconstruction and simulation fidelity. Performance depends on the accuracy of the initial mesh, posing challenges for low-resolution images or limited viewing angles. Future work includes extending MaGS to handle topology changes and incorporating physical priors for more realistic simulations. 3d reconstruction, 3d simulation, gaussian splatting, mesh deformation, dynamic scenes
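The "mesh-adsorbed" part boils down to parameterizing each Gaussian center by a host triangle of the mesh plus a small relative displacement, so mesh deformation drags the Gaussians along. A minimal numpy sketch of that binding (barycentric coordinates plus a normal offset; the exact displacement parameterization in the paper's Relative Deformation Field is richer than this):

import numpy as np

def gaussian_centers_from_mesh(vertices, faces, bary, normal_offset):
    """Place each Gaussian on its host triangle via barycentric coords, then push along the face normal.

    vertices: (V, 3), faces: (G, 3) triangle indices, bary: (G, 3), normal_offset: (G,)."""
    tri = vertices[faces]                                  # (G, 3, 3) triangle corner positions
    surface_pts = (bary[..., None] * tri).sum(axis=1)      # barycentric interpolation on the surface
    n = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    return surface_pts + normal_offset[:, None] * n        # relative displacement off the surface

# toy usage: two Gaussians adsorbed on one triangle
V = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
F = np.array([[0, 1, 2], [0, 1, 2]])
B = np.array([[1 / 3, 1 / 3, 1 / 3], [0.5, 0.25, 0.25]])
print(gaussian_centers_from_mesh(V, F, B, np.array([0.0, 0.05])))

When the mesh vertices move, recomputing with the same (faces, bary, offset) drags the Gaussians along, which is what lets mesh-level edits such as dragging drive the splats.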
2406.01592 Report Text-guided Controllable Mesh Refinement for Interactive 3D Modeling Yun-Chun Chen, Selena Ling, Zhiqin Chen, Vladimir G. Kim, Matheus Gadelha, Alec Jacobson We propose a novel technique for adding geometric details to an input coarse 3D mesh guided by a text prompt. Our method is composed of three stages. First, we generate a single-view RGB image conditioned on the input coarse geometry and the input text prompt. This single-view image generation step allows the user to pre-visualize the result and offers stronger conditioning for subsequent multi-view generation. Second, we use our novel multi-view normal generation architecture to jointly generate six different views of the normal images. The joint view generation reduces inconsistencies and leads to sharper details. Third, we optimize our mesh with respect to all views and generate a fine, detailed geometry as output. The resulting method produces an output within seconds and offers explicit user control over the coarse structure, pose, and desired details of the resulting 3D mesh. Project page: https://text-mesh-refinement.github.io. This paper introduces a novel technique for refining coarse 3D meshes by adding geometric details guided by text prompts. Existing text-to-3D methods often lack control over the generated shape's structure, limiting their utility for artists. This method allows for detailed 3D mesh creation while maintaining control over both global structure and local details. The method employs a three-stage process: 1) generating a single-view RGB preview image from the input mesh and text, 2) using a novel multi-view ControlNet to generate consistent normal images from multiple viewpoints guided by the preview image and input mesh, and 3) refining the input mesh based on the generated multi-view normals. The method produces high-quality 3D meshes with better geometric details than state-of-the-art methods, as demonstrated by quantitative and subjective evaluations. The method offers control over the level of detail and pose of the final mesh. The method is significantly faster (at least 90x) than competing methods due to its reliance on feed-forward networks and direct mesh optimization. The limited number of views and image resolution used during training restricts the level of detail achievable. Mesh refinement relies on an external image segmentation model, potentially introducing artifacts if the segmentation is inaccurate. 3d mesh refinement, text-guided generation, multi-view controlnet, differentiable rendering, interactive 3d modeling
2406.01584 Report SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGBT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks. Code, dataset, and benchmark will be released at https://www.anjiecheng.me/SpatialRGPT Spatial Region GPT (SpatialRGPT) enhances the spatial reasoning abilities of Vision Language Models (VLMs) by incorporating a region representation module and a flexible plugin for depth information. Existing VLMs struggle with spatial reasoning tasks, limiting their application in fields like robotics and augmented reality where precise spatial awareness is crucial. The authors introduce: (1) a data curation pipeline to build 3D scene graphs from 2D images, generating region-aware spatial reasoning QAs; (2) a novel VLM architecture integrating depth information through a plugin module; and (3) SpatialRGBT-Bench, a benchmark with ground-truth 3D annotations for evaluating 3D spatial cognition in VLMs. SpatialRGPT significantly outperforms existing VLMs on the newly introduced SpatialRGBT-Bench, demonstrating superior spatial reasoning capabilities. The model effectively generalizes its learned spatial knowledge to real-world applications, functioning as a region-aware dense reward annotator for robotics. SpatialRGPT exhibits proficiency in complex spatial reasoning tasks, surpassing the capabilities of current leading vision-language models like GPT-4V. The current implementation uses axis-aligned bounding boxes, which can be less accurate than oriented bounding boxes in estimating object dimensions, especially for partially elevated objects. Future work could explore integrating object pose estimation to improve the accuracy of object representation and spatial reasoning. vision language models, spatial reasoning, 3d scene understanding, region-aware representation, depth information
2406.01583 Report Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi Recent works have explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations. This paper presents a general framework for interpreting vision transformers (ViTs) by decomposing representations into contributions from individual components (like attention heads) and mapping them to CLIP space for text-based interpretation. Understanding how ViTs process information and which components contribute to specific image features is crucial for improving their interpretability and reliability. The framework utilizes: 1) **AutoDecompose:** An algorithm that automatically decomposes representations into component contributions by traversing the model's computational graph. 2) **CompAlign:** A method for mapping component contributions to CLIP's image representation space using trained linear maps, allowing for text-based interpretation via CLIP's text encoder. 3) **Scoring Function:** A novel function that quantifies the importance of each component for specific image features. ImageNet-trained ViTs exhibit significant redundancy, with multiple layers encoding similar features. The scoring function successfully ranks components based on their relevance to specific features, enabling applications like targeted image retrieval and token importance visualization. The framework allows for zero-shot mitigation of spurious correlations in datasets like Waterbirds by ablating components highly associated with confounding factors. The analysis primarily considers direct contributions from the last few layers and doesn't fully explore indirect contributions or finer component decompositions. Future work could investigate higher-order contributions and more granular decompositions, potentially identifying specific directions or subspaces within component contributions strongly associated with certain properties. vision transformers, interpretability, clip, representation learning, feature attribution
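A minimal sketch of the CompAlign-style idea: fit a linear map from one component's contribution vectors to CLIP image embeddings, then score that component against a text description via cosine similarity. All tensors, dimensions, and the training loop below are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_vit, d_clip, n_images = 768, 512, 1000

# Per-image contribution of one component (e.g. an attention head) to the
# final ViT representation, and CLIP image embeddings of the same images.
component_contrib = torch.randn(n_images, d_vit)
clip_image_feats = F.normalize(torch.randn(n_images, d_clip), dim=-1)

# Fit a linear map from the component's space into CLIP's image space.
linear_map = nn.Linear(d_vit, d_clip)
opt = torch.optim.Adam(linear_map.parameters(), lr=1e-3)
for _ in range(200):
    pred = F.normalize(linear_map(component_contrib), dim=-1)
    loss = (1 - (pred * clip_image_feats).sum(dim=-1)).mean()  # cosine distance
    opt.zero_grad()
    loss.backward()
    opt.step()

# Score the component for a text-described feature: mean cosine similarity
# between mapped contributions and the CLIP text embedding of that feature.
text_feat = F.normalize(torch.randn(d_clip), dim=-1)  # stand-in for CLIP text encoder output
with torch.no_grad():
    mapped = F.normalize(linear_map(component_contrib), dim=-1)
    score = (mapped @ text_feat).mean()
print(f"component relevance score: {score.item():.3f}")
```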
2406.01579 Report Tetrahedron Splatting for 3D Generation Chun Gu, Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang 3D representation is essential to the significant advance of 3D generation with 2D diffusion priors. As a flexible representation, NeRF has been first adopted for 3D representation. With density-based volumetric rendering, it however suffers both intensive computational overhead and inaccurate mesh extraction. Using a signed distance field and Marching Tetrahedra, DMTet allows for precise mesh extraction and real-time rendering but is limited in handling large topological changes in meshes, leading to optimization challenges. Alternatively, 3D Gaussian Splatting (3DGS) is favored in both training and rendering efficiency while falling short in mesh extraction. In this work, we introduce a novel 3D representation, Tetrahedron Splatting (TeT-Splatting), that supports easy convergence during optimization, precise mesh extraction, and real-time rendering simultaneously. This is achieved by integrating surface-based volumetric rendering within a structured tetrahedral grid while preserving the desired ability of precise mesh extraction, and a tile-based differentiable tetrahedron rasterizer. Furthermore, we incorporate eikonal and normal consistency regularization terms for the signed distance field to improve generation quality and stability. Critically, our representation can be trained without mesh extraction, making the optimization process easier to converge. Our TeT-Splatting can be readily integrated in existing 3D generation pipelines, along with polygonal mesh for texture optimization. Extensive experiments show that our TeT-Splatting strikes a superior tradeoff among convergence speed, render efficiency, and mesh quality as compared to previous alternatives under varying 3D generation settings. This paper introduces Tetrahedron Splatting (TeT-Splatting), a novel 3D representation for 3D generation that leverages volumetric rendering within a structured tetrahedral grid. Existing 3D representations for 3D generation face trade-offs between convergence speed, render efficiency, and mesh quality. This work aims to address these limitations and enable high-fidelity 3D generation. The method integrates surface-based volumetric rendering into a tetrahedral grid, enabling precise mesh extraction through Marching Tetrahedra. It employs a tile-based fast differentiable rasterizer for real-time rendering and incorporates eikonal and normal consistency regularization for improved generation quality. TeT-Splatting demonstrates superior trade-off among convergence speed, render efficiency, and mesh quality compared to alternatives like Instant-NGP, DMTet, and 3DGS. The method achieves rapid and stable convergence in 3D generation tasks, effectively handling topological changes, unlike DMTet. Evaluations with both vanilla and rich diffusion priors show TeT-Splatting produces high-fidelity 3D content with detailed geometries and textures. TeT-Splatting struggles with modeling high-frequency features due to the limitations of using tetrahedra as rendering primitives. The implemented rasterizer's rendering speed, although real-time, is slower than 3DGS and could be further improved. 3d generation, 3d representation, tetrahedron splatting, volumetric rendering, diffusion models
2406.01561 Report Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation Mingyuan Zhou, Zhendong Wang, Huangjie Zheng, Hai Huang Diffusion-based text-to-image generation models trained on extensive text-image pairs have shown the capacity to generate photorealistic images consistent with textual descriptions. However, a significant limitation of these models is their slow sample generation, which requires iterative refinement through the same network. In this paper, we enhance Score identity Distillation (SiD) by developing long and short classifier-free guidance (LSG) to efficiently distill pretrained Stable Diffusion models without using real training data. SiD aims to optimize a model-based explicit score matching loss, utilizing a score-identity-based approximation alongside the proposed LSG for practical computation. By training exclusively with fake images synthesized with its one-step generator, SiD equipped with LSG rapidly improves FID and CLIP scores, achieving state-of-the-art FID performance while maintaining a competitive CLIP score. Specifically, its data-free distillation of Stable Diffusion 1.5 achieves a record low FID of 8.15 on the COCO-2014 validation set, with a CLIP score of 0.304 at an LSG scale of 1.5, and a FID of 9.56 with a CLIP score of 0.313 at an LSG scale of 2. We will make our PyTorch implementation and distilled Stable Diffusion one-step generators available at https://github.com/mingyuanzhou/SiD-LSG This paper introduces a novel method combining Classifier-Free Guidance (CFG) with Score Identity Distillation (SiD) to effectively distill Stable Diffusion models into one-step generators, using only synthesized fake images. Diffusion models, while powerful for text-to-image generation, are computationally expensive due to their iterative nature. This work addresses this limitation by enabling fast, one-step generation without sacrificing performance. The study introduces "long and short guidance" (LSG) strategies for injecting CFG into SiD. It explores enhancing CFG for the teacher network, reducing it for the student network, and a combined approach for optimized FID and CLIP score balance. The proposed SiD-LSG achieves state-of-the-art FID scores among one-step distillation methods on the COCO-2014 dataset. The method demonstrates successful distillation of both SD 1.5 and 2.1-base, achieving FID scores as low as 9.56 and 10.97, respectively, while maintaining competitive CLIP scores. A record low FID of 8.15 is achieved with SD1.5 distillation by reducing the guidance scale and extending training time, outperforming even the teacher model. The current SiD-LSG implementation shows limitations in reaching the full text-image alignment capabilities of the teacher model, suggesting future exploration of multi-step generation or model size increase. While FP16 mixed precision accelerates training, it currently limits achieving the lowest FID and highest CLIP scores compared to FP32, necessitating further optimization research. text-to-image generation, diffusion models, model distillation, classifier-free guidance, stable diffusion
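The long-and-short guidance scheduling itself is specific to the paper; the primitive it builds on is standard classifier-free guidance, sketched below. `denoiser`, `text_emb`, and `null_emb` are placeholder names, and the toy call at the end only illustrates the shapes involved.

```python
import torch

def cfg_noise_prediction(denoiser, x_t, t, text_emb, null_emb, guidance_scale=1.5):
    """Plain classifier-free guidance on a noise-prediction network.

    `denoiser(x, t, cond)` is a placeholder; in SiD-LSG the question is where
    (teacher, fake-score network, or both) and at what scale this guidance is
    injected, which is not reproduced here.
    """
    eps_cond = denoiser(x_t, t, text_emb)    # text-conditional prediction
    eps_uncond = denoiser(x_t, t, null_emb)  # unconditional prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# toy call with a stand-in denoiser
toy_denoiser = lambda x, t, c: 0.1 * x + c
x = torch.randn(1, 4, 64, 64)
out = cfg_noise_prediction(toy_denoiser, x, torch.tensor([500]), torch.tensor(1.0), torch.tensor(0.0))
print(out.shape)  # torch.Size([1, 4, 64, 64])
```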
2406.01493 Report Learning Temporally Consistent Video Depth from Video Diffusion Priors Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Matteo Poggi, Yiyi Liao This work addresses the challenge of video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of directly developing a depth estimator from scratch, we reformulate the prediction task into a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models, thereby reducing learning difficulty and enhancing generalizability. Concretely, we study how to tame the public Stable Video Diffusion (SVD) to predict reliable depth from input videos using a mixture of image depth and video depth datasets. We empirically confirm that a procedural training strategy -- first optimizing the spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen -- yields the best results in terms of both spatial accuracy and temporal consistency. We further examine the sliding window strategy for inference on arbitrarily long videos. Our observations indicate a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results. Extensive experimental results demonstrate the superiority of our approach, termed ChronoDepth, over existing alternatives, particularly in terms of the temporal consistency of the estimated depth. Additionally, we highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis. Our project page is available at https://jhaoshao.github.io/ChronoDepth/. This paper introduces ChronoDepth, a novel video depth estimation method that prioritizes temporal consistency by leveraging pre-trained video generation models (specifically, Stable Video Diffusion). Temporal consistency in video depth estimation is crucial for eliminating flickering artifacts and ensuring realistic 3D applications, yet current methods struggle to achieve both temporal consistency and spatial accuracy. The authors reformulate depth estimation as a conditional denoising diffusion generation task. They propose a two-stage fine-tuning strategy: optimizing spatial layers with single-frame depths, then freezing them and optimizing temporal layers using randomly-sized video clips. For inference, a novel temporal inpaint strategy enhances consistency across clips. ChronoDepth achieves state-of-the-art temporal consistency on benchmark datasets, surpassing both image and video depth estimation methods. It maintains comparable spatial accuracy to state-of-the-art single-image depth estimators. ChronoDepth demonstrates superior performance in downstream applications like depth-conditioned video generation and novel view synthesis. The reliance on synthetic datasets for training might limit generalization to diverse real-world scenarios. Future work could explore larger and more varied datasets, as well as alternative video generation models. video depth estimation, temporal consistency, video diffusion models, stable video diffusion, conditional denoising diffusion
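A rough sketch of sliding-window inference with a one-frame overlap follows. The paper conditions each new clip on the overlapping frames via temporal inpainting; here that step is replaced by a simple scale-and-shift alignment on the shared frame, so the code is an assumption-laden stand-in rather than the actual strategy, and `predict_depth_clip` is a placeholder for the diffusion-based predictor.

```python
import numpy as np

def predict_depth_clip(frames):                      # placeholder depth predictor
    return np.random.rand(len(frames), 64, 64)

def sliding_window_depth(frames, clip_len=8, overlap=1):
    depths = list(predict_depth_clip(frames[:clip_len]))
    start = clip_len - overlap
    while start < len(frames):
        pred = predict_depth_clip(frames[start:start + clip_len])
        # align the new clip to the previous one on the overlapping frame(s)
        prev = np.stack(depths[-overlap:]).ravel()
        cur = pred[:overlap].ravel()
        A = np.stack([cur, np.ones_like(cur)], axis=1)
        scale, shift = np.linalg.lstsq(A, prev, rcond=None)[0]
        pred = scale * pred + shift
        depths.extend(pred[overlap:])
        start += clip_len - overlap
    return np.stack(depths[:len(frames)])

frames = [np.zeros((64, 64, 3)) for _ in range(30)]
print(sliding_window_depth(frames).shape)            # (30, 64, 64)
```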
2406.01476 Report DreamPhysics: Learning Physical Properties of Dynamic 3D Gaussians with Video Diffusion Priors Tianyu Huang, Yihan Zeng, Hui Li, Wangmeng Zuo, Rynson W. H. Lau Dynamic 3D interaction has witnessed great interest in recent works, while creating such 4D content remains challenging. One solution is to animate 3D scenes with physics-based simulation, and the other is to learn the deformation of static 3D objects with the distillation of video generative models. The former requires assigning precise physical properties to the target object; otherwise, the simulated results become unnatural. The latter tends to formulate the video with minor motions and discontinuous frames, due to the absence of physical constraints in deformation learning. We observe that video generative models, trained on real-world captured data, are capable of judging physical phenomena in simulation environments. To this end, we propose DreamPhysics in this work, which estimates physical properties of 3D Gaussian Splatting with video diffusion priors. DreamPhysics supports both image- and text-conditioned guidance, optimizing physical parameters via score distillation sampling with frame interpolation and log gradient. Based on a material point method simulator with proper physical parameters, our method can generate 4D content with realistic motions. Experimental results demonstrate that, by distilling the prior knowledge of video diffusion models, inaccurate physical properties can be gradually refined for high-quality simulation. Codes are released at: https://github.com/tyhuang0428/DreamPhysics. DreamPhysics, a novel framework, leverages video diffusion priors to estimate physical properties for dynamic 3D Gaussian Splatting (GS), enabling the generation of realistic 4D content. Creating dynamic 3D content with realistic physics remains challenging. Existing methods either rely on manual assignment of physical properties, leading to unnatural results, or learn deformation from video data lacking physical constraints, resulting in limited and unrealistic motion. DreamPhysics employs a Material Point Method (MPM) simulator to animate 3D GS scenes. It leverages Score Distillation Sampling (SDS) to optimize physical parameters based on video diffusion models' guidance, ensuring adherence to realistic physical behavior during animation. DreamPhysics effectively distills physical priors from video diffusion models, enabling accurate estimation of physical properties for 3D objects. The framework supports both image- and text-conditioned optimization, broadening its applicability. Compared to existing 4D generation methods, DreamPhysics achieves more realistic motion simulation and faster training. The range of simulated motions is currently limited, requiring further exploration of various physical constraints. Current evaluation metrics for simulated videos rely on visual quality, necessitating the development of physics-based metrics for more comprehensive assessment. 4d content generation, physics-based simulation, video diffusion models, 3d gaussian splatting, score distillation sampling
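Score distillation of a physical parameter can be sketched roughly as below. `simulate_and_render`, the toy denoiser, and the noise schedule are placeholders rather than the paper's MPM simulator and video diffusion prior; only the generic SDS-style surrogate loss is shown.

```python
import torch

def sds_step(theta, simulate_and_render, denoiser, cond, alphas, optimizer):
    x = simulate_and_render(theta)                       # differentiable w.r.t. theta
    t = torch.randint(1, len(alphas), (1,))
    a_t = alphas[t]
    eps = torch.randn_like(x)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * eps        # forward diffusion of the rendered video
    with torch.no_grad():
        eps_hat = denoiser(x_t, t, cond)                 # frozen diffusion prior
    w = 1 - a_t
    # surrogate loss whose gradient w.r.t. theta equals w * (eps_hat - eps) * dx/dtheta
    loss = (w * (eps_hat - eps).detach() * x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage with stand-ins for the simulator and the video diffusion model
theta = torch.nn.Parameter(torch.tensor(2.0))            # e.g. a stiffness-like parameter
opt = torch.optim.Adam([theta], lr=1e-2)
alphas = torch.linspace(0.999, 0.01, 1000)
sim = lambda th: th * torch.ones(1, 3, 8, 16, 16)        # fake "video" depending on theta
den = lambda x, t, c: 0.1 * x
sds_step(theta, sim, den, None, alphas, opt)
```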
2406.01467 Report RaDe-GS: Rasterizing Depth in Gaussian Splatting Baowen Zhang, Chuan Fang, Rakesh Shrestha, Yixun Liang, Xiaoxiao Long, Ping Tan Gaussian Splatting (GS) has proven to be highly effective in novel view synthesis, achieving high-quality and real-time rendering. However, its potential for reconstructing detailed 3D shapes has not been fully explored. Existing methods often suffer from limited shape accuracy due to the discrete and unstructured nature of Gaussian splats, which complicates the shape extraction. While recent techniques like 2D GS have attempted to improve shape reconstruction, they often reformulate the Gaussian primitives in ways that reduce both rendering quality and computational efficiency. To address these problems, our work introduces a rasterized approach to render the depth maps and surface normal maps of general 3D Gaussian splats. Our method not only significantly enhances shape reconstruction accuracy but also maintains the computational efficiency intrinsic to Gaussian Splatting. Our approach achieves a Chamfer distance error comparable to NeuraLangelo on the DTU dataset and similar training and rendering time as traditional Gaussian Splatting on the Tanks & Temples dataset. Our method is a significant advancement in Gaussian Splatting and can be directly integrated into existing Gaussian Splatting-based methods. This paper introduces RaDe-GS, a novel rasterized method for computing depth and normal maps of general 3D Gaussian splats, enhancing 3D shape reconstruction accuracy in Gaussian Splatting while maintaining its computational efficiency. Gaussian Splatting is efficient for novel view synthesis but struggles with accurate 3D shape reconstruction due to the discrete nature of Gaussian splats. Existing methods trying to address this compromise rendering quality and efficiency. The authors derive a closed-form solution for intersections of light rays and Gaussian splats, enabling efficient depth map calculation. They leverage the approximate affine projection to compute spatially varying depth within projected Gaussian splats, enabling rasterization for depth and normal map computation. RaDe-GS achieves a Chamfer distance error of 0.69 mm on the DTU dataset, comparable to NeuraLangelo and surpassing other Gaussian Splatting methods. The method maintains similar training and rendering time as traditional Gaussian Splatting on the Tanks & Temples dataset (around 17.8 minutes). It achieves high-quality novel view synthesis, outperforming other Gaussian Splatting methods in PSNR and perceptual metrics. Current TSDF fusion is limited to low-resolution voxel grids for large scenes, impacting surface extraction accuracy. Reconstruction of reflective surfaces is limited by the simple color function in 3D GS, potentially addressed by incorporating advanced color representations. gaussian splatting, 3d reconstruction, novel view synthesis, depth map estimation, rasterization
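The paper's full rasterized depth formulation is not reproduced here, but the closed-form fact such a per-splat depth can build on is simple: along a ray $o + t\,d$, the density of a 3D Gaussian with mean $\mu$ and covariance $\Sigma$ peaks at $t^* = \frac{d^\top \Sigma^{-1}(\mu - o)}{d^\top \Sigma^{-1} d}$. A small sketch:

```python
import torch

def gaussian_ray_depth(o, d, mu, cov):
    """Depth of the maximum-density point of a 3D Gaussian along the ray o + t*d.

    o, d, mu: (3,) tensors; cov: (3, 3) covariance of the Gaussian.
    This is the generic closed form, not the paper's full rasterizer.
    """
    cov_inv = torch.linalg.inv(cov)
    return d @ cov_inv @ (mu - o) / (d @ cov_inv @ d)

o = torch.zeros(3)
d = torch.tensor([0.0, 0.0, 1.0])
mu = torch.tensor([0.1, -0.2, 2.5])
cov = torch.diag(torch.tensor([0.05, 0.05, 0.2]))
print(gaussian_ray_depth(o, d, mu, cov))  # 2.5 for this axis-aligned case
```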
2406.01460 Report MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization Yu Zhang, Qi Zhang, Zixuan Gong, Yiwei Shi, Yepeng Liu, Duoqian Miao, Yang Liu, Ke Liu, Kun Yi, Wei Fan, Liang Hu, Changwei Wang Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies. However, CLIP faces a notable challenge in terms of inefficient data utilization. It relies on a single contrastive supervision for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could offer richer supervision. Additionally, the retention of non-informative tokens leads to increased computational demands and time costs, particularly in CLIP's ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the frequency transform's sensitivity to both high and low-frequency variations, which complements the spatial domain's sensitivity limited to low-frequency variations only. By incorporating frequency transforms and token-level alignment, we expand CLIP's single supervision into multi-domain and multi-level supervision, enabling a more thorough exploration of informative image features. Additionally, we introduce a token merging method guided by comprehensive semantics from the frequency and spatial domains. This allows us to merge tokens into multi-granularity tokens with a controllable compression rate to accelerate CLIP. Extensive experiments validate the effectiveness of our design. Proposes MLIP, a Multi-Perspective Language-Image Pretraining framework, which introduces frequency domain analysis and token merging to improve CLIP's data efficiency and training speed. CLIP suffers from inefficient data utilization and high computational costs due to its reliance on single contrastive supervision and the presence of non-informative tokens. MLIP splits the image encoder into Frequency and Spatial Stages for multi-domain supervision. It implements joint spatial-frequency token alignment for fine-grained representation learning and utilizes token merging guided by frequency-spatial information for acceleration. MLIP achieves competitive zero-shot and linear-probe image classification accuracy compared to CLIP and its variants. MLIP demonstrates superior performance in zero-shot image-text retrieval tasks, particularly in recall@1 metrics. MLIP achieves a better computation-performance balance than other CLIP-like models. MLIP is currently only explored with ViT-based architectures, limiting its applicability to CNN-based models. The token merging in MLIP poses challenges for its application in dense vision downstream tasks like segmentation. multimodal learning, vision-language pretraining, contrastive learning, frequency domain analysis, token merging
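As a rough illustration of the acceleration side only, the sketch below merges the most similar token pairs by averaging (a ToMe-style greedy scheme); the paper's joint frequency/spatial guidance for choosing which tokens to merge is not modeled here.

```python
import torch

def merge_tokens(tokens, r):
    """tokens: (N, C); greedily merges r pairs of most-similar tokens, returns (N - r, C)."""
    x = torch.nn.functional.normalize(tokens, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(-float("inf"))
    merged = tokens.clone()
    alive = torch.ones(len(tokens), dtype=torch.bool)
    for _ in range(r):
        idx = torch.argmax(sim)                      # most similar remaining pair
        i, j = divmod(idx.item(), sim.size(1))
        merged[i] = (merged[i] + merged[j]) / 2      # merge j into i (approximate: sims not refreshed)
        alive[j] = False
        sim[j, :] = -float("inf")
        sim[:, j] = -float("inf")
    return merged[alive]

tokens = torch.randn(197, 768)                       # ViT-style token sequence
print(merge_tokens(tokens, r=49).shape)              # torch.Size([148, 768])
```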
2406.01388 Report AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation begins to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a stable diffusion (SD) based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. Our AutoStudio hereby can generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency across multiple turns well, and it also raises the state-of-the-art performance by 13.65% in average Frechet Inception Distance and 2.83% in average character-character similarity. This paper proposes AutoStudio, a training-free multi-agent framework for multi-turn interactive image generation, which addresses the challenge of maintaining multi-subject consistency over multiple turns. Existing methods struggle to maintain consistency across multiple subjects in interactive image generation tasks, especially when users frequently switch subjects or provide complex instructions. AutoStudio employs three LLM-based agents for dialogue interpretation, layout generation, and layout supervision, along with a stable diffusion-based agent enhanced by a Parallel-UNet and a subject-initialized generation method for image synthesis. AutoStudio outperforms existing methods on CMIGBench, demonstrating superior performance in maintaining multi-subject consistency and generating high-quality images. The proposed P-UNet architecture and subject-initialized generation method effectively enhance subject consistency during image generation. Human evaluation confirms AutoStudio's ability to generate images that align better with user intentions. AutoStudio may exhibit limitations in generating intricate details, particularly in close-interaction scenarios between subjects. The use of multiple agents can increase computational time and resource requirements. multi-turn interactive image generation, multi-agent framework, subject consistency, stable diffusion, layout generation
2406.01334 Report HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models Mengcheng Li, Hongwen Zhang, Yuxiang Zhang, Ruizhi Shao, Tao Yu, Yebin Liu Recent years have witnessed a trend of the deep integration of the generation and reconstruction paradigms. In this paper, we extend the ability of controllable generative models for a more comprehensive hand mesh recovery task: direct hand mesh generation, inpainting, reconstruction, and fitting in a single framework, which we name as Holistic Hand Mesh Recovery (HHMR). Our key observation is that different kinds of hand mesh recovery tasks can be achieved by a single generative model with strong multimodal controllability, and in such a framework, realizing different tasks only requires giving different signals as conditions. To achieve this goal, we propose an all-in-one diffusion framework based on graph convolution and attention mechanisms for holistic hand mesh recovery. In order to achieve strong control generation capability while ensuring the decoupling of multimodal control signals, we map different modalities to a shared feature space and apply cross-scale random masking in both modality and feature levels. In this way, the correlation between different modalities can be fully exploited during the learning of hand priors. Furthermore, we propose Condition-aligned Gradient Guidance to enhance the alignment of the generated model with the control signals, which significantly improves the accuracy of the hand mesh reconstruction and fitting. Experiments show that our novel framework can realize multiple hand mesh recovery tasks simultaneously and outperform the existing methods in different tasks, which provides more possibilities for subsequent downstream applications including gesture recognition, pose generation, mesh editing, and so on. This paper presents HHMR, a unified graph diffusion-based framework for holistic hand mesh recovery, enabling simultaneous direct generation, inpainting, reconstruction, and fitting. Unifying these tasks within a single framework can enhance their mutual benefits and improve efficiency compared to separate models. The method utilizes a U-shaped graph convolutional network with self- and cross-attention to learn hand priors from various input conditions (images, skeletons, etc.) and progressively denoise a 3D hand mesh. It also employs random masking and a condition-aligned gradient guidance strategy for enhanced control and accuracy. HHMR generates more diverse and realistic hand meshes compared to PCA-based methods. It achieves comparable single-hypothesis reconstruction results and superior multi-hypothesis results on FreiHAND dataset, outperforming state-of-the-art approaches. The condition-aligned gradient guidance significantly improves accuracy in 2D hand mesh fitting tasks. The model might not perform well with extremely noisy or incomplete input conditions. Increasing denoising steps for higher precision comes with increased computational cost. hand mesh recovery, diffusion models, generative models, graph convolutional networks, multimodal learning
2406.01210 Report GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, Xinghao Chen Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token exchange methods which replace less informative tokens with inter-modal features, and demonstrate exchange based methods underperform cross-attention mechanisms, while the computational demand of the latter inevitably restricts its use with longer sequences. To surmount the computational challenges, we propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations. GeminiFusion elegantly combines intra-modal and inter-modal attentions, dynamically integrating complementary information across modalities. We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process. Notably, GeminiFusion maintains linear complexity with respect to the number of input tokens, ensuring this multimodal framework operates with efficiency comparable to unimodal networks. Comprehensive evaluations across multimodal image-to-image translation, 3D object detection and arbitrary-modal semantic segmentation tasks, including RGB, depth, LiDAR, event data, etc. demonstrate the superior performance of our GeminiFusion against leading-edge techniques. The PyTorch code is available at https://github.com/JiaDingCN/GeminiFusion This paper introduces GeminiFusion, an efficient pixel-wise multimodal fusion module for vision transformers that leverages the inherent alignment of multi-modality input in vision tasks, outperforming token exchange methods like TokenFusion. Multimodal fusion in vision transformers is often limited by either the sub-optimality of token exchange methods or the computational overhead of cross-attention mechanisms. GeminiFusion addresses these limitations, offering both efficiency and state-of-the-art performance. GeminiFusion prioritizes interactions between spatially co-located patches from different modalities using a pixel-wise attention mechanism. It incorporates a relation discriminator to improve feature selection and layer-adaptive noise for better self/cross-attention balance. GeminiFusion consistently outperforms TokenFusion on multimodal semantic segmentation tasks, achieving improvements up to 3.4% in mIoU on the DeLiVER dataset. It also excels in image-to-image translation, showing significant improvements in FID/KID scores on the Taskonomy dataset. GeminiFusion demonstrates efficiency gains over TokenFusion, achieving comparable inference latency to unimodal networks. GeminiFusion, in its current form, is primarily designed for homogeneous modalities and might not be directly applicable to heterogeneous data like images paired with audio or text. Further research is needed to extend its capabilities to handle heterogeneous data combinations. multimodal fusion, vision transformer, geminifusion, semantic segmentation, image-to-image translation
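A minimal sketch of pixel-wise fusion: each token of one modality attends only over the two tokens at its own spatial location (itself and the co-located token from the other modality), which keeps cost linear in sequence length. The relation discriminator and layer-adaptive noise from the paper are omitted, and the module name is illustrative.

```python
import torch
import torch.nn as nn

class PixelWiseFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_a, x_b):                              # (B, N, C) tokens of two aligned modalities
        q = self.q(x_a)                                       # queries from modality A
        k = torch.stack([self.k(x_a), self.k(x_b)], dim=2)    # (B, N, 2, C): self + co-located key
        v = torch.stack([self.v(x_a), self.v(x_b)], dim=2)
        attn = (q.unsqueeze(2) * k).sum(-1) * self.scale      # (B, N, 2) attention over two candidates
        attn = attn.softmax(dim=-1)
        return x_a + (attn.unsqueeze(-1) * v).sum(dim=2)      # fused A-branch tokens

rgb, depth = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(PixelWiseFusion(256)(rgb, depth).shape)                 # torch.Size([2, 196, 256])
```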
2406.01203 Report Scaling Up Deep Clustering Methods Beyond ImageNet-1K Nikolas Adaloglou, Felix Michels, Kaspar Senft, Diana Petrusheva, Markus Kollmann Deep image clustering methods are typically evaluated on small-scale balanced classification datasets while feature-based $k$-means has been applied on proprietary billion-scale datasets. In this work, we explore the performance of feature-based deep clustering approaches on large-scale benchmarks whilst disentangling the impact of the following data-related factors: i) class imbalance, ii) class granularity, iii) easy-to-recognize classes, and iv) the ability to capture multiple classes. Consequently, we develop multiple new benchmarks based on ImageNet21K. Our experimental analysis reveals that feature-based $k$-means is often unfairly evaluated on balanced datasets. However, deep clustering methods outperform $k$-means across most large-scale benchmarks. Interestingly, $k$-means underperforms on easy-to-classify benchmarks by large margins. The performance gap, however, diminishes on the highest data regimes such as ImageNet21K. Finally, we find that non-primary cluster predictions capture meaningful classes (i.e. coarser classes). This paper presents a comprehensive experimental study on large-scale image clustering methods and benchmarks, focusing on factors like class imbalance, granularity, ease of classification, and multi-label capture. Existing deep image clustering methods are often evaluated on small, balanced datasets, limiting their applicability to real-world, large-scale scenarios. This work addresses this gap by exploring their performance on challenging, large-scale benchmarks. The study creates new benchmarks based on ImageNet21K, varying factors like class imbalance and granularity. It compares the performance of feature-based k-means with deep clustering methods like TEMI and SCANv2 on these benchmarks. Deep clustering methods outperform k-means on most benchmarks, except for cases with highly coarse labels or the largest dataset scales. K-means performs poorly on benchmarks with easily classifiable classes, suggesting limitations in capturing irregular class shapes. Non-primary cluster predictions from clustering methods can capture meaningful secondary classes like coarser labels. The study relies on pre-trained feature extractors, limiting the exploration of feature learning in clustering. The sensitivity of SCANv2 to mini-batch size poses computational challenges for large-scale datasets. image clustering, large-scale benchmarks, class imbalance, class granularity, multi-label clustering
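The feature-based k-means baseline discussed here can be sketched as follows: cluster frozen backbone features with k-means and score the result with Hungarian-matched clustering accuracy. The random features and label counts below are stand-ins for real embeddings (e.g. from a DINOv2-style backbone) and the ImageNet21K-derived benchmarks.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(labels_true, labels_pred):
    k = max(labels_true.max(), labels_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(labels_true, labels_pred):
        cost[t, p] += 1                                      # co-occurrence counts
    rows, cols = linear_sum_assignment(cost.max() - cost)    # maximize matched samples
    return cost[rows, cols].sum() / len(labels_true)

features = np.random.randn(5000, 768)                        # frozen backbone features (stand-in)
labels = np.random.randint(0, 50, size=5000)                 # ground-truth classes (stand-in)
pred = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(features)
print(f"clustering accuracy: {cluster_accuracy(labels, pred):.3f}")
```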
2406.01188 Report UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang Recent diffusion-based human image animation techniques have demonstrated impressive success in synthesizing videos that faithfully follow a given reference identity and a sequence of desired movement poses. Despite this, there are still two limitations: i) an extra reference model is required to align the identity image with the main video branch, which significantly increases the optimization burden and model parameters; ii) the generated video is usually short in time (e.g., 24 frames), hampering practical applications. To address these shortcomings, we present a UniAnimate framework to enable efficient and long-term human video generation. First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model. Second, we propose a unified noise input that supports random noised input as well as first frame conditioned input, which enhances the ability to generate long-term video. Finally, to further efficiently handle long sequences, we explore an alternative temporal modeling architecture based on state space model to replace the original computation-consuming temporal Transformer. Extensive experimental results indicate that UniAnimate achieves superior synthesis results over existing state-of-the-art counterparts in both quantitative and qualitative evaluations. Notably, UniAnimate can even generate highly consistent one-minute videos by iteratively employing the first frame conditioning strategy. Code and models will be publicly available. Project page: https://unianimate.github.io/. This paper proposes UniAnimate, a novel video diffusion model framework for consistent and efficient human image animation, addressing limitations of existing methods in handling long video generation and appearance misalignment. Human image animation is a challenging task crucial for various applications like video creation and virtual reality. Existing methods face limitations in maintaining temporal consistency, aligning appearance with reference images, and generating long videos. UniAnimate leverages a unified video diffusion model to encode both reference image and video content in a shared feature space for enhanced appearance alignment. It introduces a unified noised input supporting both random and first-frame conditioned videos for smooth transitions in long sequences. Additionally, it explores temporal Mamba, an efficient alternative to temporal Transformers for long-range temporal modeling. UniAnimate demonstrates superior performance over state-of-the-art methods on benchmark datasets, achieving higher visual quality, identity preservation, and temporal consistency. The proposed unified video diffusion model significantly improves appearance alignment compared to using separate networks for reference image and video generation. Temporal Mamba proves to be an effective and efficient alternative to temporal Transformers for long video generation, exhibiting comparable performance with reduced memory consumption. Generating fine-grained details in facial and hand regions remains challenging. Occasional inconsistencies in completing invisible parts across different video segments may lead to temporal artifacts. video generation, human image animation, diffusion model, temporal modeling, appearance alignment
2406.01159 Report Dimba: Transformer-Mamba Diffusion Models Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba's sequentially stacked blocks alternate between Transformer and Mamba layers and integrate conditional information through the cross-attention layer, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning, resolution adaptation, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios that cater to specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformers-based benchmarks. Extensive experiments indicate that Dimba achieves performance comparable to benchmarks in terms of image quality, artistic rendering, and semantic control. We also report several intriguing properties of the architecture discovered during evaluation and release checkpoints in experiments. Our findings emphasize the promise of large-scale hybrid Transformer-Mamba architectures in the foundational stage of diffusion models, suggesting a bright future for text-to-image generation. This paper introduces Dimba, a novel text-to-image diffusion model that leverages a hybrid architecture combining Transformer and Mamba layers for enhanced efficiency and performance. Existing text-to-image models often suffer from high memory requirements and limitations in handling long contexts. Dimba addresses these limitations by integrating the strengths of both Transformer and Mamba architectures. Dimba interleaves Transformer and Mamba layers, incorporating conditional information through cross-attention. The authors trained Dimba using a large-scale, curated image-text dataset with a focus on aesthetic quality, employing techniques like quality tuning and resolution adaptation. Dimba achieves comparable image quality and semantic alignment compared to existing diffusion models, as evidenced by FID scores and T2I-CompBench results. The hybrid architecture allows for flexibility in balancing throughput and memory requirements based on specific needs. Quality tuning with a curated dataset significantly improves the aesthetic quality of generated images. Dimba may inherit biases from the training data, impacting its ability to generate certain styles, scenes, or objects. Potential negative social impacts, such as perpetuating stereotypes, need to be addressed in future research. text-to-image generation, diffusion models, hybrid architecture, transformer, mamba
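A hedged sketch of the interleaving idea follows. The `SSMMixerStub` is a placeholder sequence mixer (a gated depthwise convolution), not the actual selective state-space layer, and text cross-attention plus diffusion-specific conditioning are left out.

```python
import torch
import torch.nn as nn

class SSMMixerStub(nn.Module):
    """Placeholder for a Mamba-style layer: gated depthwise 1D conv over the token sequence."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                     # (B, N, C)
        h = self.norm(x)
        mixed = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.out(mixed * torch.sigmoid(self.gate(h)))

def hybrid_backbone(dim=256, depth=6, heads=4):
    layers = []
    for i in range(depth):
        if i % 2 == 0:                                        # alternate attention / sequence mixer
            layers.append(nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True))
        else:
            layers.append(SSMMixerStub(dim))
    return nn.Sequential(*layers)

tokens = torch.randn(2, 1024, 256)                            # latent image tokens
print(hybrid_backbone()(tokens).shape)                        # torch.Size([2, 1024, 256])
```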
2406.01125 Report $Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, Tao Chen Diffusion models are widely recognized for generating high-quality and diverse images, but their poor real-time performance has led to numerous acceleration works, primarily focusing on UNet-based structures. With the more successful results achieved by diffusion transformers (DiT), there is still a lack of exploration regarding the impact of DiT structure on generation, as well as the absence of an acceleration framework tailored to the DiT architecture. To tackle these challenges, we conduct an investigation into the correlation between DiT blocks and image generation. Our findings reveal that the front blocks of DiT are associated with the outline of the generated images, while the rear blocks are linked to the details. Based on this insight, we propose an overall training-free inference acceleration framework $\Delta$-DiT: using a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages. Specifically, a DiT-specific cache mechanism called $\Delta$-Cache is proposed, which considers the inputs of the previous sampling image and reduces the bias in the inference. Extensive experiments on PIXART-$\alpha$ and DiT-XL demonstrate that the $\Delta$-DiT can achieve a $1.6\times$ speedup on the 20-step generation and even improves performance in most cases. In the scenario of 4-step consistent model generation and the more challenging $1.12\times$ acceleration, our method significantly outperforms existing methods. Our code will be publicly available. This paper introduces Δ-DiT, a training-free inference acceleration method for diffusion transformers (DiT) that leverages a novel cache mechanism called Δ-Cache. Existing diffusion model acceleration techniques primarily focus on UNet architectures, while DiT models lack dedicated acceleration frameworks despite their success. The paper first analyzes challenges in applying existing cache methods to DiT and proposes Δ-Cache, which caches feature map deviations instead of the maps themselves to preserve information. It then investigates the impact of DiT blocks on generation, finding that front blocks contribute to outlines while rear blocks contribute to details. Δ-DiT leverages this by caching rear blocks in early sampling stages (outline generation) and front blocks in later stages (detail generation). Δ-DiT achieves a 1.6x speedup on 20-step generation with comparable or better generation quality compared to baseline models. Δ-DiT outperforms existing methods in challenging scenarios like 4-step consistent model generation and at higher acceleration ratios (1.12x). The proposed Δ-Cache method is compatible with various advanced solvers and consistently outperforms baseline methods. The exploration of the relationship between DiT blocks and generated images is preliminary and coarse-grained. Future work could explore more fine-grained search or learning strategies for further improvements. diffusion models, transformers, inference acceleration, cache mechanism, image generation
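The general caching idea can be sketched as below: record the residual a group of blocks adds to its input at one sampling step, then skip those blocks and reuse the cached residual at the next step. Which blocks are skipped at which stage of sampling (rear blocks early, front blocks late) is the paper's scheduling and is not reproduced; the class and flag names are illustrative.

```python
import torch
import torch.nn as nn

class DeltaCachedStack(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.cached_delta = None

    def forward(self, x, use_cache=False):
        if use_cache and self.cached_delta is not None:
            return x + self.cached_delta          # reuse the previous step's contribution
        h = x
        for blk in self.blocks:
            h = blk(h)
        self.cached_delta = (h - x).detach()      # cache the deviation, not the feature map itself
        return h

blocks = [nn.TransformerEncoderLayer(256, 4, 1024, batch_first=True) for _ in range(4)]
stack = DeltaCachedStack(blocks)
x = torch.randn(1, 64, 256)
full = stack(x)                                    # full computation, fills the cache
fast = stack(x, use_cache=True)                    # cached (skipped) step
print((full - fast).abs().max())                   # zero here, since x is unchanged
```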
2406.01069 Report UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Existing methods typically address these tasks independently due to distinct learning objectives. However, they neglect the underlying interconnectedness of both tasks, which hinders the learning of task-agnostic shared representations for human subjective perception. To confront this challenge, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA), to learn general perceptions of two tasks, thereby benefiting them simultaneously. Addressing the absence of text in the IQA datasets and the presence of textual noise in the IAA datasets, (1) we utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) the generated text for IAA serves as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a lightweight adapter that utilizes versatile cues to fully exploit the extensive knowledge of the pre-trained model. Extensive experiments demonstrate that our approach attains a new state-of-the-art performance on both IQA and IAA tasks, while concurrently showcasing exceptional zero-shot and few-label image assessment capabilities. The source code will be available at https://github.com/zht8506/UniQA. This paper proposes UniQA, a novel method for unified vision-language pre-training for both Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) tasks. Existing methods often address IQA and IAA independently, neglecting the interconnectedness of human perception of image quality and aesthetics. This unified approach aims to learn generalizable representations for both tasks, enhancing their effectiveness and efficiency. The proposed UniQA utilizes Multimodal Large Language Models (MLLMs) to generate quality- and aesthetics-related descriptions for IQA and IAA datasets. It leverages these descriptions to pre-train a vision-language model and introduces a lightweight Multi-Cue Integration Adapter to fine-tune the pre-trained model on specific IQA and IAA datasets. UniQA achieves state-of-the-art performance on multiple benchmark datasets for both IQA and IAA tasks. The model demonstrates excellent zero-shot and few-label image assessment capabilities, indicating its strong generalization ability and data efficiency. Qualitative results showcasing the model's ability to retrieve images based on quality- and aesthetics-related queries provide further evidence of its effectiveness. The generated captions by MLLMs often have similar structures, potentially limiting the diversity and richness of representations learned during pre-training. Exploring methods to enhance the diversity of MLLMs-generated captions, such as integrating multiple MLLMs or using in-context learning, is an important direction for future research. image quality assessment, image aesthetic assessment, vision-language pre-training, multimodal large language models, zero-shot learning
2406.01062 Report SceneTextGen: Layout-Agnostic Scene Text Image Synthesis with Diffusion Models Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N. Metaxas, Praveen Krishnan While diffusion models have significantly advanced the quality of image generation, their capability to accurately and coherently render text within these images remains a substantial challenge. Conventional diffusion-based methods for scene text generation are typically limited by their reliance on an intermediate layout output. This dependency often results in a constrained diversity of text styles and fonts, an inherent limitation stemming from the deterministic nature of the layout generation phase. To address these challenges, this paper introduces SceneTextGen, a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage. By doing so, SceneTextGen facilitates a more natural and varied representation of text. The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder for capturing detailed typographic properties, coupled with a character-level instance segmentation model and a word-level spotting model to address the issues of unwanted text generation and minor character inaccuracies. We validate the performance of our method by demonstrating improved character recognition rates on generated images across different public visual text datasets in comparison to both standard diffusion based methods and text specific methods. Introduces SceneTextGen, a novel diffusion-based model for scene text generation that surpasses the limitations of predefined layouts, allowing flexible text placement and diverse text styles. Current diffusion models struggle to generate text within images that is both visually appealing and contextually relevant. Existing methods are limited by predefined layouts, restricting diversity in font styles and text positioning. SceneTextGen utilizes a character-level encoder to capture typographic properties and integrates this information into the cross-attention layers of a diffusion model. It also employs word-level and character-level losses to ensure text accuracy and prevent excessive text generation. SceneTextGen outperforms existing models in OCR-based text recognition scores, indicating its ability to generate clear and accurate text. SceneTextGen demonstrates superior diversity in font styles compared to methods relying on predefined layouts. The model shows strong generalization capability, achieving robust OCR scores on datasets beyond its training data. SceneTextGen faces challenges in generating complex visual elements in conjunction with text. The model's performance in terms of text accuracy and coherence decreases with increasing text length. text generation, image generation, diffusion models, scene text, computer vision
2406.01042 Report Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting Fang Li, Hao Zhang, Narendra Ahuja Gaussian Splatting (GS) has significantly elevated scene reconstruction efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS methods, whether based on GS or NeRF, primarily rely on camera parameters provided by COLMAP and even utilize sparse point clouds generated by COLMAP for initialization, which lack accuracy and are time-consuming. This sometimes results in poor dynamic scene representation, especially in scenes with large object movements, or extreme camera conditions, e.g., small translations combined with large rotations. Some studies simultaneously optimize the estimation of camera parameters and scenes, supervised by additional information like depth, optical flow, etc. obtained from off-the-shelf models. Using this unverified information as ground truth can reduce robustness and accuracy, which frequently occurs for long monocular videos (with e.g. > hundreds of frames). We propose a novel approach that learns a high-fidelity 4D GS scene representation with self-calibration of camera parameters. It includes the extraction of 2D point features that robustly represent 3D structure, and their use for subsequent joint optimization of camera parameters and 3D structure towards overall 4D scene optimization. We demonstrate the accuracy and time efficiency of our method through extensive quantitative and qualitative experimental results on several standard benchmarks. The results show significant improvements over state-of-the-art methods for 4D novel view synthesis. The source code will be released soon at https://github.com/fangli333/SC-4DGS. This paper proposes SC-4DGS, a novel method for high-fidelity 4D novel view synthesis of dynamic scenes using Gaussian Splatting with self-calibrated camera parameters, eliminating the need for camera priors and handling videos of varying lengths. Current 4D NVS methods often rely on external camera parameter estimation tools like COLMAP, which can be inaccurate and time-consuming, especially for dynamic scenes with large object movements or complex camera trajectories. SC-4DGS addresses these limitations by jointly optimizing camera parameters and scene representation. The method employs a three-step process: 1) Structural Point Extraction (SPE) to establish 2D-3D correspondences of structural points across frames. 2) Joint optimization of camera parameters and 3D structural points using extracted 2D points and their correspondence. 3) Optimization of dynamic scene representation using a Canonical Field and a Deformation Field, initialized with the optimized 3D structural points. SC-4DGS achieves comparable or superior novel view synthesis quality to state-of-the-art methods on benchmark datasets like NeRF-DS and DAVIS. The proposed method demonstrates more robust and accurate camera parameter estimation compared to COLMAP and RoDynRF, especially in scenes with extreme camera motions. SC-4DGS efficiently handles long monocular videos, overcoming limitations of existing methods like RoDynRF. The method currently assumes a constant focal length throughout the video, limiting its applicability to scenarios with zoom effects. Reliance on ground truth motion masks as input poses challenges for scenes with complex, high-speed fluid motion, suggesting future work in automatic motion segmentation. novel view synthesis, gaussian splatting, self-calibration, dynamic scene reconstruction, camera parameter estimation
2406.01020 Report CLIP-Guided Attribute Aware Pretraining for Generalizable Image Quality Assessment Daekyu Kwon, Dongyoung Kim, Sehwan Ki, Younghyun Jo, Hyong-Euk Lee, Seon Joo Kim In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models. Conventional methods address this issue by utilizing large datasets to extract rich representations for IQA. Also, some approaches propose vision language models (VLM) based IQA, but the domain gap between generic VLM and IQA constrains their scalability. In this work, we propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge from VLM and leveraging the scalability of large datasets. Specifically, we carefully select optimal text prompts for five representative image quality attributes and use VLM to generate pseudo-labels. Numerous attribute-aware pseudo-labels can be generated with large image datasets, allowing our IQA model to learn rich representations about image quality. Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities. Leveraging these strengths, we propose several applications, such as evaluating image generation models and training image enhancement models, demonstrating our model's real-world applicability. We will make the code available for access. Presents ATTIQA, a novel pretraining framework for IQA that leverages CLIP's knowledge and large datasets to construct a generalizable and robust attribute-aware representation space. Addresses the limitations of traditional IQA methods, which suffer from small dataset sizes and poor generalization abilities, by effectively integrating VLM (CLIP) and large-scale data pretraining for enhanced IQA. Employs a two-stage approach: 1) Prompt Selection: Utilizes GPT-4 to generate candidate prompts and selects optimal prompts for 5 key attributes via proxy tasks measuring distortion intensity and human perception alignment. 2) Pretraining Pipeline: Generates attribute-aware pseudo-labels using CLIP with selected prompts on a large dataset (ImageNet) and trains the IQA model with a ranking-based loss for enhanced robustness. Achieves state-of-the-art performance on multiple IQA and aesthetic quality datasets, demonstrating significant improvements over existing methods. Exhibits superior generalization capabilities in cross-dataset validation, indicating robustness and adaptability to unseen data. Successfully applied as a metric for evaluating generative models and guiding image enhancement through reinforcement learning, highlighting its practical value in real-world scenarios. Current attribute focus is limited to five common attributes, potentially overlooking other relevant image quality factors. Future work will explore expanding the representation space to incorporate additional attributes and further enhance the model's comprehensiveness. image quality assessment, vision language model, clip, pretraining, generalization
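A minimal sketch of attribute-aware pseudo-labeling with an off-the-shelf CLIP is shown below, scoring a single attribute by a softmax over a positive/negative prompt pair. The prompts and the gray stand-in image are made up for illustration; the paper selects its prompts with GPT-4 candidates and proxy tasks, which is not reproduced here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative prompt pair for one attribute (sharpness); in practice there
# would be one selected pair per attribute, applied over a large image set.
prompts = ["a sharp, well-focused photo", "a blurry, out-of-focus photo"]
image = Image.new("RGB", (224, 224), color=(128, 128, 128))   # stand-in image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image        # (1, 2) image-text similarities
pseudo_label = logits.softmax(dim=-1)[0, 0].item()   # probability of the positive prompt
print(f"sharpness pseudo-label: {pseudo_label:.3f}")
```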
2406.00985 Report MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu Suresh Lokhande, Siwei Lyu Text-driven image synthesis has made significant advancements with the development of diffusion models, transforming how visual content is generated from text prompts. Despite these advances, text-driven image editing, a key area in computer graphics, faces unique challenges. A major challenge is making simultaneous edits across multiple objects or attributes. Applying existing methods sequentially for multi-aspect edits increases computational demands and degrades efficiency. In this paper, we address these challenges with significant contributions. Our main contribution is the development of MultiEdits, a method that seamlessly manages simultaneous edits across multiple attributes. In contrast to previous approaches, MultiEdits not only preserves the quality of single-attribute edits but also significantly improves the performance of multitasking edits. This is achieved through an innovative attention distribution mechanism and a multi-branch design that operates across several processing heads. Additionally, we introduce the PIE-Bench++ dataset, an expansion of the original PIE-Bench dataset, to better support evaluating image-editing tasks involving multiple objects and attributes simultaneously. This dataset is a benchmark for evaluating text-driven image editing methods in multifaceted scenarios. Dataset and code are available at https://mingzhenhuang.com/projects/MultiEdits.html. Introduces MultiEdits, a method for text-driven image editing that efficiently handles simultaneous edits across multiple attributes. Addresses the limitations of existing methods that struggle with multi-aspect editing due to computational overhead and error accumulation in sequential applications. Utilizes an attention grouping mechanism to categorize edits, employs multiple target branches for parallel processing, and leverages cross-branch interactions for consistency. Outperforms state-of-the-art methods in terms of editing effectiveness and efficiency on the introduced PIE-Bench++ dataset. Demonstrates robustness across varying numbers of editing aspects. Maintains content and background preservation during multi-aspect editing. Limitations in handling text editing within images and dramatic background changes. Future work includes exploring semantic order of edits and addressing limitations. text-driven image editing, multi-aspect editing, diffusion models, attention mechanism, pie-bench++ dataset
2406.00908 Report ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation Shaoshu Yang, Yong Zhang, Xiaodong Cun, Ying Shan, Ran He Video generation has made remarkable progress in recent years, especially since the advent of the video diffusion models. Many video generation models can produce plausible synthetic videos, e.g., Stable Video Diffusion (SVD). However, most video models can only generate low frame rate videos due to the limited GPU memory as well as the difficulty of modeling a large set of frames. The training videos are always uniformly sampled at a specified interval for temporal compression. Previous methods promote the frame rate by either training a video interpolation model in pixel space as a postprocessing stage or training an interpolation model in latent space for a specific base video model. In this paper, we propose a training-free video interpolation method for generative video diffusion models, which is generalizable to different models in a plug-and-play manner. We investigate the non-linearity in the feature space of video diffusion models and transform a video model into a self-cascaded video diffusion model by incorporating the designed hidden state correction modules. The self-cascaded architecture and the correction module are proposed to retain the temporal consistency between key frames and the interpolated frames. Extensive evaluations are performed on multiple popular video models to demonstrate the effectiveness of the proposed method; notably, our training-free method is even comparable to trained interpolation models supported by huge compute resources and large-scale datasets. This paper introduces a training-free video interpolation method for generative video diffusion models, enhancing their frame rate generation capabilities without requiring additional training data or parameter updates. Existing video generation models often produce low frame rate videos due to GPU memory limitations and challenges in modeling long sequences. Current interpolation methods necessitate training or are model-specific, hindering their generalizability. The method transforms the target video model into a self-cascaded architecture with hidden state correction modules. These modules refine hidden states within the transformer blocks for improved temporal consistency across generated frames. The method generates high frame rate (2x and 4x) videos with superior visual quality and temporal consistency compared to direct inference and latent space back-projection. ZeroSmooth maintains key frame content effectively during high frame rate generation, evidenced by high PSNR and SSIM scores. The approach exhibits competitive performance against training-based video interpolation methods while remaining entirely training-free. The interpolation performance heavily depends on the quality and consistency of the base video model's generated frames. Future work could explore extending this method to handle variable frame rate interpolation for more flexible video generation. video generation, video interpolation, diffusion models, training-free, self-cascaded architecture
2406.00830 Report Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection Yang Cao, Yihan Zeng, Hang Xu, Dan Xu Open-vocabulary 3D Object Detection (OV-3DDet) addresses the detection of objects from an arbitrary list of novel categories in 3D scenes, which remains a very challenging problem. In this work, we propose CoDAv2, a unified framework designed to innovatively tackle both the localization and classification of novel 3D objects, under the condition of limited base categories. For localization, the proposed 3D Novel Object Discovery (3D-NOD) strategy utilizes 3D geometries and 2D open-vocabulary semantic priors to discover pseudo labels for novel objects during training. 3D-NOD is further extended with an Enrichment strategy that significantly enriches the novel object distribution in the training scenes, and then enhances the model's ability to localize more novel objects. The 3D-NOD with Enrichment is termed 3D-NODE. For classification, the Discovery-driven Cross-modal Alignment (DCMA) module aligns features from 3D point clouds and 2D/textual modalities, employing both class-agnostic and class-specific alignments that are iteratively refined to handle the expanding vocabulary of objects. Besides, 2D box guidance boosts the classification accuracy against complex background noises, which is coined as Box-DCMA. Extensive evaluation demonstrates the superiority of CoDAv2. CoDAv2 outperforms the best-performing method by a large margin (AP_Novel of 9.17 vs. 3.61 on SUN-RGBD and 9.12 vs. 3.74 on ScanNetv2). Source code and pre-trained models are available at the GitHub project page. This paper presents CoDAv2, an open-vocabulary 3D object detection framework that can localize and classify novel 3D objects by learning from limited base categories. Open-Vocabulary 3D Object Detection (OV-3DDet) is crucial for real-world applications where novel object categories are frequently encountered. CoDAv2 employs a 3D Novel Object Discovery with Enrichment (3D-NODE) strategy for localization and a Discovery-driven Cross-Modal Alignment with box guidance (Box-DCMA) module for classification. CoDAv2 significantly outperforms previous state-of-the-art methods, achieving AP_Novel scores of 9.17 vs. 3.61 on SUN-RGBD and 9.12 vs. 3.74 on ScanNetv2. 3D-NODE effectively discovers novel objects during training by leveraging 3D geometry and 2D semantic priors, leading to improved localization. Box-DCMA aligns 3D features with 2D/textual features from CLIP, enhancing classification accuracy and effectively discriminating against background noise. The open-vocabulary ability decreases when tested with a large number of novel categories due to the limitations of using only point cloud data. Future work may explore incorporating multi-modality inputs to enhance performance on larger vocabularies. open-vocabulary 3d object detection, 3d novel object discovery, cross-modal alignment, 3d perception, multi-modality learning
2406.00687 Report Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image Priors Ohad Rahamim, Hilit Segev, Idan Achituve, Yuval Atzmon, Yoni Kasten, Gal Chechik Generating 3D visual scenes is at the forefront of visual generative AI, but current 3D generation techniques struggle with generating scenes with multiple high-resolution objects. Here we introduce Lay-A-Scene, which solves the task of Open-set 3D Object Arrangement, effectively arranging unseen objects. Given a set of 3D objects, the task is to find a plausible arrangement of these objects in a scene. We address this task by leveraging pre-trained text-to-image models. We personalize the model and explain how to generate images of a scene that contains multiple predefined objects without neglecting any of them. Then, we describe how to infer the 3D poses and arrangement of objects from a 2D generated image by finding a consistent projection of objects onto the 2D scene. We evaluate the quality of Lay-A-Scene using 3D objects from Objaverse and human raters and find that it often generates coherent and feasible 3D object arrangements. Lay-A-Scene is a novel method for open-set 3D object arrangement, leveraging pre-trained text-to-image diffusion models to arrange unseen 3D objects into plausible scenes based on textual descriptions. Generating 3D scenes with multiple, high-resolution objects is a challenging problem in visual generative AI. Existing methods often struggle with object neglect and generating coherent layouts. Lay-A-Scene addresses these challenges by leveraging the rich spatial understanding of text-to-image models. Lay-A-Scene uses a two-stage approach: 1) **Personalized Image Generation:** Fine-tunes a pre-trained text-to-image model with rendered views of the input objects to generate a scene image incorporating them. 2) **Transformation Optimization:** Infers 3D object poses from the scene image using a novel pose-estimation procedure that combines Perspective-n-Points with physical constraints to find plausible object placements. Outperforms baseline methods in terms of FID, KID, and CLIP similarity scores, indicating more realistic and textually-aligned scene generation. Human raters significantly prefer layouts generated by Lay-A-Scene over random and circular arrangements, demonstrating its ability to create more plausible and aesthetically pleasing scenes. Ablation studies highlight the importance of both the personalization stage and the constrained Perspective-n-Points procedure in achieving high-quality results. Limited to arranging objects and does not generate scene context. Performance depends on the underlying text-to-image personalization method, which can suffer from object neglect, particularly with a large number of objects. 3d scene synthesis, text-to-image generation, object arrangement, personalization, perspective-n-points
2406.00670 Report Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at: https://github.com/HVision-NKU/Cascade-CLIP The paper proposes Cascade-CLIP, a cascaded vision-language embedding alignment framework, for zero-shot semantic segmentation using multi-level features from pre-trained CLIP models. Existing CLIP-based methods for zero-shot semantic segmentation primarily use features from the last layer, neglecting the rich object details present in intermediate layers. Directly aggregating multi-level features degrades performance due to large feature discrepancies between layers, weakening CLIP's zero-shot capability. Cascade-CLIP splits the CLIP visual encoder into stages and aligns multi-level visual features with text embeddings using cascaded decoders. It employs a Neighborhood Gaussian Aggregation (NGA) module to fuse multi-level features within each stage, assigning weights based on feature block proximity. Cascade-CLIP significantly improves zero-shot segmentation performance on COCO-Stuff, Pascal-VOC, and Pascal-Context datasets. It effectively captures object details and boundaries by leveraging multi-level features, outperforming methods relying solely on last-layer features. The cascaded alignment with independent decoders and NGA module effectively addresses the challenge of feature discrepancies between different layers in CLIP. The performance improvement with increasing cascaded decoders plateaus after a certain point. Further exploration of optimal stage splitting and feature aggregation strategies within Cascade-CLIP is possible. zero-shot learning, semantic segmentation, vision-language models, clip, multi-level features
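As a rough illustration of Cascade-CLIP's Neighborhood Gaussian Aggregation idea, the sketch below weights visual features from neighboring encoder blocks of a stage by their proximity to the stage output using a Gaussian kernel; the weighting scheme, normalization, and sigma are assumptions, not the paper's exact formulation.

```python
# Minimal sketch of Gaussian-weighted fusion of multi-level ViT features (weights/sigma assumed).
import torch

def neighborhood_gaussian_aggregate(features, sigma: float = 1.0):
    """features: list of (B, N, C) tensors from consecutive encoder blocks of one stage.
    Blocks closer to the stage output receive larger Gaussian weights."""
    L = len(features)
    dist = torch.arange(L - 1, -1, -1, dtype=torch.float32)   # distance from the last block
    weights = torch.exp(-dist.pow(2) / (2 * sigma ** 2))
    weights = weights / weights.sum()
    stacked = torch.stack(features, dim=0)                    # (L, B, N, C)
    return (weights.view(L, 1, 1, 1) * stacked).sum(dim=0)    # fused (B, N, C)

# Usage (hypothetical block outputs): fused = neighborhood_gaussian_aggregate([f9, f10, f11, f12])
```

Each fused stage feature would then be aligned with the text embeddings by its own decoder in the cascade, which is how the framework sidesteps the discrepancy between raw features from different depths.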
2406.00637 Report Representing Animatable Avatar via Factorized Neural Fields Chunjin Song, Zhijie Wu, Bastian Wandt, Leonid Sigal, Helge Rhodin For reconstructing high-fidelity human 3D models from monocular videos, it is crucial to maintain consistent large-scale body shapes along with finely matched subtle wrinkles. This paper explores the observation that the per-frame rendering results can be factorized into a pose-independent component and a corresponding pose-dependent equivalent to facilitate frame consistency. Pose-adaptive textures can be further improved by restricting frequency bands of these two components. In detail, pose-independent outputs are expected to be low-frequency, while high-frequency information is linked to pose-dependent factors. We achieve a coherent preservation of both coarse body contours across the entire input video and fine-grained, time-variant texture features with a dual-branch network with distinct frequency components. The first branch takes coordinates in canonical space as input, while the second branch additionally considers features outputted by the first branch and pose information of each frame. Our network integrates the information predicted by both branches and utilizes volume rendering to generate photo-realistic 3D human images. Through experiments, we demonstrate that our network surpasses the neural radiance fields (NeRF) based state-of-the-art methods in preserving high-frequency details and ensuring consistent body contours. This paper introduces a novel two-branch neural network that factorizes animatable avatar rendering into pose-independent and pose-dependent components, associating them with low and high frequencies respectively, to improve avatar representation learning from monocular videos. Reconstructing high-fidelity human avatars from monocular videos requires preserving both consistent large-scale body shapes and fine-grained, time-variant details like wrinkles, which is challenging for existing methods. The method uses skeletal deformation to obtain canonical coordinates and employs a dual-branch network. One branch processes low-frequency pose-independent information, while the other handles high-frequency pose-dependent details. A common loss function encourages information maximization in the pose-independent branch. The final output merges both branches' results and uses SDF-based volume rendering to generate images. The method outperforms state-of-the-art approaches in novel view synthesis, demonstrating superior texture detail and shape preservation. It exhibits significant improvement in novel pose rendering, generating more realistic and artifact-free results, particularly for challenging unseen poses. The approach excels in 3D shape reconstruction, capturing both smooth body surfaces and intricate geometric details like wrinkles more effectively. The model's reliance on dense MLP computations within the volume rendering framework poses limitations on real-time applications. The current framework lacks explicit pattern editing capabilities. avatar representation learning, neural rendering, monocular human reconstruction, frequency-aware factorization, signed distance function (sdf)
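A minimal sketch of the frequency-factorized dual-branch idea follows, assuming Fourier positional encodings with different band limits and an SMPL-style pose vector; the layer sizes, band limits, and the additive merge rule are illustrative choices, not the authors' architecture.

```python
# Illustrative dual-branch field: low-frequency pose-free branch + high-frequency pose branch.
import torch
import torch.nn as nn

def fourier_features(x, num_bands: int, max_freq: float):
    """Positional encoding whose frequencies are capped to bias a branch's bandwidth."""
    freqs = 2.0 ** torch.linspace(0.0, max_freq, num_bands, device=x.device)
    angles = x.unsqueeze(-1) * freqs                              # (..., 3, num_bands)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

class FactorizedField(nn.Module):
    def __init__(self, pose_dim: int = 69, hidden: int = 128):
        super().__init__()
        self.low = nn.Sequential(nn.Linear(3 * 2 * 4, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.high = nn.Sequential(nn.Linear(3 * 2 * 10 + hidden + pose_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden))
        self.head = nn.Linear(hidden, 4)                          # RGB + SDF/density

    def forward(self, x_canonical, pose):
        # Low-frequency branch: canonical coordinates only, few encoding bands.
        low_feat = self.low(fourier_features(x_canonical, 4, 3.0))
        # High-frequency branch: more bands, plus low-branch features and per-frame pose.
        high_in = torch.cat([fourier_features(x_canonical, 10, 9.0), low_feat, pose], dim=-1)
        return self.head(low_feat + self.high(high_in))           # merge both branches
```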
2406.00633 Report Improving GFlowNets for Text-to-Image Diffusion Alignment Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang Zhang, Josh Susskind, Navdeep Jaitly, Shuangfei Zhai Diffusion models have become the de facto approach for generating visual data, which are trained to match the distribution of the training dataset. In addition, we also want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion models to achieve this goal through reinforcement learning-based algorithms. Nonetheless, they suffer from issues including slow credit assignment as well as low quality in their generated samples. In this work, we explore techniques that do not directly maximize the reward but rather generate high-reward images with relatively high probability -- a natural scenario for the framework of generative flow networks (GFlowNets). To this end, we propose the Diffusion Alignment with GFlowNet (DAG) algorithm to post-train diffusion models with black-box property functions. Extensive experiments on Stable Diffusion and various reward specifications corroborate that our method could effectively align large-scale text-to-image diffusion models with given reward information. Presents DAG, a novel GFlowNet-based algorithm, for post-training text-to-image diffusion models to optimize black-box reward functions, improving large-scale text-to-image alignment. Addresses the limitation of traditional diffusion models in controlling generation towards outputs with specific, desirable properties defined by reward functions, crucial in fields like drug discovery. Leverages GFlowNets to train generative models to produce objects with probability proportional to a reward function, proposing both a DB-based objective and a novel KL-based objective with REINFORCE gradient. DAG effectively incorporates reward characteristics into generated images, improving aesthetics, compressibility, and text-image alignment. Both DAG-DB and DAG-KL demonstrate significantly faster credit assignment than the DDPO baseline across various reward functions. Qualitative analysis showcases DAG's ability to gradually improve alignment over training, handling complex concepts and relationships better than the baseline. The current implementation uses single-step transitions due to GPU memory constraints, limiting exploration of more sophisticated GFlowNet objectives. Future work could explore using DAG for posterior approximate inference, treating the reward function as likelihood information. diffusion models, text-to-image synthesis, gflownets, reinforcement learning, generative ai
2406.00609 Report SuperGaussian: Repurposing Video Models for 3D Super Resolution Yuan Shen, Duygu Ceylan, Paul Guerrero, Zexiang Xu, Niloy J. Mitra, Shenlong Wang, Anna Frühstück We present a simple, modular, and generic method that upsamples coarse 3D models by adding geometric and appearance details. While generative 3D models now exist, they do not yet match the quality of their counterparts in image and video domains. We demonstrate that it is possible to directly repurpose existing (pretrained) video models for 3D super-resolution and thus sidestep the problem of the shortage of large repositories of high-quality 3D training models. We describe how to repurpose video upsampling models, which are not 3D consistent, and combine them with 3D consolidation to produce 3D-consistent results. As output, we produce high quality Gaussian Splat models, which are object centric and effective. Our method is category agnostic and can be easily incorporated into existing 3D workflows. We evaluate our proposed SuperGaussian on a variety of 3D inputs, which are diverse both in terms of complexity and representation (e.g., Gaussian Splats or NeRFs), and demonstrate that our simple method significantly improves the fidelity of the final 3D models. Check our project website for details: supergaussian.github.io Presents SuperGaussian, a simple and generic method that leverages pre-trained video upsampling models to perform 3D super-resolution, enhancing the resolution and detail of coarse 3D models in a category-agnostic manner. Current generative 3D models lag behind their image and video counterparts in quality due to limitations in 3D representation and the availability of large, high-quality 3D training datasets. This method overcomes these challenges by repurposing readily available video models. Renders a video of the coarse 3D input from multiple viewpoints, upsamples the video using a pre-trained video upsampler (optionally fine-tuned on 3D data), and reconstructs a 3D-consistent output in the form of Gaussian Splats. Demonstrates superior performance over image-based upsampling methods both qualitatively and quantitatively. Successfully upsamples diverse 3D inputs, including Gaussian Splats, NeRFs, low-poly meshes, and noisy 3D reconstructions. Shows improved performance after fine-tuning the video upsampler on a dataset of low-resolution Gaussian Splats. Limited by the generalization and inference speed of pre-trained video models. Unable to recover missing or occluded parts in the input 3D model, requiring sufficient viewpoint coverage. 3d super-resolution, video upsampling, category-agnostic, 3d generation, gaussian splatting
2406.00598 Report Efficient Neural Light Fields (ENeLF) for Mobile Devices Austin Peng Novel view synthesis (NVS) is a challenge in computer vision and graphics, focusing on generating realistic images of a scene from unobserved camera poses, given a limited set of authentic input images. Neural radiance fields (NeRF) achieved impressive results in rendering quality by utilizing volumetric rendering. However, NeRF and its variants are unsuitable for mobile devices due to the high computational cost of volumetric rendering. Emerging research in neural light fields (NeLF) eliminates the need for volumetric rendering by directly learning a mapping from ray representation to pixel color. NeLF has demonstrated its capability to achieve results similar to NeRF but requires a more extensive, computationally intensive network that is not mobile-friendly. Unlike existing works, this research builds upon the novel network architecture introduced by MobileR2L and aggressively applies a compression technique (channel-wise structure pruning) to produce a model that runs efficiently on mobile devices with lower latency and smaller sizes, with a slight decrease in performance. ENeLF compresses a neural light field (NeLF) network to enable real-time novel view synthesis on mobile devices, sacrificing minimal performance for efficiency. NeRF methods are computationally expensive, hindering mobile deployment. NeLFs, while faster, still present challenges in model size and latency. ENeLF addresses these limitations. ENeLF leverages MobileR2L's efficient CNN backbone and super-resolution modules, incorporating channel-wise structure pruning and reordering BN and CONV layers for compression. ENeLF achieves significant reductions in model parameters, FLOPs, and size compared to MobileR2L. It maintains competitive performance with slightly lower PSNR, SSIM, and LPIPS scores. Pruning enables faster inference speeds, suitable for mobile devices. Training time for ENeLF remains high due to data distillation. The pruned model exhibits some loss of detail rendering, particularly fine-grained features. novel view synthesis, neural light field, pruning, mobile devices, real-time rendering
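Channel-wise structured pruning of a CNN backbone, the core compression step in ENeLF, can be sketched with PyTorch's built-in pruning utilities; the pruning ratio and toy backbone below are assumptions, and note that `ln_structured` zeroes channels rather than physically shrinking the tensors, so a separate slimming pass would be needed to realize the size savings.

```python
# Illustrative channel-wise structured pruning of a conv backbone (ratio/backbone assumed).
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_conv_channels(model: nn.Module, amount: float = 0.3):
    """Apply L2 structured pruning over output channels (dim=0) of every Conv2d layer."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.ln_structured(module, name="weight", amount=amount, n=2, dim=0)
            prune.remove(module, "weight")   # bake the mask in: pruned channels stay zero
    return model

# Toy stand-in for a MobileR2L-style 1x1-conv backbone:
backbone = nn.Sequential(nn.Conv2d(3, 64, 1), nn.BatchNorm2d(64), nn.ReLU(),
                         nn.Conv2d(64, 64, 1), nn.BatchNorm2d(64), nn.ReLU())
prune_conv_channels(backbone, amount=0.3)
```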
2406.00508 Report FlowIE: Efficient Image Enhancement via Rectified Flow Yixuan Zhu, Wenliang Zhao, Ao Li, Yansong Tang, Jie Zhou, Jiwen Lu Image enhancement holds extensive applications in real-world scenarios due to complex environments and limitations of imaging devices. Conventional methods are often constrained by their tailored models, resulting in diminished robustness when confronted with challenging degradation conditions. In response, we propose FlowIE, a simple yet highly effective flow-based image enhancement framework that estimates straight-line paths from an elementary distribution to high-quality images. Unlike previous diffusion-based methods that suffer from long-time inference, FlowIE constructs a linear many-to-one transport mapping via conditioned rectified flow. The rectification straightens the trajectories of probability transfer, accelerating inference by an order of magnitude. This design enables our FlowIE to fully exploit rich knowledge in the pre-trained diffusion model, rendering it well-suited for various real-world applications. Moreover, we devise a faster inference algorithm, inspired by Lagrange's Mean Value Theorem, harnessing midpoint tangent direction to optimize path estimation, ultimately yielding visually superior results. Thanks to these designs, our FlowIE adeptly manages a diverse range of enhancement tasks within a concise sequence of fewer than 5 steps. Our contributions are rigorously validated through comprehensive experiments on synthetic and real-world datasets, unveiling the compelling efficacy and efficiency of our proposed FlowIE. Code is available at https://github.com/EternalEvan/FlowIE. This paper proposes FlowIE, a flow-based image enhancement framework that uses rectified flow to leverage the generative priors of pre-trained diffusion models for fast and high-quality image enhancement. Existing image enhancement methods, including predictive, GAN-based, and diffusion-based methods, struggle with either robustness, efficiency, or adaptability. FlowIE addresses these limitations by combining the strengths of pre-trained diffusion models with the efficiency of rectified flow. FlowIE employs a conditioned rectified flow model to learn a many-to-one mapping from a simple elementary distribution to clean images. It uses an initial stage model for coarse restoration, a ControlNet branch for guidance, and a mean value sampling technique for accurate path prediction. FlowIE achieves state-of-the-art results on blind face restoration, surpassing previous methods on FID and IDS while maintaining competitive scores on other metrics. FlowIE demonstrates superior performance on blind image super-resolution, achieving high MANIQA scores and exhibiting efficient inference speed comparable to one-step GAN-based methods. FlowIE shows strong generalization capabilities, effectively extending to tasks like face color enhancement and face inpainting with minimal fine-tuning. The performance of FlowIE may be compromised when dealing with images that have undergone extremely severe degradation. The inference speed could be further enhanced by exploring more efficient sampling strategies or alternative flow-based models. image enhancement, rectified flow, diffusion model, generative prior, mean value sampling
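FlowIE's few-step sampling with a midpoint ("mean value") velocity estimate can be approximated by a standard midpoint ODE integrator; the sketch below assumes a conditioned velocity network `velocity_model(x, t, cond)` and is an illustration of the general idea, not FlowIE's exact inference algorithm.

```python
# Hedged sketch of a few-step conditioned rectified-flow sampler with a midpoint step.
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_model, x0, cond, num_steps: int = 4):
    """Integrate dx/dt = v(x, t, cond) from t=0 (elementary distribution) to t=1 (clean image)."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v = velocity_model(x, t, cond)              # velocity at the interval start
        x_mid = x + 0.5 * dt * v                    # half step
        v_mid = velocity_model(x_mid, t + 0.5 * dt, cond)   # midpoint tangent direction
        x = x + dt * v_mid                          # full step using the midpoint velocity
    return x
```

Because rectified flow straightens the transport trajectories, a handful of such steps (fewer than five in the paper's setting) can already land near the clean-image endpoint.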
2406.00505 Report Improving Text Generation on Images with Synthetic Captions Jun Young Koh, Sang Hyun Park, Joy Song The recent emergence of latent diffusion models such as SDXL and SD 1.5 has shown significant capability in generating highly detailed and realistic images. Despite their remarkable ability to produce images, generating accurate text within images still remains a challenging task. In this paper, we examine the validity of fine-tuning approaches in generating legible text within the image. We propose a low-cost approach by leveraging SDXL without any time-consuming training on large-scale datasets. The proposed strategy employs a fine-tuning technique that examines the effects of data refinement levels and synthetic captions. Moreover, our results demonstrate how our small scale fine-tuning approach can improve the accuracy of text generation in different scenarios without the need of additional multimodal encoders. Our experiments show that with the addition of random letters to our raw dataset, our model's performance improves in producing well-formed visual text. This paper introduces a low-cost fine-tuning approach for SDXL to enhance the generation of legible text within images, leveraging synthetic captions and data refinement. Generating accurate text within images remains a challenge for text-to-image diffusion models, hindering their application in tasks demanding clear visual text. The study explores fine-tuning SDXL with varying ratios of original data, synthetic captions (random characters and detailed descriptions), and refined captions (manual and automatic). Data refinement level significantly impacts the model's ability to render accurate text. Adding synthetic data with random characters improves performance, especially with large datasets. Solely relying on synthetic data leads to performance degradation, highlighting the importance of real data. Over-reliance on synthetic data can lead to mode collapse, necessitating further investigation into diverse synthetic data. The model exhibits semantic leakage, struggling to disentangle text content from visual attributes, requiring exploration of dense captions and diverse datasets. synthetic data, diffusion models, text generation, image generation, multimodal learning
2406.00457 Report The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP Hidir Yesiltepe, Yusuf Dalva, Pinar Yanardag Diffusion models have become prominent in creating high-quality images. However, unlike GAN models celebrated for their ability to edit images in a disentangled manner, diffusion-based text-to-image models struggle to achieve the same level of precise attribute manipulation without compromising image coherence. In this paper, we show that CLIP, which is often used in popular text-to-image diffusion models such as Stable Diffusion, is capable of performing disentangled editing in a zero-shot manner. Through both qualitative and quantitative comparisons with state-of-the-art editing methods, we show that our approach yields competitive results. This insight may open opportunities for applying this method to various tasks, including image and video editing, providing a lightweight and efficient approach for disentangled editing. This paper reveals that CLIP, a popular model used in text-to-image diffusion models, can function as a zero-shot image editing tool via its EOS token. Diffusion models excel at generating high-quality images but struggle with disentangled editing (changing specific attributes without affecting others) unlike GANs. This work offers a simple, efficient approach for disentangled editing within diffusion models. The method leverages the EOS (end-of-sentence) token representation from CLIP's text encoder to modify the source text embedding, guiding the diffusion model to generate an image reflecting the desired attribute change. The EOS token method achieves comparable qualitative results to state-of-the-art editing methods like SEGA, Ledits++, and Cycle Diffusion. It demonstrates effectiveness in various editing tasks, including facial attribute changes, background replacement, and NSFW content moderation. A user study confirms its competitiveness in terms of edit quality and disentanglement capabilities. The method inherits CLIP's biases, which might lead to unintended attribute changes. Further exploration is needed to fully understand and exploit the potential of CLIP's EOS token for image editing. diffusion models, image editing, disentangled editing, clip, zero-shot learning
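A speculative sketch of EOS-token-based editing is given below: it blends the EOS-position hidden state of a source prompt toward that of a target prompt before the embedding is used to condition a diffusion model. The blending rule and the `alpha` parameter are assumptions for illustration, not the paper's exact procedure.

```python
# Speculative sketch: steer an edit by blending the source prompt's EOS-token embedding
# toward the target prompt's EOS-token embedding (alpha and blend rule are assumptions).
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def edited_text_embedding(source_prompt: str, target_prompt: str, alpha: float = 1.0):
    def encode(prompt):
        tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
        hidden = text_encoder(tokens.input_ids).last_hidden_state        # (1, 77, 768)
        eos_pos = (tokens.input_ids[0] == tokenizer.eos_token_id).nonzero()[0, 0]
        return hidden, eos_pos

    src, src_eos = encode(source_prompt)
    tgt, tgt_eos = encode(target_prompt)
    edited = src.clone()
    # Replace/blend only the EOS-position representation; all other token states stay fixed.
    edited[0, src_eos] = (1 - alpha) * src[0, src_eos] + alpha * tgt[0, tgt_eos]
    return edited  # pass as encoder_hidden_states to a Stable Diffusion UNet
```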
2406.00449 Report Dual Hyperspectral Mamba for Efficient Spectral Compressive Imaging Jiahua Dong, Hui Yin, Hongliu Li, Wenbo Li, Yulun Zhang, Salman Khan, Fahad Shahbaz Khan Deep unfolding methods have made impressive progress in restoring 3D hyperspectral images (HSIs) from 2D measurements through convolution neural networks or Transformers in spectral compressive imaging. However, they cannot efficiently capture long-range dependencies using global receptive fields, which significantly limits their performance in HSI reconstruction. Moreover, these methods may suffer from local context neglect if we directly utilize Mamba to unfold a 2D feature map as a 1D sequence for modeling global long-range dependencies. To address these challenges, we propose a novel Dual Hyperspectral Mamba (DHM) to explore both global long-range dependencies and local contexts for efficient HSI reconstruction. After learning informative parameters to estimate degradation patterns of the CASSI system, we use them to scale the linear projection and offer noise level for the denoiser (i.e., our proposed DHM). Specifically, our DHM consists of multiple dual hyperspectral S4 blocks (DHSBs) to restore original HSIs. Particularly, each DHSB contains a global hyperspectral S4 block (GHSB) to model long-range dependencies across the entire high-resolution HSIs using global receptive fields, and a local hyperspectral S4 block (LHSB) to address local context neglect by establishing structured state-space sequence (S4) models within local windows. Experiments verify the benefits of our DHM for HSI reconstruction. The source codes and models will be available at https://github.com/JiahuaDong/DHM. This paper presents Dual Hyperspectral Mamba (DHM), a novel deep unfolding method for reconstructing Hyperspectral Images (HSIs) from compressed measurements acquired by a Coded Aperture Snapshot Spectral Imaging (CASSI) system. Existing deep unfolding methods for HSI reconstruction struggle to efficiently capture long-range dependencies and often neglect local context, limiting their performance. DHM employs a multi-stage unfolding framework. It learns parameters to estimate degradation patterns of the CASSI system, which are used to scale linear projections and provide noise levels for the denoiser. The core of DHM is the Dual Hyperspectral S4 block (DHSB), consisting of a global hyperspectral S4 block (GHSB) to model long-range dependencies with global receptive fields and a local hyperspectral S4 block (LHSB) to address local context neglect by applying S4 models within local windows. DHM significantly outperforms state-of-the-art deep unfolding methods for HSI reconstruction in both quantitative and qualitative evaluations. The method effectively captures both global and local contexts, leading to improved restoration of fine details and reduced artifacts. DHM achieves superior performance while maintaining lower model complexity and computational cost compared to existing approaches. The paper assumes a specific degradation model of the CASSI system, which might limit its generalizability to other compressive imaging systems. Future work could explore extending DHM to handle different noise models and incorporate other priors for HSI reconstruction. hyperspectral image reconstruction, deep unfolding, coded aperture snapshot spectral imaging (cassi), structured state space sequence (s4) models, global and local context modeling
2406.00448 Report Bilateral Guided Radiance Field Processing Yuehao Wang, Chaoyi Wang, Bingchen Gong, Tianfan Xue Neural Radiance Fields (NeRF) achieves unprecedented performance in novel view synthesis, utilizing multi-view consistency. When capturing multiple inputs, image signal processing (ISP) in modern cameras will independently enhance them, including exposure adjustment, color correction, local tone mapping, etc. While these processing operations greatly improve image quality, they often break the multi-view consistency assumption, leading to "floaters" in the reconstructed radiance fields. To address this concern without compromising visual aesthetics, we aim to first disentangle the enhancement by ISP at the NeRF training stage and re-apply user-desired enhancements to the reconstructed radiance fields at the finishing stage. Furthermore, to make the re-applied enhancements consistent between novel views, we need to perform imaging signal processing in 3D space (i.e. "3D ISP"). For this goal, we adopt the bilateral grid, a locally-affine model, as a generalized representation of ISP processing. Specifically, we optimize per-view 3D bilateral grids with radiance fields to approximate the effects of camera pipelines for each input view. To achieve user-adjustable 3D finishing, we propose to learn a low-rank 4D bilateral grid from a given single-view edit, lifting photo enhancements to the whole 3D scene. We demonstrate our approach can boost the visual quality of novel view synthesis by effectively removing floaters and performing enhancements from user retouching. The source code and our data are available at: https://bilarfpro.github.io. This paper introduces a bilateral guided training and finishing approach for Neural Radiance Fields (NeRF) to address photometric inconsistencies and enable advanced editing. Modern camera image signal processing (ISP) introduces inconsistencies across multi-view images, causing artifacts in NeRF reconstructions. This work aims to disentangle and leverage ISP effects for improved quality and editing. The authors employ differentiable 3D bilateral grids to approximate per-view ISP enhancements during NeRF training. For finishing, a novel low-rank 4D bilateral grid lifts 2D view edits to the 3D scene. The method achieves state-of-the-art novel view synthesis quality on challenging scenes with significant photometric variation. It effectively removes floaters caused by inconsistent ISP processing across views. The 4D bilateral grid enables consistent and intuitive 3D scene retouching by lifting 2D editing operations. The approach struggles to handle transient objects like moving clouds. Lifting sophisticated local edits with high fidelity remains a challenge. neural radiance fields, novel view synthesis, image signal processing, bilateral grid, 3d scene editing
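To make the locally-affine bilateral-grid model concrete, the sketch below "slices" an HDRNet-style grid of 3x4 affine color transforms at each pixel using its (x, y, luminance) coordinate; the grid layout, resolution, and luminance coefficients are assumptions rather than the paper's exact parameterization.

```python
# Rough sketch of slicing a per-view bilateral grid: each pixel looks up a local 3x4
# affine color transform indexed by (x, y, luminance), then applies it to the rendered color.
import torch
import torch.nn.functional as F

def slice_bilateral_grid(grid, rgb):
    """grid: (B, 12, D, H, W) affine coefficients over (luminance, y, x) cells.
    rgb:  (B, 3, Hi, Wi) image rendered from the radiance field."""
    B, _, Hi, Wi = rgb.shape
    luma = (rgb * torch.tensor([0.299, 0.587, 0.114],
                               device=rgb.device).view(1, 3, 1, 1)).sum(1)     # (B, Hi, Wi)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, Hi, device=rgb.device),
                            torch.linspace(-1, 1, Wi, device=rgb.device), indexing="ij")
    coords = torch.stack([xs.expand(B, -1, -1), ys.expand(B, -1, -1), luma * 2 - 1], dim=-1)
    coeffs = F.grid_sample(grid, coords.unsqueeze(1), align_corners=True)      # (B, 12, 1, Hi, Wi)
    A = coeffs.squeeze(2).view(B, 3, 4, Hi, Wi)                                # per-pixel 3x4 affine
    rgb1 = torch.cat([rgb, torch.ones_like(rgb[:, :1])], dim=1)                # homogeneous RGB
    return torch.einsum("bijhw,bjhw->bihw", A, rgb1)
```

During training, one such grid per input view would be optimized jointly with the radiance field so that the field itself stays photometrically consistent while the grids absorb each camera's ISP enhancement.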
2406.00434 Report MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lv, Peng Wang, Wenping Wang, Junhui Hou In this paper, we propose MoDGS, a new pipeline to render novel-view images in dynamic scenes using only casually captured monocular videos. Previous monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid movement of input cameras to construct multiview consistency but fail to reconstruct dynamic scenes on casually captured input videos whose cameras are static or move slowly. To address this challenging task, MoDGS adopts recent single-view depth estimation methods to guide the learning of the dynamic scene. Then, a novel 3D-aware initialization method is proposed to learn a reasonable deformation field and a new robust depth loss is proposed to guide the learning of dynamic scene geometry. Comprehensive experiments demonstrate that MoDGS is able to render high-quality novel view images of dynamic scenes from just a casually captured monocular video, which outperforms baseline methods by a significant margin. MoDGS introduces a novel pipeline for rendering novel-view images of dynamic scenes from casually captured monocular videos, addressing the limitations of previous methods that rely on large camera movements. Existing monocular dynamic view synthesis methods struggle with casually captured videos where camera movement is limited, hindering accurate 3D scene reconstruction. MoDGS leverages single-view depth estimation for 3D guidance and introduces a 3D-aware initialization scheme for the deformation field. It further enhances depth supervision using a novel ordinal depth loss that accounts for scale inconsistencies across frames. MoDGS successfully synthesizes high-quality novel-view images from casually captured monocular videos, outperforming baseline methods. The 3D-aware initialization scheme significantly improves reconstruction quality compared to random initialization. The ordinal depth loss proves more robust than traditional depth losses, leading to smoother depth maps and sharper edge preservation. MoDGS struggles to reconstruct unseen regions, leading to artifacts in novel views. Training time remains comparable to existing DVS methods and heavily relies on single-view depth estimation accuracy. novel view synthesis, monocular video, dynamic scenes, gaussian splatting, depth estimation
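One plausible form of an ordinal depth loss is sketched below: it supervises only the relative ordering of randomly sampled pixel pairs, so a per-frame scale/shift ambiguity in the monocular depth prior does not matter. The pair count, margin, and hinge formulation are assumptions, not necessarily the loss used in MoDGS.

```python
# Loose sketch of an order-based ("ordinal") depth loss on sampled pixel pairs.
import torch

def ordinal_depth_loss(rendered_depth, prior_depth, num_pairs: int = 4096, margin: float = 1e-4):
    """Both depth maps are (H, W); the prior may have per-frame scale/shift ambiguity,
    so only the relative order of pixel pairs is supervised."""
    flat_r, flat_p = rendered_depth.flatten(), prior_depth.flatten()
    idx_a = torch.randint(0, flat_r.numel(), (num_pairs,), device=flat_r.device)
    idx_b = torch.randint(0, flat_r.numel(), (num_pairs,), device=flat_r.device)
    sign = torch.sign(flat_p[idx_a] - flat_p[idx_b])   # ordering dictated by the depth prior
    diff = flat_r[idx_a] - flat_r[idx_b]               # ordering of the rendered depth
    # Hinge penalty on pairs whose rendered order contradicts the prior's order.
    return torch.clamp(margin - sign * diff, min=0.0).mean()
```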
2406.00432 Report Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner Xing Cui, Peipei Li, Zekun Li, Xuannan Liu, Yueying Zou, Zhaofeng He Flexible and accurate drag-based editing is a challenging task that has recently garnered significant attention. Current methods typically model this problem as automatically learning ``how to drag'' through point dragging and often produce one deterministic estimation, which presents two key limitations: 1) Overlooking the inherently ill-posed nature of drag-based editing, where multiple results may correspond to a given input, as illustrated in Fig.1; 2) Ignoring the constraint of image quality, which may lead to unexpected distortion. To alleviate this, we propose LucidDrag, which shifts the focus from ``how to drag'' to a paradigm of ``what-then-how''. LucidDrag comprises an intention reasoner and a collaborative guidance sampling mechanism. The former infers several optimal editing strategies, identifying what content and what semantic direction to be edited. Based on the former, the latter addresses "how to drag" by collaboratively integrating existing editing guidance with the newly proposed semantic guidance and quality guidance. Specifically, semantic guidance is derived by establishing a semantic editing direction based on reasoned intentions, while quality guidance is achieved through classifier guidance using an image fidelity discriminator. Both qualitative and quantitative comparisons demonstrate the superiority of LucidDrag over previous methods. The code will be released. This paper introduces LucidDrag, a novel framework for drag-based image editing that shifts from a "how to drag" to a "what-then-how" paradigm. Existing drag-based editing methods often produce deterministic results and may neglect the semantic ambiguity of drag intentions and the preservation of image quality. LucidDrag addresses these limitations by first understanding the user's editing intention. LucidDrag consists of an intention reasoner and a collaborative guidance sampling mechanism. The intention reasoner, using LVLM and LLM, deduces possible editing intentions. The collaborative guidance sampling combines editing guidance with semantic and quality guidance based on the reasoned intentions, ensuring both accurate and high-quality editing. LucidDrag demonstrates superior semantic understanding and generates diverse editing results aligned with user intentions. Quantitative evaluations show that LucidDrag outperforms existing methods in both dragging accuracy and image quality. Ablation studies confirm the importance of the intention reasoner and the quality guidance for achieving high-quality and semantically accurate editing results. Dragging complex objects over long distances remains challenging due to limitations in object comprehension and tracking. Manually tuning hyperparameters can be sub-optimal and future work could explore LLM-based automatic hyperparameter determination. image editing, drag-based editing, diffusion models, large language models, semantic understanding
2406.00427 Report You Only Need Less Attention at Each Stage in Vision Transformers Shuoxi Zhang, Hanpeng Liu, Stephen Lin, Kun He The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules, which perform dot product computations among patchified image tokens. While self-attention modules empower ViTs to capture long-range dependencies, the computational complexity grows quadratically with the number of tokens, which is a major hindrance to the practical application of ViTs. Moreover, the self-attention mechanism in deep ViTs is also susceptible to the attention saturation issue. Accordingly, we argue against the necessity of computing the attention scores in every layer, and we propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage and calculates the subsequent feature alignments in other layers via attention transformations that leverage the previously calculated attention scores. This novel approach can mitigate two primary issues plaguing traditional self-attention modules: the heavy computational burden and attention saturation. Our proposed architecture offers superior efficiency and ease of implementation, merely requiring matrix multiplications that are highly optimized in contemporary deep learning frameworks. Moreover, our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation. This paper proposes LaViT, a novel Vision Transformer architecture that enhances efficiency by re-parameterizing attention scores from previous layers, thus mitigating computational burden and attention saturation Addressing the quadratic computational complexity and attention saturation issues in Vision Transformers is crucial for their practical application in computer vision tasks LaViT employs Less Attention layers that apply transformations to previously computed attention scores, uses residual connections for attention downsampling across stages, and introduces a Diagonality Preserving loss to maintain inter-token relationships in the transformed attention matrices LaViT achieves state-of-the-art performance on ImageNet-1K classification with reduced computational cost compared to existing ViT models It demonstrates superior object detection results on COCO2017, outperforming both CNN and Transformer counterparts LaViT also excels in semantic segmentation on ADE20K, surpassing Swin Transformer in terms of mIoU while being computationally more efficient The selection of the starting layer for Less Attention in deep ViTs needs careful consideration for optimal performance Further investigation into alternative transformation functions for attention re-parameterization could potentially yield additional benefits vision transformer, self-attention, computational efficiency, attention saturation, image classification, object detection, semantic segmentation
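A conceptual sketch of a "Less Attention" layer follows: it reuses the previous layer's attention logits through a small learned transform instead of recomputing the Q·K^T dot products; the head-mixing transform and tensor shapes are assumptions for illustration, not the exact LaViT design.

```python
# Conceptual "Less Attention" layer: re-parameterize cached attention logits instead of
# recomputing Q·K^T (the head-mixing linear map and dimensions are assumptions).
import torch
import torch.nn as nn

class LessAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.theta = nn.Linear(num_heads, num_heads)   # mixes cached attention maps across heads
        self.v_proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, prev_attn_logits):
        """x: (B, N, C); prev_attn_logits: (B, heads, N, N) from the last full attention layer."""
        B, N, C = x.shape
        logits = self.theta(prev_attn_logits.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        attn = logits.softmax(dim=-1)
        v = self.v_proj(x).view(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out(out), logits                   # logits can be reused by the next layer
```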
2406.00272 Report Temporally Consistent Object Editing in Videos using Extended Attention AmirHossein Zamani, Amir G. Aghdam, Tiberiu Popa, Eugene Belilovsky Image generation and editing have seen a great deal of advancements with the rise of large-scale diffusion models that allow user control of different modalities such as text, mask, depth maps, etc. However, controlled editing of videos still lags behind. Prior work in this area has focused on using 2D diffusion models to globally change the style of an existing video. On the other hand, in many practical applications, editing localized parts of the video is critical. In this work, we propose a method to edit videos using a pre-trained inpainting image diffusion model. We systematically redesign the forward path of the model by replacing the self-attention modules with an extended version of attention modules that creates frame-level dependencies. In this way, we ensure that the edited information will be consistent across all the video frames no matter what the shape and position of the masked area is. We qualitatively compare our results with state-of-the-art in terms of accuracy on several video editing tasks like object retargeting, object replacement, and object removal tasks. Simulations demonstrate the superior performance of the proposed strategy. This paper presents a new method for temporally consistent video editing using a pre-trained inpainting image diffusion model with mask and text guidance. Controlled editing of localized regions in videos while maintaining temporal consistency remains a challenge. Existing methods struggle with inconsistencies across frames, especially when masks change shape or position, and often require costly fine-tuning or training. The authors extend the self-attention mechanism in a pre-trained inpainting diffusion model to incorporate frame-level dependencies. This allows the model to consider information from multiple frames during the editing process, leading to temporally consistent results. The method achieves high-quality object replacement, a task not addressed by previous mask-guided approaches. It demonstrates competitive performance on object removal, matching the visual fidelity of state-of-the-art methods. The approach excels in consistent video object retargeting, surpassing existing techniques in visual quality and temporal coherence. While achieving competitive results on object removal, there is room for improvement to match state-of-the-art quantitative performance. Future work could explore generalizing the method to a wider range of video editing tasks beyond the ones explored in this paper. video editing, diffusion models, temporal consistency, inpainting, object retargeting
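The extended-attention idea described above can be illustrated by letting each frame's queries attend over keys and values concatenated across several frames, so inpainted content is shared across the clip; the tensor shapes and the choice of frame set in the sketch below are assumptions.

```python
# Rough sketch of "extended attention" inside a frozen diffusion U-Net layer:
# queries stay per-frame, while keys/values are gathered from all F frames.
import torch

def extended_attention(q, k, v):
    """q, k, v: (F, heads, N, d) per-frame projections for one attention layer."""
    F_, H, N, d = q.shape
    k_ext = k.permute(1, 0, 2, 3).reshape(H, F_ * N, d)                    # (H, F*N, d)
    v_ext = v.permute(1, 0, 2, 3).reshape(H, F_ * N, d)
    attn = torch.softmax(q @ k_ext.transpose(-1, -2) / d ** 0.5, dim=-1)   # (F, H, N, F*N)
    return attn @ v_ext                                                    # (F, H, N, d)
```

Replacing the per-frame self-attention of an inpainting diffusion model with a call of this form is what ties the masked regions of different frames together without any fine-tuning.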
2406.00259 Report PuzzleFusion++: Auto-agglomerative 3D Fracture Assembly by Denoise and Verify Zhengqing Wang, Jiacheng Chen, Yasutaka Furukawa This paper proposes a novel "auto-agglomerative" 3D fracture assembly method, PuzzleFusion++, resembling how humans solve challenging spatial puzzles. Starting from individual fragments, the approach 1) aligns and merges fragments into larger groups akin to agglomerative clustering and 2) repeats the process iteratively in completing the assembly akin to auto-regressive methods. Concretely, a diffusion model denoises the 6-DoF alignment parameters of the fragments simultaneously, and a transformer model verifies and merges pairwise alignments into larger ones, and this process repeats iteratively. Extensive experiments on the Breaking Bad dataset show that PuzzleFusion++ outperforms all other state-of-the-art techniques by significant margins across all metrics, in particular by over 10% in part accuracy and 50% in Chamfer distance. The code will be available on our project page: https://puzzlefusion-plusplus.github.io. Presents PuzzleFusion++, an auto-agglomerative 3D fracture assembly method that simulates human puzzle-solving by iteratively aligning and merging fragments into larger groups using a diffusion model and a transformer for verification. Addresses the challenging problem of 3D fracture assembly, with applications in archaeology, forensics, biochemistry, and more. Uses PointNet++ and VQ-VAE to encode fragments, a diffusion model to denoise 6-DoF alignment parameters, and a transformer to verify pairwise alignments and merge them. PuzzleFusion++ significantly outperforms six state-of-the-art methods across all metrics on the Breaking Bad dataset, including over 10% improvement in part accuracy and over 50% in Chamfer distance. The auto-agglomerative process is shown to be crucial for handling complex assemblies with many fragments. The method demonstrates robustness even with fewer sampling steps in the diffusion model. Limitations include challenges with local geometric ambiguity and small fracture surfaces leading to misaligned fragments. Future work will focus on improving inference speed and scaling to assemblies with up to 100 fragments. 3d fracture assembly, diffusion models, transformers, auto-agglomerative, point cloud processing
2406.00258 Report Artemis: Towards Referential Understanding in Complex Videos Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian Videos carry rich visual information including object description, action, interaction, etc., but existing multimodal large language models (MLLMs) fall short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language question with a bounding box in any video frame and describes the referred target in the entire video. The key to achieving this goal lies in extracting compact, target-specific video features, where we set a solid baseline by tracking and selecting spatiotemporal features from the video. We train Artemis on the newly established VideoRef45K dataset with 45K video-QA pairs and design a computationally efficient, three-stage training procedure. Results are promising both quantitatively and qualitatively. Additionally, we show that Artemis can be integrated with video grounding and text summarization tools to understand more complex scenarios. Code and data are available at https://github.com/qiujihao19/Artemis. Introduces Artemis, a multimodal large language model (MLLM) baseline for fine-level video understanding, specifically video-based referential understanding. Existing MLLMs fall short in referential understanding scenarios for videos, lacking the ability to comprehend and describe target actions in complex, longer videos. Utilizes a three-stage training approach: video-text pre-training, video-based instruction tuning, and video-based referring instruction tuning. Employs RoI tracking and selection to extract compact, target-specific video features, reducing redundancy and enhancing training efficiency. Outperforms existing MLLMs in video-based referring benchmarks, demonstrating superior comprehensiveness and accuracy in describing target actions. Serves as a building block for complex video understanding tasks, including multi-round dialogues with grounding and long video understanding with summarization. Achieves competitive performance in general video question answering tasks, highlighting the transferability of its fine-level understanding capabilities. Reliance on external tracking algorithms for RoI generation can introduce inaccuracies, impacting overall performance. Susceptibility to general video understanding challenges like spatial-temporal aliasing, which can lead to inaccurate descriptions of visual content. multimodal large language models, video understanding, referential understanding, video-based referring, roi tracking and selection
2406.00121 Report Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations Tiancheng Shen, Jun Hao Liew, Long Mai, Lu Qi, Jiashi Feng, Jiaya Jia Advances in text-based image generation and editing have revolutionized content creation, enabling users to create impressive content from imaginative text prompts. However, existing methods are not designed to work well with the oversimplified prompts that are often encountered in typical scenarios when users start their editing with only vague or abstract purposes in mind. Those scenarios demand elaborate ideation efforts from the users to bridge the gap between such vague starting points and the detailed creative ideas needed to depict the desired results. In this paper, we introduce the task of Image Editing Recommendation (IER). This task aims to automatically generate diverse creative editing instructions from an input image and a simple prompt representing the users' under-specified editing purpose. To this end, we introduce Creativity-Vision Language Assistant (Creativity-VLA), a multimodal framework designed specifically for edit-instruction generation. We train Creativity-VLA on our edit-instruction dataset specifically curated for IER. We further enhance our model with a novel 'token-for-localization' mechanism, enabling it to support both global and local editing operations. Our experimental results demonstrate the effectiveness of Creativity-VLA in suggesting instructions that not only contain engaging creative elements but also maintain high relevance to both the input image and the user's initial hint. This paper introduces Image Editing Recommendation (IER), a novel task to bridge the creativity gap in image editing by automatically generating diverse creative editing instructions from an input image and a simple user prompt. Existing image editing tools often require detailed instructions, making it challenging for users with vague ideas to achieve their desired results. This work aims to ease the ideation process and make image editing more accessible. The authors propose Creativity-VLA, a multimodal framework trained on a curated instruction dataset. This framework leverages a Vision Language Model (VLM) for visual understanding and creative reasoning, and employs a novel 'token-for-localization' mechanism to support both global and local image edits. Creativity-VLA outperforms existing image editing tools (MagicBrush, InstructDiffusion) and VLMs (LLaVA-v1.5, GPT-4V) in generating creative and relevant editing suggestions based on user study. The proposed method effectively bridges the gap between vague editing hints and concrete instructions, as demonstrated by improved CLIP similarity scores and qualitative comparisons. The 'token-for-localization' mechanism enables Creativity-VLA to suggest both global and local edits, broadening its applicability and allowing for more fine-grained control over image modifications. The current model sometimes struggles to balance image alignment with substantial modifications based on user feedback. Future work could explore incorporating user feedback during the editing process for iterative improvement and personalization. image editing, vision-language model, creativity, instruction generation, token-for-localization
2406.00093 Report Bootstrap3D: Improving 3D Content Creation with Synthetic Data Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D assets with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting inaccurate captions. Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and maintained view consistency. This paper proposes Bootstrap3D, a framework leveraging Multimodal Large Language Models (MLLMs) and diffusion models to generate high-quality synthetic data for training multi-view diffusion models, addressing the scarcity of high-quality 3D data with detailed captions. This is important because the lack of high-quality 3D data hinders the development of 3D content creation models, leading to lower quality and less diverse results compared to 2D models. The method consists of 1) a data generation pipeline using 2D/video diffusion models and a fine-tuned 3D-aware MV-LLaVA for data generation, filtering, and caption rewriting, and 2) a Training Timestep Reschedule (TTR) strategy to fine-tune multi-view diffusion models using both synthetic and real data. Bootstrap3D generates 1 million multi-view images with detailed captions, suitable for training multi-view diffusion models. The framework significantly improves text-to-3D generation quality, achieving better image-text alignment, higher visual fidelity, and improved view consistency. Quantitative evaluations show Bootstrap3D outperforms state-of-the-art methods on various metrics, including CLIP score, CLIP-R score, and FID. Current sparse view reconstruction models, mainly trained on limited datasets like Objaverse, may not fully utilize the potential of the generated data. Detecting subtle view inconsistencies remains challenging, potentially leading to blurred areas in the final 3D reconstructions. 3d content creation, multi-view diffusion models, synthetic data generation, multimodal large language models, data augmentation
2405.21075 Report Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing both short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 256 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io This paper introduces Video-MME, the first comprehensive multi-modal benchmark designed to evaluate Multi-modal Large Language Models (MLLMs) on video understanding tasks. Current MLLMs primarily focus on static image understanding. Evaluating MLLMs on video data is crucial for assessing their ability to handle the dynamic nature of real-world scenarios, paving the way for artificial general intelligence. The authors curated a dataset of 900 videos across diverse scenarios, annotated with 2,700 multiple-choice questions. These videos vary in duration (11 seconds to 1 hour) and are enriched with subtitles and audio tracks. They benchmarked state-of-the-art MLLMs, including GPT-4, Gemini 1.5 Pro, and open-source models, using accuracy as the evaluation metric. Gemini 1.5 Pro is the best-performing commercial model (75.7% accuracy), significantly outperforming open-source models. Integrating subtitles and audio significantly enhances video understanding, particularly for longer videos. MLLM performance declines as video duration increases, indicating limitations in processing long sequences and highlighting the need for architectural and training data improvements. 
The study primarily focuses on multiple-choice questions, potentially limiting the assessment of MLLMs' generative capabilities for video understanding. Future work includes developing more robust MLLM architectures for long context modeling and creating datasets with more complex temporal reasoning scenarios. multi-modal large language models, video understanding, benchmarking, temporal reasoning, multi-modal evaluation
2405.21074 Report Latent Intrinsics Emerge from Training to Relight Xiao Zhang, William Gao, Seemandhar Jain, Michael Maire, David A. Forsyth, Anand Bhattad Image relighting is the task of showing what a scene from a source image would look like if illuminated differently. Inverse graphics schemes recover an explicit representation of geometry and a set of chosen intrinsics, then relight with some form of renderer. However, error control for inverse graphics is difficult, and inverse graphics methods can represent only the effects of the chosen intrinsics. This paper describes a relighting method that is entirely data-driven, where intrinsics and lighting are each represented as latent variables. Our approach produces SOTA relightings of real scenes, as measured by standard metrics. We show that albedo can be recovered from our latent intrinsics without using any example albedos, and that the albedos recovered are competitive with SOTA methods. This paper introduces a novel data-driven image relighting method that learns latent representations of scene intrinsics and lighting conditions for relighting images of real scenes. Existing inverse graphics-based relighting methods face challenges in error control and are limited in representing complex lighting effects. This work explores a purely data-driven approach for accurate and generalizable relighting. The method utilizes an autoencoder framework with two encoders to extract intrinsic features from a target scene image and extrinsic features from a reference lighting image. A constrained scaling mechanism combines these features, restricting information flow from the reference image to prevent feature leakage. The decoder then generates the relit image. The method achieves state-of-the-art relighting accuracy on a real-world dataset, outperforming existing unsupervised methods and competing with supervised approaches. The learned latent intrinsic representation enables zero-shot albedo estimation, achieving competitive results with state-of-the-art albedo estimation methods without requiring explicit albedo training data. The method successfully generalizes to synthetically generated images with significant lighting variations, demonstrating its ability to infer high-level lighting concepts. The method currently relies on paired relighting data from the same scene, which can be resource-intensive to acquire. The latent representation of intrinsics poses challenges for applications requiring explicit intrinsic information like depth or normals. image relighting, intrinsic image decomposition, unsupervised learning, deep learning, computer vision
2405.21066 Report Mixed Diffusion for 3D Indoor Scene Synthesis Siyi Hu, Diego Martin Arroyo, Stephanie Debats, Fabian Manhardt, Luca Carlone, Federico Tombari Realistic conditional 3D scene synthesis significantly enhances and accelerates the creation of virtual environments, which can also provide extensive training data for computer vision and robotics research among other applications. Diffusion models have shown great performance in related applications, e.g., making precise arrangements of unordered sets. However, these models have not been fully explored in floor-conditioned scene synthesis problems. We present MiDiffusion, a novel mixed discrete-continuous diffusion model architecture, designed to synthesize plausible 3D indoor scenes from given room types, floor plans, and potentially pre-existing objects. We represent a scene layout by a 2D floor plan and a set of objects, each defined by its category, location, size, and orientation. Our approach uniquely implements structured corruption across the mixed discrete semantic and continuous geometric domains, resulting in a better conditioned problem for the reverse denoising step. We evaluate our approach on the 3D-FRONT dataset. Our experimental results demonstrate that MiDiffusion substantially outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis. In addition, our models can handle partial object constraints via a corruption-and-masking strategy without task specific training. We show MiDiffusion maintains clear advantages over existing approaches in scene completion and furniture arrangement experiments. MiDiffusion, a novel mixed discrete-continuous diffusion model for synthesizing plausible 3D indoor scenes from room types, floor plans, and potentially pre-existing objects. Realistic conditional 3D scene synthesis accelerates the creation of virtual environments and provides training data for computer vision and robotics. Combines Denoising Diffusion Probabilistic Models (DDPM) for continuous geometric attributes and Discrete Denoising Diffusion Probabilistic Models (D3PM) for discrete semantic labels. Employs a time-variant transformer-based denoising network conditioned on floor plan features. MiDiffusion outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis on the 3D-FRONT dataset. Generates more realistic scene layouts with accurate geometric arrangement and adherence to boundary constraints. Handles partial object constraints (e.g., scene completion) via a corruption-and-masking strategy without task-specific training. Current object representation as bounding box features and labels is not highly precise for 3D. Requires a model retrieving strategy to compose the final 3D scene. 3d scene synthesis, diffusion models, mixed discrete-continuous, floor plan conditioned, scene completion
2405.21059 Report Unified Directly Denoising for Both Variance Preserving and Variance Exploding Diffusion Models Jingjing Wang, Dan Zhang, Feng Luo Previous work has demonstrated that, in the Variance Preserving (VP) scenario, the nascent Directly Denoising Diffusion Models (DDDM) can generate high-quality images in one step while achieving even better performance in multistep sampling. However, the Pseudo-LPIPS loss used in DDDM leads to concerns about the bias in assessment. Here, we propose a unified DDDM (uDDDM) framework that generates images in one-step/multiple steps for both Variance Preserving (VP) and Variance Exploding (VE) cases. We provide theoretical proofs of the existence and uniqueness of the model's solution paths, as well as the non-intersecting property of the sampling paths. Additionally, we propose an adaptive Pseudo-Huber loss function to balance the convergence to the true solution and the stability of the convergence process. Through a comprehensive evaluation, we demonstrate that uDDDMs achieve FID scores comparable to the best-performing methods available for CIFAR-10 in both VP and VE. Specifically, uDDDM achieves one-step generation on CIFAR10 with FID of 2.63 and 2.53 for VE and VP respectively. By extending the sampling to 1000 steps, we further reduce FID score to 1.71 and 1.65 for VE and VP respectively, setting state-of-the-art performance in both cases. This paper introduces uDDDM, a unified Directly Denoising Diffusion Model framework that generates high-quality images in one or multiple steps for both Variance Preserving (VP) and Variance Exploding (VE) diffusion processes. The work addresses limitations of existing one-step generative models like Consistency Models and TRACT, aiming to improve efficiency and quality of image generation with diffusion models. The paper proposes a unified framework for VP and VE diffusion, introduces an adaptive Pseudo-Huber loss function for training, and provides theoretical proofs for properties like existence, uniqueness, and non-intersection of solution paths. uDDDMs achieve FID scores comparable to the best-performing methods for CIFAR-10 in both VP and VE. The model achieves one-step generation on CIFAR10 with FID of 2.63 (VE) and 2.53 (VP), outperforming StyleGAN2-ADA. Extending sampling to 1000 steps further reduces FID to 1.71 (VE) and 1.65 (VP), setting new state-of-the-art performance. Training uDDDM requires additional memory to store intermediate estimations, posing challenges for large datasets. The VE model consistently underperforms compared to the VP model, potentially due to suboptimal loss function hyperparameters and noise scheduling strategies. diffusion models, generative models, image generation, one-step generation, variance exploding/preserving sde
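The adaptive Pseudo-Huber loss above is only named, not specified; as a rough illustration, a standard Pseudo-Huber distance with a tunable constant c looks like the sketch below (the adaptive scheduling of c used in the paper is not reproduced, and all tensor shapes are illustrative).

```python
import torch

def pseudo_huber_loss(pred: torch.Tensor, target: torch.Tensor, c: float = 0.03) -> torch.Tensor:
    """Standard Pseudo-Huber distance: sqrt(||x - y||^2 + c^2) - c.

    Behaves like L2 for small residuals and like L1 for large ones; the
    constant c sets the transition point. Per-sample distances are averaged.
    """
    diff = pred - target
    sq = diff.pow(2).flatten(start_dim=1).sum(dim=1)  # squared error per sample
    return (torch.sqrt(sq + c * c) - c).mean()

# Hypothetical usage: compare a one-step denoised estimate against the current
# target estimate used by the directly-denoising objective.
loss = pseudo_huber_loss(torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32), c=0.03)
```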
2405.21050 Report Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models Xinxi Zhang, Song Wen, Ligong Han, Felix Juefei-Xu, Akash Srivastava, Junzhou Huang, Hao Wang, Molei Tao, Dimitris N. Metaxas Adapting large-scale pre-trained generative models in a parameter-efficient manner is gaining traction. Traditional methods like low rank adaptation achieve parameter efficiency by imposing constraints but may not be optimal for tasks requiring high representation capacity. We propose a novel spectrum-aware adaptation framework for generative models. Our method adjusts both singular values and their basis vectors of pretrained weights. Using the Kronecker product and efficient Stiefel optimizers, we achieve parameter-efficient adaptation of orthogonal matrices. We introduce Spectral Orthogonal Decomposition Adaptation (SODA), which balances computational efficiency and representation capacity. Extensive evaluations on text-to-image diffusion models demonstrate SODA's effectiveness, offering a spectrum-aware alternative to existing fine-tuning methods. This paper introduces SODA, a novel spectrum-aware adaptation framework for generative models, which improves parameter efficiency by leveraging the spectral space of pre-trained weights. Adapting large-scale pre-trained generative models like Stable Diffusion to specific tasks requires parameter-efficient fine-tuning methods that can capture complex data representations without extensive retraining. SODA adjusts both singular values and singular vectors during fine-tuning, employing a Kronecker product to rotate the singular vectors for parameter efficiency. It utilizes SVD or LQ/QR decomposition to decompose pre-trained weights and updates spectral and basis components separately. SODA outperforms baselines like LoRA and OFT in subject and style personalization tasks for text-to-image diffusion models. Jointly adjusting magnitude and orientation of decomposed weights improves utilization of model priors and reduces overfitting. Stiefel optimizer used in SODA exhibits robustness and achieves better performance compared to Cayley parameterization. SODA's training is slower than LoRA due to the Stiefel optimizer. Future work will focus on accelerating optimization algorithms and applying SODA to large language models. parameter-efficient fine-tuning, generative models, text-to-image diffusion, spectrum-aware adaptation, stiefel optimization
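SODA's full parameterization (Kronecker-factored rotations optimized on the Stiefel manifold) is more involved than a few lines, but the core spectrum-aware idea can be sketched as below: decompose a pretrained weight as W = U diag(s) V^T once and fine-tune only the singular values, with the basis rotation omitted for brevity; layer sizes are assumptions.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralFineTuneLinear(nn.Module):
    """Minimal sketch: fine-tune only the singular values of a frozen linear layer.

    The full SODA method also rotates the singular vectors with Kronecker-
    structured orthogonal factors via a Stiefel optimizer; that part is omitted.
    """
    def __init__(self, pretrained_weight: torch.Tensor, bias: Optional[torch.Tensor] = None):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U)    # frozen left singular vectors
        self.register_buffer("Vh", Vh)  # frozen right singular vectors
        self.log_s = nn.Parameter(S.clamp_min(1e-8).log())  # trainable spectrum
        self.bias = None if bias is None else nn.Parameter(bias.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.U @ torch.diag(self.log_s.exp()) @ self.Vh
        return F.linear(x, W, self.bias)

# Hypothetical usage on a 768x768 pretrained projection weight.
layer = SpectralFineTuneLinear(torch.randn(768, 768))
y = layer(torch.randn(4, 768))
```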
2405.21048 Report Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, serving as abstract and intermediary representations for guiding and facilitating the image generation process. In this paper, we explore a variety of discrete latent representations, including textual descriptions, detection bounding boxes, object blobs, and visual tokens. These representations diversify and enrich the input conditions to the diffusion models, enabling more diverse outputs. Our experimental results demonstrate that Kaleido effectively broadens the diversity of the generated image samples from a given textual description while maintaining high image quality. Furthermore, we show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process. Kaleido, a novel approach that enhances the diversity of samples generated by diffusion models from textual descriptions by incorporating autoregressive latent priors. Existing text-to-image diffusion models often lack diversity in their generated images, particularly when using high classifier-free guidance weights. This limits their practical applications where diverse visual interpretations are desired. Kaleido utilizes an autoregressive language model to generate latent variables from the original caption. These variables (textual descriptions, bounding boxes, object blobs, or visual tokens) act as abstract representations to guide the diffusion model's image generation process. Kaleido effectively broadens the diversity of generated image samples from a given textual description. Kaleido maintains high image quality comparable to standard diffusion models. The generated latent variables offer interpretability and control over the image generation process. Training Kaleido can be more complex and resource-intensive than standard diffusion models. Identifying the most effective latent variables for optimal diversity might require extensive experimentation. diffusion models, image generation, text-to-image synthesis, diversity, autoregressive models
2405.21013 Report StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond Pengyuan Lyu, Yulin Li, Hao Zhou, Weihong Ma, Xingyu Wan, Qunyi Xie, Liang Wu, Chengquan Zhang, Kun Yao, Errui Ding, Jingdong Wang Text-rich images have significant and extensive value, deeply integrated into various aspects of human life. Notably, both visual cues and linguistic symbols in text-rich images play crucial roles in information transmission but are accompanied by diverse challenges. Therefore, the efficient and effective understanding of text-rich images is a crucial litmus test for the capability of Vision-Language Models. We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images. The significant design of StrucTexTv3 is presented in the following aspects: Firstly, we adopt a combination of an effective multi-scale reduced visual transformer and a multi-granularity token sampler (MG-Sampler) as a visual token generator, successfully solving the challenges of high-resolution input and complex representation learning for text-rich images. Secondly, we enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning, seamlessly integrating various text-oriented tasks into a unified framework. Thirdly, we have curated a comprehensive collection of high-quality text-rich images, abbreviated as TIM-30M, encompassing diverse scenarios like incidental scenes, office documents, web pages, and screenshots, thereby improving the robustness of our model. Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks. Among multimodal models with LLM decoder of approximately 1.8B parameters, it stands out as a leader, which also makes the deployment of edge devices feasible. In summary, the StrucTexTv3 model, featuring efficient structural design, outstanding performance, and broad adaptability, offers robust support for diverse intelligent application tasks involving text-rich images, thus exhibiting immense potential for widespread application. This paper introduces StrucTexTv3, an efficient vision-language model designed for perception and comprehension tasks on text-rich images, addressing challenges of high-resolution inputs and complex representation learning. Efficiently understanding text-rich images, crucial for information transmission in many aspects of human life, is a significant test for Vision-Language Models. Current methods struggle with high-resolution input and require large resources. StrucTexTv3 leverages a hierarchical vision transformer, a multi-granularity token sampler (MG-Sampler), and a 1.8B parameter LLM. It's trained with TIM-30M, a 30 million text-rich image dataset, using a three-stage training pipeline: pre-training, multi-task pre-training, and supervised fine-tuning. StrucTexTv3 achieves state-of-the-art performance on various benchmarks, including text spotting, document parsing, and key information extraction. It demonstrates competitive results in document-oriented VQA, table image understanding, and text image translation, outperforming models with significantly larger LLM sizes. The model's efficiency allows for potential deployment on edge devices. Limited context handling for multi-page documents and videos. Further research needed on scaling laws for larger datasets and models. 
vision-language model, text-rich images, high-resolution input, multimodal learning, instruction learning
2405.20985 Report DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou The visual projector, which bridges the vision and language modalities and facilitates cross-modal alignment, serves as a crucial component in MLLMs. However, measuring the effectiveness of projectors in vision-language alignment remains under-explored, which currently can only be inferred from the performance of MLLMs on downstream tasks. Motivated by the problem, this study examines the projector module by interpreting the vision-language semantic flow within MLLMs. Specifically, we trace back the semantic relevance flow from generated language tokens to raw visual encoder patches and the intermediate outputs produced by projectors. Our findings reveal that compressive projectors (e.g., QFormer), abstract visual patches into a limited set of semantic concepts, such as objects or attributes, resulting in a 'double abstraction' phenomenon. This involves a first visual semantic abstraction by the projector referring to pre-defined query tokens, and a second extraction by the LLM based on text instructions. The double abstraction is inefficient in training and will result in cumulative vision semantics deficiency. To mitigate this issue, we propose the key insight of 'Decouple Compression from Abstraction (DeCo), that is compressing the visual token number at the patch level by projectors and allowing the LLM to handle visual semantic abstraction entirely. Consequently, we adopt a simple compressor, i.e., 2D Adaptive Pooling, to downsample visual patches in a parameter-free manner. Empirical evaluation demonstrates that DeCo surpasses traditional compressive projectors regarding both performance and efficiency. It achieves performance gains of 0.9%, 7.1%, and 2.9% across the MLLM Benchmarks, Visual Localization, and Open-ended VQA tasks with fewer trainable parameters and faster convergence speed. This paper proposes DeCo, a novel method for Multimodal Large Language Models (MLLMs) that decouples compression from visual semantic abstraction, improving efficiency and spatial understanding. Existing MLLM visual projectors suffer from a "double abstraction" problem, where visual semantics are redundantly extracted by both the projector and the LLM, leading to inefficiencies and semantic loss. The paper introduces R-GAE, a new explainability tool to analyze vision-language semantic flow in MLLMs. It then proposes DeCo, which utilizes a simple Adaptive Average Pooling to compress visual tokens at the patch level, leaving semantic abstraction to the LLM. DeCo outperforms existing compressive projectors on various MLLM benchmarks, visual localization, and open-ended VQA tasks. DeCo demonstrates faster training convergence compared to other compressive projectors due to its parameter-free compression mechanism. DeCo exhibits superior spatial understanding capabilities and robustness across different vision backbones, image resolutions, and LLMs. High compression ratios in DeCo might lead to substantial visual information loss compared to semantic-level compression. The advantages of DeCo are more pronounced under limited training resources (GPUs and data), and its significance might diminish with abundant resources. multimodal large language models, vision-language alignment, projector module, semantic abstraction, explainability
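DeCo's compressor is described as a parameter-free 2D adaptive average pooling over the visual patch grid; a minimal sketch follows, where the 24x24 input grid (e.g., a CLIP ViT at 336px) and the 12x12 output grid are assumed values.

```python
import torch
import torch.nn.functional as F

def pool_visual_tokens(tokens: torch.Tensor, out_grid: int = 12) -> torch.Tensor:
    """Downsample ViT patch tokens with parameter-free 2D adaptive average pooling.

    tokens: (batch, num_patches, dim), with num_patches forming a square grid,
    e.g. 24*24 = 576 patches (assumed). Returns (batch, out_grid**2, dim).
    """
    b, n, d = tokens.shape
    grid = int(n ** 0.5)
    assert grid * grid == n, "expects a square patch grid"
    x = tokens.transpose(1, 2).reshape(b, d, grid, grid)  # (b, d, H, W)
    x = F.adaptive_avg_pool2d(x, output_size=out_grid)    # no learned parameters
    return x.flatten(2).transpose(1, 2)                   # (b, out_grid**2, d)

compressed = pool_visual_tokens(torch.randn(2, 576, 1024), out_grid=12)
```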
2405.20971 Report Amortizing intractable inference in diffusion models for vision, language, and control Siddarth Venkatraman, Moksh Jain, Luca Scimeca, Minsu Kim, Marcin Sendera, Mohsin Hasan, Luke Rowe, Sarthak Mittal, Pablo Lemos, Emmanuel Bengio, Alexandre Adam, Jarrid Rector-Brooks, Yoshua Bengio, Glen Berseth, Nikolay Malkin Diffusion models have emerged as effective distribution estimators in vision, language, and reinforcement learning, but their use as priors in downstream tasks poses an intractable posterior inference problem. This paper studies amortized sampling of the posterior over data, $\mathbf{x}\sim p^{\rm post}(\mathbf{x})\propto p(\mathbf{x})r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or likelihood function $r(\mathbf{x})$. We state and prove the asymptotic correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior, a problem that existing methods solve only approximately or in restricted cases. Relative trajectory balance arises from the generative flow network perspective on diffusion models, which allows the use of deep reinforcement learning techniques to improve mode coverage. Experiments illustrate the broad potential of unbiased inference of arbitrary posteriors under diffusion priors: in vision (classifier guidance), language (infilling under a discrete diffusion LLM), and multimodal data (text-to-image generation). Beyond generative modeling, we apply relative trajectory balance to the problem of continuous control with a score-based behavior prior, achieving state-of-the-art results on benchmarks in offline reinforcement learning. The paper proposes Relative Trajectory Balance (RTB), an asymptotically unbiased training objective for training diffusion models to sample from posterior distributions under a diffusion model prior. Sampling from posteriors under diffusion priors is crucial in many downstream tasks across vision, language, and reinforcement learning, but existing methods are often approximate or limited in scope. RTB leverages the generative flow network perspective on diffusion models and enforces a constraint on the ratio of denoising trajectories under the prior and posterior. The objective can be optimized off-policy, allowing flexible exploration of the posterior. RTB achieves competitive classifier-guided image generation with unconditional diffusion priors and improves text-to-image generation under foundation model priors. RTB shows strong results for text infilling with discrete diffusion language models. RTB obtains state-of-the-art performance on continuous control benchmarks in offline reinforcement learning. RTB relies on simulation-based training, which can be computationally intensive. The lack of local credit assignment in the RTB objective can lead to high variance gradients. diffusion models, posterior sampling, generative flow networks, classifier guidance, offline reinforcement learning
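As a rough, simplified illustration of a trajectory-level balance objective of this kind (not the paper's exact formulation), one can penalize the squared mismatch between the summed log-ratio of posterior and prior transition densities, a learned log-normalizer, and the log-reward of the terminal sample; tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

log_Z = nn.Parameter(torch.zeros(()))  # learned log partition function

def relative_balance_loss(logp_post_steps: torch.Tensor,
                          logp_prior_steps: torch.Tensor,
                          log_reward: torch.Tensor) -> torch.Tensor:
    """Sketch of a relative trajectory-balance style objective.

    logp_post_steps / logp_prior_steps: (batch, num_steps) per-step log
    transition densities of the trained posterior sampler and the frozen
    prior diffusion model along the same denoising trajectories.
    log_reward: (batch,) values of log r(x_0) at the terminal samples.
    The residual drives p_post(traj) / p_prior(traj) toward r(x_0) / Z.
    """
    log_ratio = (logp_post_steps - logp_prior_steps).sum(dim=1)
    residual = log_Z + log_ratio - log_reward
    return residual.pow(2).mean()

loss = relative_balance_loss(torch.randn(4, 50), torch.randn(4, 50), torch.randn(4))
```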
2405.20853 Report MeshXL: Neural Coordinate Field for Generative 3D Foundation Models Sijin Chen, Xin Chen, Anqi Pang, Xianfang Zeng, Wei Cheng, Yijun Fu, Fukun Yin, Yanru Wang, Zhibin Wang, Chi Zhang, Jingyi Yu, Gang Yu, Bin Fu, Tao Chen The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed, and storage efficiency, which is widely preferred in various applications. However, given its unstructured graph representation, the direct generation of high-fidelity 3D meshes is challenging. Fortunately, with a pre-defined ordering strategy, 3D meshes can be represented as sequences, and the generation process can be seamlessly treated as an auto-regressive problem. In this paper, we validate the Neural Coordinate Field (NeurCF), an explicit coordinate representation with implicit neural embeddings, is a simple-yet-effective representation for large-scale sequential mesh modeling. After that, we present MeshXL, a family of generative pre-trained auto-regressive models, which addresses the process of 3D mesh generation with modern large language model approaches. Extensive experiments show that MeshXL is able to generate high-quality 3D meshes, and can also serve as foundation models for various down-stream applications. Introduces MeshXL, a family of auto-regressive transformer models for direct generation of high-fidelity 3D meshes using a novel Neural Coordinate Field (NeurCF) representation. Addresses challenges in generating high-quality 3D meshes due to their unstructured graph representation and the need for accurate spatial and connectivity estimation. Utilizes NeurCF, an explicit coordinate representation with implicit neural embeddings, and trains MeshXL models with a pre-defined ordering strategy for auto-regressive generation. Pre-trains models on a large dataset of 2.5M meshes from ShapeNet, 3D-FUTURE, Objaverse, and Objaverse-XL. MeshXL outperforms prior arts in generating high-quality and diverse 3D meshes, as evidenced by quantitative metrics (COV, MMD, 1-NNA, JSD, FID, KID) on ShapeNet benchmark. Demonstrates effectiveness in downstream tasks like shape completion and conditional mesh generation from images or text. Shows improved performance with increasing model size and benefits from large-scale pre-training. Inference time is a limitation due to the auto-regressive process. Future work can explore faster RNN-based methods or multi-token prediction to reduce inference cost. 3d mesh generation, neural coordinate field, auto-regressive models, generative pre-training, transformer
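The pre-defined ordering plus coordinate discretization is what turns a mesh into an auto-regressive token sequence; a minimal serialization sketch is shown below, where the bin count and the face sort order are assumed conventions and the paper's learned neural coordinate embeddings are not reproduced.

```python
import numpy as np

def serialize_mesh(vertices: np.ndarray, faces: np.ndarray, n_bins: int = 128) -> np.ndarray:
    """Flatten a triangle mesh into a 1D sequence of quantized coordinate tokens.

    vertices: (V, 3) floats in [-1, 1]; faces: (F, 3) vertex indices.
    Each face contributes 9 tokens (x, y, z of its three vertices), so the
    sequence can be modeled auto-regressively like text.
    """
    quant = np.clip(((vertices + 1.0) / 2.0 * (n_bins - 1)).round(), 0, n_bins - 1).astype(np.int64)
    # Deterministic face ordering by quantized coordinates (one assumed convention).
    keys = quant[faces].reshape(len(faces), -1)
    order = np.lexsort(keys.T[::-1])
    return quant[faces[order]].reshape(-1)  # (F * 9,) token ids in [0, n_bins)

tokens = serialize_mesh(np.random.uniform(-1, 1, (10, 3)), np.random.randint(0, 10, (5, 3)))
```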
2405.20791 Report GS-Phong: Meta-Learned 3D Gaussians for Relightable Novel View Synthesis Yumeng He, Yunbo Wang, Xiaokang Yang Decoupling the illumination in 3D scenes is crucial for novel view synthesis and relighting. In this paper, we propose a novel method for representing a scene illuminated by a point light using a set of relightable 3D Gaussian points. Inspired by the Blinn-Phong model, our approach decomposes the scene into ambient, diffuse, and specular components, enabling the synthesis of realistic lighting effects. To facilitate the decomposition of geometric information independent of lighting conditions, we introduce a novel bilevel optimization-based meta-learning framework. The fundamental idea is to view the rendering tasks under various lighting positions as a multi-task learning problem, which our meta-learning approach effectively addresses by generalizing the learned Gaussian geometries not only across different viewpoints but also across diverse light positions. Experimental results demonstrate the effectiveness of our approach in terms of training efficiency and rendering quality compared to existing methods for free-viewpoint relighting. This paper introduces Phong-Inspired Gaussian Illumination Decomposition (Phong-GID), a novel method for representing and relighting 3D scenes illuminated by a point light using a set of relightable 3D Gaussian points. Decoupling illumination in 3D scenes is crucial for applications like novel view synthesis and relighting, especially under challenging One Light At a Time (OLAT) settings. The method decomposes the scene into ambient, diffuse, and specular components using the Blinn-Phong model and employs a bilevel optimization-based meta-learning framework to learn light-independent geometric information. Phong-GID demonstrates superior performance in novel view synthesis and relighting compared to existing 3D Gaussian Splatting-based methods on both synthetic and real-world OLAT datasets. The proposed meta-learning framework effectively learns uniform Gaussian geometries that generalize across diverse viewpoints and light positions. Ablation studies confirm the effectiveness of the decomposed rendering pipeline, geometry optimization via meta-learning, and introduced geometry and color priors. The model's robustness in handling extreme lighting conditions or highly complex scenes requires further investigation. Future work will focus on extending the model to handle more challenging lighting scenarios and complex scene geometries. 3d relighting, 3d gaussian splatting, novel view synthesis, meta-learning, blinn-phong model
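The Blinn-Phong decomposition referenced above is the classic ambient + diffuse + specular model; a reference sketch for per-point shading under a single point light is given below, with the shininess and reflection coefficients chosen purely for illustration.

```python
import torch
import torch.nn.functional as F

def blinn_phong(normals, view_dirs, light_dirs, albedo,
                ambient=0.1, k_d=1.0, k_s=0.5, shininess=32.0):
    """Classic Blinn-Phong shading: ambient + diffuse + specular terms.

    Direction tensors are (N, 3) and normalized below; albedo is (N, 3) base color.
    """
    n = F.normalize(normals, dim=-1)
    v = F.normalize(view_dirs, dim=-1)
    l = F.normalize(light_dirs, dim=-1)
    h = F.normalize(l + v, dim=-1)  # half vector
    diffuse = k_d * albedo * (n * l).sum(-1, keepdim=True).clamp(min=0.0)
    specular = k_s * (n * h).sum(-1, keepdim=True).clamp(min=0.0).pow(shininess)
    return ambient * albedo + diffuse + specular

color = blinn_phong(torch.randn(100, 3), torch.randn(100, 3),
                    torch.randn(100, 3), torch.rand(100, 3))
```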
2405.20750 Report Diffusion Models Are Innate One-Step Generators Bowen Zheng, Tianming Yang Diffusion Models (DMs) have achieved great success in image generation and other fields. By fine sampling through the trajectory defined by the SDE/ODE solver based on a well-trained score model, DMs can generate remarkably high-quality results. However, this precise sampling often requires multiple steps and is computationally demanding. To address this problem, instance-based distillation methods have been proposed to distill a one-step generator from a DM by having a simpler student model mimic a more complex teacher model. Yet, our research reveals an inherent limitation in these methods: the teacher model, with more steps and more parameters, occupies different local minima compared to the student model, leading to suboptimal performance when the student model attempts to replicate the teacher. To avoid this problem, we introduce a novel distributional distillation method, which uses an exclusive distributional loss. This method exceeds state-of-the-art (SOTA) results while requiring significantly fewer training images. Additionally, we show that DMs' layers are activated differently at different time steps, leading to an inherent capability to generate images in a single step. Freezing most of the convolutional layers in a DM during distributional distillation leads to further performance improvements. Our method achieves the SOTA results on CIFAR-10 (FID 1.54), AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and ImageNet 64x64 (FID 1.16) with great efficiency. Most of those results are obtained with only 5 million training images within 6 hours on 8 A100 GPUs. This breakthrough not only enhances the understanding of efficient image generation models but also offers a scalable framework for advancing the state of the art in various applications. This paper introduces GDD, a novel distributional distillation method for training one-step image generators from pre-trained diffusion models, using only a distributional loss (GAN loss) without instance-level supervision. Diffusion models (DMs) excel in image generation but suffer from high computational cost due to multi-step sampling. Existing distillation methods are either computationally expensive or yield suboptimal performance. The authors first analyze limitations of instance-based distillation methods, attributing them to different local minima between teacher and student models. They then propose GDD, which uses solely a GAN loss for training a one-step generator, initialized from a pre-trained DM, against real data. GDD surpasses state-of-the-art (SOTA) results on CIFAR-10, AFHQv2 64x64, FFHQ 64x64, and ImageNet 64x64 with fewer training images. Analysis reveals differential activation of DM layers across time steps, suggesting innate one-step generation capability. GDD-I, a variant freezing most convolutional layers during distillation, further improves performance, supporting the innate capability hypothesis. Experiments are mainly conducted on low-resolution datasets, and performance on high-resolution datasets needs further investigation. While the study shows differential layer activation, the specific roles of these layers in multi-step vs. one-step generation remain to be explored. diffusion models, image generation, model distillation, generative adversarial networks (gans), one-step generation
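A minimal sketch of the distributional-only training step described above: a single non-saturating GAN loss with no instance-level teacher targets, where the generator is assumed to be initialized from the pretrained diffusion network and to map noise to an image in one forward pass (the layer-freezing schedule of GDD-I is omitted).

```python
import torch
import torch.nn.functional as F

def distillation_step(generator, discriminator, g_opt, d_opt, real_images, z_dim=100):
    """One distributional-distillation step: a GAN loss only, no instance targets.

    `generator` is assumed to be a one-step network initialized from the
    pretrained diffusion model's weights; `discriminator` returns real/fake logits.
    """
    noise = torch.randn(real_images.size(0), z_dim, device=real_images.device)

    # Discriminator update: real data vs. one-step generated samples.
    fake = generator(noise).detach()
    d_loss = (F.softplus(-discriminator(real_images)).mean()
              + F.softplus(discriminator(fake)).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: non-saturating logistic loss, no pixel-level teacher term.
    g_loss = F.softplus(-discriminator(generator(noise))).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```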
2405.20721 Report ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model Yufei Wang, Zhihao Li, Lanqing Guo, Wenhan Yang, Alex C. Kot, Bihan Wen Recently, 3D Gaussian Splatting (3DGS) has become a promising framework for novel view synthesis, offering fast rendering speeds and high fidelity. However, the large number of Gaussians and their associated attributes require effective compression techniques. Existing methods primarily compress neural Gaussians individually and independently, i.e., coding all the neural Gaussians at the same time, with little design for their interactions and spatial dependence. Inspired by the effectiveness of the context model in image compression, we propose the first autoregressive model at the anchor level for 3DGS compression in this work. We divide anchors into different levels and the anchors that are not coded yet can be predicted based on the already coded ones in all the coarser levels, leading to more accurate modeling and higher coding efficiency. To further improve the efficiency of entropy coding, e.g., to code the coarsest level with no already coded anchors, we propose to introduce a low-dimensional quantized feature as the hyperprior for each anchor, which can be effectively compressed. Our work pioneers the context model in the anchor level for 3DGS representation, yielding an impressive size reduction of over 100 times compared to vanilla 3DGS and 15 times compared to the most recent state-of-the-art work Scaffold-GS, while achieving comparable or even higher rendering quality. This paper proposes ContextGS, a novel autoregressive model for compressing 3D Gaussian Splatting (3DGS) representations by leveraging spatial dependencies among anchor points. 3DGS enables fast, high-fidelity novel view synthesis but suffers from large storage requirements, necessitating efficient compression techniques. ContextGS divides anchors into hierarchical levels, using decoded anchors from coarser levels to predict the distribution of anchors at finer levels. Additionally, it employs a quantized hyperprior feature as an additional prior for each anchor to enhance entropy coding efficiency. Achieves an average compression ratio of 15x compared to Scaffold-GS and 100x compared to standard 3DGS. Maintains comparable or even higher rendering quality compared to previous methods. Demonstrates the effectiveness of anchor-level context modeling and hyperprior features in reducing spatial redundancy. Entropy coding process introduces additional computational costs during training and decompression. Further exploration of anchor position compression is needed for optimal performance. 3d gaussian splatting, 3dgs compression, context modeling, autoregressive models, novel view synthesis
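The anchor-level context model amounts to predicting entropy parameters for not-yet-coded anchors from already-decoded coarser levels; the sketch below estimates the bit cost under a factorized Gaussian over unit-width quantization bins, with network shapes assumed and the quantization/arithmetic-coding details omitted.

```python
import torch
import torch.nn as nn

class AnchorContextModel(nn.Module):
    """Predict entropy parameters of finer-level anchors from coarser-level context."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, 2 * dim))

    def estimated_bits(self, coarse_ctx: torch.Tensor, fine_feat: torch.Tensor) -> torch.Tensor:
        """coarse_ctx, fine_feat: (N, dim). Returns an estimated total bit cost."""
        mu, log_scale = self.net(coarse_ctx).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_scale.exp().clamp_min(1e-6))
        # Probability mass over a unit-width quantization bin around each value.
        p = dist.cdf(fine_feat + 0.5) - dist.cdf(fine_feat - 0.5)
        return -(p.clamp_min(1e-9).log2()).sum()

model = AnchorContextModel(dim=32)
bits = model.estimated_bits(torch.randn(1000, 32), torch.randn(1000, 32))
```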
2405.20674 Report 4Diffusion: Multi-view Video Diffusion Model for 4D Generation Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao Current 4D generation methods have achieved noteworthy efficacy with the aid of advanced diffusion generative models. However, these methods lack multi-view spatial-temporal modeling and encounter challenges in integrating diverse prior knowledge from multiple diffusion models, resulting in inconsistent temporal appearance and flickers. In this paper, we propose a novel 4D generation pipeline, namely 4Diffusion aimed at generating spatial-temporally consistent 4D content from a monocular video. We first design a unified diffusion model tailored for multi-view video generation by incorporating a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations. After training on a curated dataset, our diffusion model acquires reasonable temporal consistency and inherently preserves the generalizability and spatial consistency of the 3D-aware diffusion model. Subsequently, we propose 4D-aware Score Distillation Sampling loss, which is based on our multi-view video diffusion model, to optimize 4D representation parameterized by dynamic NeRF. This aims to eliminate discrepancies arising from multiple diffusion models, allowing for generating spatial-temporally consistent 4D content. Moreover, we devise an anchor loss to enhance the appearance details and facilitate the learning of dynamic NeRF. Extensive qualitative and quantitative experiments demonstrate that our method achieves superior performance compared to previous methods. This paper proposes 4Diffusion, a novel pipeline for generating 4D content from monocular videos, featuring a unified diffusion model called 4DM for multi-view spatial-temporal consistency. Generating high-quality 4D content with spatial-temporal consistency is challenging due to the limitations of integrating knowledge from multiple diffusion models in previous approaches, leading to artifacts like inconsistent appearance and flickers. The authors design 4DM by incorporating a learnable motion module into a frozen 3D-aware diffusion model. They then leverage 4DM to optimize dynamic NeRF using a 4D-aware SDS loss and an anchor loss for enhanced appearance details. 4Diffusion generates 4D content with superior spatial-temporal consistency and motion coherence compared to baseline methods. The proposed 4DM effectively captures multi-view spatial-temporal correlations even when trained on a small, curated dataset. Quantitative evaluations using CLIP-I, CLIP-C, FVD, and LPIPS demonstrate the superiority of 4Diffusion over existing techniques. The quality of the multi-view video diffusion model is limited by the base model's capability and the scale of the high-quality training data. The reliance on volumetric rendering in the 4D generation pipeline leads to slow training speeds, demanding exploration of faster 3D and GS techniques. 4d content generation, diffusion models, multi-view video generation, dynamic nerf, spatial-temporal consistency
2405.20669 Report Fourier123: One Image to High-Quality 3D Object Generation with Hybrid Fourier Score Distillation Shuzhou Yang, Yu Wang, Haijie Li, Jiarui Meng, Xiandong Meng, Jian Zhang Single image-to-3D generation is pivotal for crafting controllable 3D assets. Given its underconstrained nature, we leverage geometric priors from a 3D novel view generation diffusion model and appearance priors from a 2D image generation method to guide the optimization process. We note that a disparity exists between the training datasets of 2D and 3D diffusion models, leading to their outputs showing marked differences in appearance. Specifically, 2D models tend to deliver more detailed visuals, whereas 3D models produce consistent yet over-smooth results across different views. Hence, we optimize a set of 3D Gaussians using 3D priors in spatial domain to ensure geometric consistency, while exploiting 2D priors in the frequency domain through Fourier transform for higher visual quality. This 2D-3D hybrid Fourier Score Distillation objective function (dubbed hy-FSD), can be integrated into existing 3D generation methods, yielding significant performance improvements. With this technique, we further develop an image-to-3D generation pipeline to create high-quality 3D objects within one minute, named Fourier123. Extensive experiments demonstrate that Fourier123 excels in efficient generation with rapid convergence speed and visual-friendly generation results. This paper proposes Fourier123, an efficient image-to-3D generation pipeline that leverages both spatial and frequency domain information to generate high-quality 3D objects within one minute. Single image-to-3D generation is crucial for creating controllable 3D assets, but existing methods struggle to balance efficiency and visual quality. The paper introduces hybrid Fourier Score Distillation (hy-FSD) which uses a 3D diffusion model for geometric consistency in the spatial domain and a 2D diffusion model for high-quality appearance in the frequency domain. Fourier123 initializes with a large 3D reconstruction model and then optimizes using hy-FSD. hy-FSD significantly improves the performance of existing optimization-based 3D generation methods. Fourier123 generates high-quality 3D objects with reliable structures and elegant appearances. Fourier123 achieves a good balance between generation quality and speed, producing results within one minute on a single NVIDIA 4090 GPU. The method may occasionally encounter generation failures due to the inherent randomness of the task. Future work could focus on improving the robustness and generalization ability of the method. image-to-3d generation, 3d gaussian splatting, score distillation sampling, diffusion models, frequency domain analysis
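The frequency-domain use of the 2D prior can be illustrated with a simple amplitude-spectrum comparison; this is only a sketch of the general idea, not the paper's hy-FSD gradient, whose score-distillation terms are not reproduced here.

```python
import torch
import torch.nn.functional as F

def fourier_amplitude_loss(rendered: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
    """Compare two image batches in the frequency domain.

    rendered, guidance: (B, C, H, W). Matching the 2D FFT amplitude spectra
    transfers texture/appearance statistics without pinning down exact pixels.
    """
    amp_r = torch.fft.fft2(rendered, norm="ortho").abs()
    amp_g = torch.fft.fft2(guidance, norm="ortho").abs()
    return F.l1_loss(amp_r, amp_g)

loss = fourier_amplitude_loss(torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256))
```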
2405.20510 Report Physically Compatible 3D Object Modeling from a Single Image Minghao Guo, Bohan Wang, Pingchuan Ma, Tianyuan Zhang, Crystal Elaine Owens, Chuang Gan, Joshua B. Tenenbaum, Kaiming He, Wojciech Matusik We present a computational framework that transforms single images into 3D physical objects. The visual geometry of a physical object in an image is determined by three orthogonal attributes: mechanical properties, external forces, and rest-shape geometry. Existing single-view 3D reconstruction methods often overlook this underlying composition, presuming rigidity or neglecting external forces. Consequently, the reconstructed objects fail to withstand real-world physical forces, resulting in instability or undesirable deformation -- diverging from their intended designs as depicted in the image. Our optimization framework addresses this by embedding physical compatibility into the reconstruction process. We explicitly decompose the three physical attributes and link them through static equilibrium, which serves as a hard constraint, ensuring that the optimized physical shapes exhibit desired physical behaviors. Evaluations on a dataset collected from Objaverse demonstrate that our framework consistently enhances the physical realism of 3D models over existing methods. The utility of our framework extends to practical applications in dynamic simulations and 3D printing, where adherence to physical compatibility is paramount. This paper proposes a computational framework that reconstructs physically plausible 3D objects from single images by incorporating physical compatibility constraints. Existing single-view 3D reconstruction methods often neglect physical principles, leading to objects that exhibit instability or unrealistic deformation under real-world forces. This limits their practical utility in applications like simulation and 3D printing. The framework explicitly decomposes the object's geometry into mechanical properties, external forces, and rest-shape geometry, linked through static equilibrium constraints. It then optimizes the rest-shape geometry using implicit differentiation to ensure the object aligns with the input image while adhering to physical laws. The method improves the physical compatibility of 3D models generated by various single-view reconstruction techniques. Objects generated using this framework exhibit enhanced stability, reduced stress, and greater fidelity to the input image under simulated gravity. The framework enables the generation of objects with diverse physical behaviors from the same image by varying material properties. The framework currently relies on predefined material properties and external forces, limiting its automation. Future work includes exploring differentiable mesh conversion for seamless integration with pre-trained reconstruction models and extending the approach to dynamic object reconstruction from videos. 3d reconstruction, physical simulation, static equilibrium, implicit differentiation, fabrication-aware design
2405.20343 Report Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, Kaisheng Ma In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large 2D diffusion models, but they usually suffer from long per-case optimization time with inconsistent issues. Recent works address the problem and generate better 3D results either by finetuning a multi-view diffusion model or training a fast feed-forward model. However, they still lack intricate textures and complex geometries due to inconsistency and limited generated resolution. To simultaneously achieve high fidelity, consistency, and efficiency in single image-to-3D, we propose a novel framework Unique3D that includes a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images with their normal maps, a multi-level upscale process to progressively improve the resolution of generated orthographic multi-views, as well as an instant and consistent mesh reconstruction algorithm called ISOMER, which fully integrates the color and geometric priors into mesh results. Extensive experiments demonstrate that our Unique3D significantly outperforms other image-to-3D baselines in terms of geometric and textural details. Unique3D is a novel image-to-3D framework that efficiently generates high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods suffer from long optimization times, inconsistencies, and limitations in generated resolution, hindering their ability to produce intricate textures and complex geometries. Unique3D aims to address these challenges and achieve high fidelity, consistency, and efficiency in single-image 3D generation. Unique3D uses a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images and normal maps. It then employs a multi-level upscale process to improve resolution and introduces ISOMER, an instant and consistent mesh reconstruction algorithm that integrates color and geometric priors into the final mesh. Unique3D significantly outperforms existing image-to-3D baselines in terms of geometric and textural details, as demonstrated through extensive experiments. The method achieves high resolution and intricate details in both geometry and material, surpassing previous approaches. Unique3D generates high-fidelity, diverse, and multi-view consistent meshes from single-view wild images within 30 seconds. The multi-view prediction model may produce less satisfactory predictions for skewed or non-perspective input images. The geometric coloring algorithm currently does not support texture maps. Future work aims to enhance the robustness of the multi-view prediction model by training on a more extensive and diverse dataset and incorporate texture map support in the coloring algorithm. image-to-3d, 3d mesh generation, diffusion models, mesh reconstruction, isomer
2405.20340 Report MotionLLM: Understanding Human Behaviors from Human Motions and Videos Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose the MoVid-Bench, with carefully manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in the caption, spatial-temporal comprehension, and reasoning ability. Introduced MotionLLM, a unified framework to understand human behaviors from both video and motion data, bridging the gap between these modalities and language. Existing LLM methods for human behavior understanding focus on either video or motion, failing to leverage the complementary advantages of both. Joint modeling is crucial for capturing nuanced body dynamics and semantics. MotionLLM employs a two-stage training strategy: 1) Modality translation to project motion and video data into linguistic space using trainable translators. 2) Motion-video unified instruction tuning to fine-tune both translators and the LLM using a new dataset, MoVid, containing paired video-motion-text data. MotionLLM significantly outperforms previous methods in both motion and video understanding benchmarks. Ablation studies show that integrating motion data improves video understanding, and vice versa, demonstrating the effectiveness of joint modeling. MotionLLM exhibits strong spatial-temporal comprehension and reasoning abilities for human behaviors, paving the way for applications like fitness coaching. The video encoder's limited capacity restricts the amount of video information processed. Future work could explore higher-capacity video encoders and investigate potential negative impacts of LLM advancements. human behavior understanding, large language models, multi-modality learning, video understanding, motion analysis
2405.20339 Report Visual Perception by Large Language Model's Weights Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source. This paper proposes VLoRA, a novel parameter space alignment paradigm for Multimodal Large Language Models (MLLMs) that enhances efficiency by representing visual information as model weights instead of using visual tokens. Existing MLLMs, based on input space alignment with visual tokens, suffer from high computational costs due to increased input sequence length, especially for high-resolution images. VLoRA addresses this inefficiency by eliminating the need for visual tokens in LLM input. VLoRA uses a vision encoder to extract visual features from an image and then converts these features into perceptual weights using a perceptual weights generator. These weights, designed with a low-rank property similar to LoRA, are directly merged with the LLM's weights, enabling visual perception without extra input tokens. VLoRA achieves comparable performance to state-of-the-art MLLMs on benchmarks like MMBench, ScienceQA, HallusionBench, and MMMU. It significantly reduces computational overhead, requiring only 8% of the FLOPs of LLaVA-v1.5 for inference. Ablation studies demonstrate the impact of different components, such as the type of LLM weights integrated and the rank of perceptual weights. The current vision encoder, CLIP, might not be optimal for converting features to model weights, demanding exploration of more suitable encoders. Using separate perceptual weights generators for each weight type may limit inter-weight correlation, suggesting a potential improvement by generating all weight types from a single generator. multimodal large language models, parameter space alignment, perceptual weights, low-rank adaptation, computational efficiency
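A minimal sketch of the parameter-space alignment idea described above: pooled visual features are mapped to a low-rank weight delta that is added to a frozen LLM weight, LoRA-style, so no visual tokens enter the input sequence. The module name, pooling, and dimensions are hypothetical stand-ins, not the released VLoRA code.

```python
# Hypothetical sketch of perceptual-weights generation (LoRA-style low-rank delta).
import torch
import torch.nn as nn

class PerceptualWeightsGenerator(nn.Module):
    """Maps a pooled visual feature to a low-rank weight update dW = A @ B."""
    def __init__(self, vis_dim: int, llm_dim: int, rank: int = 8):
        super().__init__()
        self.to_a = nn.Linear(vis_dim, llm_dim * rank)  # factors A: (llm_dim, rank)
        self.to_b = nn.Linear(vis_dim, rank * llm_dim)  # factors B: (rank, llm_dim)
        self.llm_dim, self.rank = llm_dim, rank

    def forward(self, vis_feat: torch.Tensor) -> torch.Tensor:
        a = self.to_a(vis_feat).view(-1, self.llm_dim, self.rank)
        b = self.to_b(vis_feat).view(-1, self.rank, self.llm_dim)
        return a @ b  # per-image dW, merged into a frozen LLM weight as W + dW

vis_feat = torch.randn(2, 768)                      # pooled CLIP-like features (toy size)
gen = PerceptualWeightsGenerator(vis_dim=768, llm_dim=512, rank=8)
print(gen(vis_feat).shape)                          # torch.Size([2, 512, 512])
```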
2405.20337 Report OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffers from inefficiency in modeling long-term temporal evolutions. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16s-videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving. Code is available at: https://github.com/wzzheng/OccSora. This paper proposes OccSora, a diffusion-based 4D occupancy generation model that simulates the development of 3D worlds for autonomous driving, conditioned on a trajectory prompt. Understanding the evolution of 3D scenes is crucial for effective autonomous driving. Existing methods struggle to efficiently model long-term temporal evolutions. The approach utilizes a 4D scene tokenizer to compress 4D occupancy data into compact representations. Then, a diffusion transformer learns from these representations and generates 4D occupancy conditioned on trajectory information. OccSora achieves high-quality reconstruction for long-sequence occupancy videos. The model generates realistic 16s-long videos with authentic 3D layout and temporal consistency. OccSora exhibits the ability to generate diverse scenes conditioned on different input trajectories. The granularity of voxel data limits the level of detail in the generated scenes. Inconsistent details for moving objects suggest a need for larger and more diverse training data. autonomous driving, world models, 4d occupancy, diffusion models, trajectory generation
2405.20336 Report RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at https://vis-www.cs.umass.edu/RapVerse. This paper introduces a novel framework for the simultaneous generation of 3D whole-body motions and singing vocals directly from textual lyrics, aiming to create more immersive and realistic digital interactions. This endeavor is crucial for enhancing virtual performances, interactive gaming, and virtual avatar realism by creating a more expressive and nuanced communication of emotions, intentions, and context in digital content. The authors introduce 'RapVerse,' a large-scale dataset with lyrics, vocals, and 3D motions. They employ VQVAEs to represent motion as discrete tokens and a Vocal2unit model for quantized audio tokens. A transformer-based architecture then jointly models these modalities for unified generation. The proposed framework generates realistic singing vocals and human motions directly from text, achieving temporal alignment between the two modalities. The model rivals the performance of specialized single-modality generation systems, demonstrating its effectiveness in joint generation. Using compositional VQVAEs for motion encoding, particularly separate ones for face, body, and hand, is crucial for capturing detailed facial expressions, leading to more realistic motion synthesis. The current dataset, 'RapVerse,' is limited to rap music, and expanding to other music genres is left for future work. Future work could explore multi-performer audio and motion generation, such as virtual live bands, for broader applications. text-to-speech, text-to-motion, multimodal generation, deep learning, computer vision
2405.20334 Report VividDream: Generating 3D Scene with Ambient Dynamics Yao-Chih Lee, Yi-Ting Chen, Andrew Wang, Ting-Hsuan Liao, Brandon Y. Feng, Jia-Bin Huang We introduce VividDream, a method for generating explorable 4D scenes with ambient dynamics from a single input image or text prompt. VividDream first expands an input image into a static 3D point cloud through iterative inpainting and geometry merging. An ensemble of animated videos is then generated using video diffusion models with quality refinement techniques and conditioned on renderings of the static 3D scene from the sampled camera trajectories. We then optimize a canonical 4D scene representation using an animated video ensemble, with per-video motion embeddings and visibility masks to mitigate inconsistencies. The resulting 4D scene enables free-view exploration of a 3D scene with plausible ambient scene dynamics. Experiments demonstrate that VividDream can provide human viewers with compelling 4D experiences generated based on diverse real images and text prompts. VividDream: a novel method for generating explorable 4D scenes with ambient dynamics from a single input image or text prompt. Current research on 4D generation primarily focuses on individual objects, lacking the ability to create comprehensive and immersive 4D scenes with ambient motion. The method consists of three stages: 1) Expanding an initial 3D point cloud via iterative inpainting and geometry merging, 2) Generating an ensemble of animated videos using diffusion models conditioned on renderings of the static scene, 3) Optimizing a 4D scene representation using the animated videos, addressing inconsistencies with visibility masking and per-video motion embeddings. Generates compelling 4D scene experiences with plausible ambient dynamics from real images and text prompts. Overcomes the limitations of single-video reconstruction by utilizing multi-view animation and mitigating inconsistencies. Enables free-view exploration of the generated 4D scenes, offering a more immersive experience compared to static 3D. Reliance on a series of successful processes in 3D scene generation and video generation can lead to quality degradation if any stage fails (e.g., inaccurate depth estimation). Limited control over scene motion generation, particularly for non-realistic images, highlighting the need for more advanced video generation models. 4d scene generation, ambient dynamics, text-to-3d, video diffusion models, multi-view animation
2405.20330 Report 4DHands: Reconstructing Interactive Hands in 4D with Transformers Dixuan Lin, Yuxiang Zhang, Mengcheng Li, Yebin Liu, Wei Jing, Qi Yan, Qianying Wang, Hongwen Zhang In this paper, we introduce 4DHands, a robust approach to recovering interactive hand meshes and their relative movement from monocular inputs. Our approach addresses two major limitations of previous methods: lacking a unified solution for handling various hand image inputs and neglecting the positional relationship of two hands within images. To overcome these challenges, we develop a transformer-based architecture with novel tokenization and feature fusion strategies. Specifically, we propose a Relation-aware Two-Hand Tokenization (RAT) method to embed positional relation information into the hand tokens. In this way, our network can handle both single-hand and two-hand inputs and explicitly leverage relative hand positions, facilitating the reconstruction of intricate hand interactions in real-world scenarios. As such tokenization indicates the relative relationship of two hands, it also supports more effective feature fusion. To this end, we further develop a Spatio-temporal Interaction Reasoning (SIR) module to fuse hand tokens in 4D with attention and decode them into 3D hand meshes and relative temporal movements. The efficacy of our approach is validated on several benchmark datasets. The results on in-the-wild videos and real-world scenarios demonstrate the superior performances of our approach for interactive hand reconstruction. More video results can be found on the project page: https://4dhands.github.io. 4DHands, a robust method for reconstructing interactive hand meshes and their relative motion from monocular images, addressing limitations of previous approaches in handling diverse hand inputs and capturing inter-hand relationships. Accurate and stable 4D hand mesh recovery is crucial for applications like VR/AR, HCI, robotics, and embodied AI, particularly in real-world scenarios with complex hand interactions. Transformer-based architecture featuring (1) Relation-aware Two-Hand Tokenization (RAT) to embed positional information into hand tokens, enabling unified handling of single/two-hand inputs and capturing relative hand positions; (2) Spatio-temporal Interaction Reasoning (SIR) module for fusing 4D hand features and decoding them into 3D meshes and temporal movements. Outperforms state-of-the-art methods on InterHand2.6M and DexYCB datasets for both single and two-hand mesh reconstruction. Shows superior stability and accuracy on in-the-wild datasets (HIC, ARCTIC, RenderIH) compared to previous methods. Achieves robust 4D hand recovery even with occlusions and motion blur by effectively fusing temporal information. Performance slightly degrades on in-the-wild datasets compared to InterHand2.6M due to the domain gap. Future work includes exploring hand-object interactions and incorporating hand gestures for more comprehensive understanding. 4d hand mesh recovery, monocular reconstruction, transformer, hand interaction, spatio-temporal reasoning
2405.20327 Report GECO: Generative Image-to-3D within a SECOnd Chen Wang, Jiatao Gu, Xiaoxiao Long, Yuan Liu, Lingjie Liu 3D generation has seen remarkable progress in recent years. Existing techniques, such as score distillation methods, produce notable results but require extensive per-scene optimization, impacting time efficiency. Alternatively, reconstruction-based approaches prioritize efficiency but compromise quality due to their limited handling of uncertainty. We introduce GECO, a novel method for high-quality 3D generative modeling that operates within a second. Our approach addresses the prevalent issues of uncertainty and inefficiency in current methods through a two-stage approach. In the initial stage, we train a single-step multi-view generative model with score distillation. Then, a second-stage distillation is applied to address the challenge of view inconsistency from the multi-view prediction. This two-stage process ensures a balanced approach to 3D generation, optimizing both quality and efficiency. Our comprehensive experiments demonstrate that GECO achieves high-quality image-to-3D generation with an unprecedented level of efficiency. This paper introduces GECO, a novel method for high-quality 3D generative modeling that operates within a second, addressing uncertainty and inefficiency issues in existing methods. Generating 3D assets is crucial for various applications, but existing methods are either time-consuming (score distillation) or compromise quality (reconstruction-based). GECO aims to bridge this gap, enabling fast and high-quality 3D generation. GECO utilizes a two-stage distillation approach: 1) Training a single-step multi-view generative model with score distillation from a pre-trained multi-view diffusion model. 2) Addressing view inconsistency through a second-stage distillation, jointly finetuning the multi-view generator and a pretrained 3D reconstruction model. GECO achieves high-quality 3D generation within a second, surpassing previous feed-forward baselines in visual quality, particularly in unseen views. Quantitative comparisons on the GSO dataset demonstrate GECO's superior performance in PSNR, SSIM, and LPIPS compared to existing methods, including those relying on multi-step diffusion sampling. GECO exhibits diversity, generating varied 3D models from different random seeds for the same input image, highlighting its generative capabilities. The training process involves two stages, which could be simplified in future work. The quality of generated 3D models is limited by the consistency of multi-step sampling results from multi-view diffusion models. 3d generation, score distillation, gaussian splatting, image-to-3d, generative modeling
2405.20325 Report MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang Jiang Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing. To introduce conditional controls to the denoising process, MotionFollower leverages two of our proposed lightweight signal controllers, one for poses and the other for appearances, both of which consist of convolution blocks without involving heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture, including the reconstruction and editing branches, which significantly enhance the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers and losses during the score estimation. The resulting gradients thus inject appropriate guidance to the intermediate latents, forcing the model to preserve the original background details and protagonists' appearances without interfering with the motion modification. Experiments demonstrate the competitive motion editing ability of MotionFollower qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory while delivering superior motion editing performance and exclusively supporting large camera movements and actions. MotionFollower, a lightweight score-guided diffusion model for video motion editing that transfers motion from a target video to a source video while preserving the source's background, protagonist's appearance, and camera movement. Existing video editing models mainly focus on attribute-level editing and struggle to modify motion information while preserving other video details. MotionFollower addresses this gap by enabling motion editing while maintaining fidelity to the source video. MotionFollower employs two lightweight signal controllers (Pose Controller and Reference Controller) for efficient pose and appearance control. It also introduces a novel score guidance principle with a two-branch architecture (reconstruction and editing) to enforce consistency and preserve background and foreground details. MotionFollower achieves accurate motion editing and appearance preservation, outperforming competitors in qualitative comparisons. Quantitative results demonstrate superior single-frame quality and video fidelity compared to state-of-the-art methods, with an 80% reduction in GPU memory compared to MotionEditor. The model effectively handles large camera movements and complex backgrounds, demonstrating robustness and versatility in motion editing. MotionFollower struggles with background inpainting when the source video contains occlusions of small, distinct objects. Future work includes exploring explicit inpainting adaptors to address background recovery in challenging scenarios. video motion editing, diffusion models, score guidance, appearance preservation, camera movement
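A rough sketch of the score-guidance idea: gradients of a consistency regularizer between the reconstruction and editing branches are injected into the editing latent before the regular denoiser update. The placeholder regularizer, `denoise_fn` signature, and latent shapes are assumptions, not the paper's actual losses.

```python
import torch
import torch.nn.functional as F

def score_guided_step(z_edit, z_recon, denoise_fn, t, guidance_scale=1.0):
    """Nudge the editing-branch latent toward consistency with the frozen
    reconstruction branch, then run the usual denoiser update."""
    z_edit = z_edit.detach().requires_grad_(True)
    # placeholder regularizer: match channel-wise statistics of (B, C, H, W) latents
    loss = F.mse_loss(z_edit.mean(dim=(2, 3)), z_recon.mean(dim=(2, 3)))
    grad = torch.autograd.grad(loss, z_edit)[0]
    z_guided = z_edit - guidance_scale * grad      # gradient injected into the latent
    return denoise_fn(z_guided, t)
```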
2405.20324 Report Don't drop your samples! Coherence-aware training benefits Conditional diffusion Nicolas Dufour, Victor Besnier, Vicky Kalogeiton, David Picard Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded. The paper introduces Coherence-Aware Diffusion (CAD), a novel method for training conditional diffusion models that incorporates a coherence score to address the issue of noisy or unreliable conditional information, leading to improved generation quality and adherence to conditions. Training conditional diffusion models often relies on large, noisy datasets with misaligned image-condition pairs. Existing filtering methods discard valuable data, hindering performance. This paper introduces a method to leverage this discarded data to improve generation quality. The proposed CAD method estimates a coherence score, reflecting the alignment between an image and its condition. This score is then used to condition the diffusion model alongside the original condition, enabling it to learn from both well-aligned and misaligned pairs. The authors also propose Coherence-Aware Classifier-Free Guidance (CA-CFG), refining CFG using coherence scores for enhanced image quality. CAD significantly outperforms baselines in terms of FID, achieving a 15-point improvement in text-to-image generation, while maintaining comparable CLIP scores. User studies overwhelmingly favor CAD-generated images, indicating superior quality and prompt adherence. Incorporating coherence scores improves semantic segmentation, enabling better object shape reconstruction and scene understanding, even with noisy or incomplete segmentation maps. The success of CAD depends heavily on the quality of coherence score estimation, which needs further investigation. Future work includes exploring more robust and reliable methods for obtaining coherence scores to enhance CAD's effectiveness and generalizability. conditional image generation, diffusion models, coherence score, text-to-image synthesis, semantic segmentation
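The conditioning change is small enough to sketch: the denoiser receives a per-sample coherence score alongside the usual condition, so low-coherence annotations are discounted rather than discarded. The `noise_sched` interface and the model signature are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cad_training_step(model, x0, cond_emb, coherence, noise_sched):
    """Standard noise-prediction loss, but the model is conditioned on both the
    condition embedding and a scalar coherence score in [0, 1]."""
    t = torch.randint(0, noise_sched.num_steps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    xt = noise_sched.add_noise(x0, noise, t)        # assumed scheduler interface
    pred = model(xt, t, cond_emb, coherence)        # coherence passed as an extra input
    return F.mse_loss(pred, noise)
```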
2405.20323 Report $\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian ($\textit{S}^3$Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our $\textit{S}^3$Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations. Code is available at: https://github.com/nnanhuang/S3Gaussian/. This paper proposes $S^3$Gaussian, the first self-supervised method to decompose dynamic and static 3D Gaussians in street scenes without manual annotations, for efficient 3D scene reconstruction. Photorealistic 3D reconstruction of street scenes is crucial for autonomous driving simulators, and while 3D Gaussian Splatting (3DGS) is promising for its speed and explicitness, existing methods often require costly 3D bounding box annotations. The method uses 3D Gaussians and a novel spatial-temporal field network. This network, with a multi-resolution Hexplane encoder and a multi-head Gaussian decoder, captures 4D dynamics and deforms the Gaussians, enabling self-supervised scene decomposition. $S^3$Gaussian achieves state-of-the-art rendering quality in scene reconstruction and novel view synthesis on Waymo-Open dataset. It effectively decomposes static and dynamic scenes without 3D annotations. The method surpasses previous approaches in reconstructing distant dynamic objects and capturing scene details. Modeling objects at high speeds is challenging due to the high variance in deformation fields and sparse views. Future work includes addressing the limitations in reconstructing high-speed dynamic scenes. 3d scene reconstruction, autonomous driving, gaussian splatting, self-supervised learning, dynamic scenes
2405.20320 Report Improving the Training of Rectified Flows Sangyun Lee, Zinan Lin, Giulia Fanti Diffusion models have shown great promise for image and video generation, but sampling from state-of-the-art models requires expensive numerical integration of a generative ODE. One approach for tackling this problem is rectified flows, which iteratively learn smooth ODE paths that are less susceptible to truncation error. However, rectified flows still require a relatively large number of function evaluations (NFEs). In this work, we propose improved techniques for training rectified flows, allowing them to compete with knowledge distillation methods even in the low NFE setting. Our main insight is that under realistic settings, a single iteration of the Reflow algorithm for training rectified flows is sufficient to learn nearly straight trajectories; hence, the current practice of using multiple Reflow iterations is unnecessary. We thus propose techniques to improve one-round training of rectified flows, including a U-shaped timestep distribution and LPIPS-Huber premetric. With these techniques, we improve the FID of the previous 2-rectified flow by up to 72% in the 1 NFE setting on CIFAR-10. On ImageNet 64$\times$64, our improved rectified flow outperforms the state-of-the-art distillation methods such as consistency distillation and progressive distillation in both one-step and two-step settings and rivals the performance of improved consistency training (iCT) in FID. Code is available at https://github.com/sangyun884/rfpp. This paper introduces improved training techniques for rectified flows, enabling them to achieve competitive performance with knowledge distillation methods in the low function evaluation (NFE) regime, particularly for one- and two-step generation. Rectified flows offer advantages over knowledge distillation methods, such as generalizability to arbitrary distributions, support for inversion, likelihood evaluation, and flexible sample quality control, making them a promising alternative. The authors observe that the optimal 2-rectified flow generally exhibits near-zero trajectory curvature. Building upon this, they propose improved training techniques including a U-shaped timestep distribution to focus on challenging timesteps and a LPIPS-Huber premetric to enhance perceptual similarity. The improved 2-rectified flow++ outperforms state-of-the-art distillation methods in the 1-2 NFE regime on CIFAR-10 and ImageNet 64x64. 2-rectified flow++ achieves substantial FID reductions of up to 72% compared to vanilla 2-rectified flows. The study demonstrates the potential computational efficiency of Reflow compared to other distillation methods. While showing promise, 2-rectified flow++ doesn’t yet outperform the best consistency models like iCT. The training process for 2-rectified flow++ is slower than previous rectified flows due to the computational overhead from the LPIPS loss. rectified flows, diffusion models, generative modeling, knowledge distillation, low function evaluation
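A compact sketch of the improved one-round training objective: sample timesteps from a U-shaped distribution and regress the straight-line velocity with a robust premetric. Beta(0.5, 0.5) is used here as one simple U-shaped choice, and Huber stands in for the paper's LPIPS-Huber premetric; both are simplifications.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(v_model, x0, x1):
    """Flow-matching loss on (sample, noise) pairs with a U-shaped timestep
    distribution; x0 are data/teacher samples, x1 are the paired noises."""
    b = x0.shape[0]
    # Beta(0.5, 0.5) puts more mass near t=0 and t=1 (a simple U-shaped choice)
    t = torch.distributions.Beta(0.5, 0.5).sample((b,)).to(x0.device)
    t_ = t.view(b, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * x1           # linear interpolation along the path
    target_v = x1 - x0                        # straight-line velocity target
    pred_v = v_model(xt, t)
    return F.huber_loss(pred_v, target_v)     # robust stand-in for LPIPS-Huber
```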
2405.20310 Report A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D Reconstruction Jianghao Shen, Nan Xue, Tianfu Wu Learning 3D scene representation from a single-view image is a long-standing fundamental problem in computer vision, with the inherent ambiguity in predicting contents unseen from the input view. Built on the recently proposed 3D Gaussian Splatting (3DGS), the Splatter Image method has made promising progress on fast single-image novel view synthesis via learning a single 3D Gaussian for each pixel based on the U-Net feature map of an input image. However, it has limited expressive power to represent occluded components that are not observable in the input view. To address this problem, this paper presents a Hierarchical Splatter Image method in which a pixel is worth more than one 3D Gaussians. Specifically, each pixel is represented by a parent 3D Gaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are learned as done in the vanilla Splatter Image. Child 3D Gaussians are learned via a lightweight Multi-Layer Perceptron (MLP) which takes as input the projected image features of a parent 3D Gaussian and the embedding of a target camera view. Both parent and child 3D Gaussians are learned end-to-end in a stage-wise way. The joint condition of input image features from eyes of the parent Gaussians and the target camera position facilitates learning to allocate child Gaussians to ``see the unseen'', recovering the occluded details that are often missed by parent Gaussians. In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D datasets with state-of-the-art performance obtained, especially showing promising capabilities of reconstructing occluded contents in the input view. This paper introduces Hierarchical Splatter Image, a novel method for single-view 3D reconstruction that enhances the existing Splatter Image method by employing a hierarchy of parent-child 3D Gaussians to represent each pixel. The importance stems from addressing the limitations of conventional single-view 3D reconstruction techniques, particularly in representing occluded structures not visible in the input view. This hierarchical representation aims to improve the accuracy and reliability of 3D reconstruction from a single image. The methodology involves a two-stage learning process. Initially, parent 3D Gaussians are learned similarly to the vanilla Splatter Image. Subsequently, child 3D Gaussians are learned using lightweight MLPs, taking inputs from the parent Gaussian features and target camera view embeddings to recover occluded details. The proposed method achieves state-of-the-art performance on four single-image 3D reconstruction benchmarks (ShapeNet-SRN Chairs & Cars, CO3D Hydrants & Teddybears). It demonstrates superior reconstruction of occluded content compared to the baseline Splatter Image method. The approach maintains comparable model complexity to Splatter Image with a negligible increase in computational overhead. The performance slightly degrades when using relative camera positions instead of world coordinates. Future work may explore incorporating richer input view information within the parent Gaussian features to improve relative camera pose handling. 3d reconstruction, single-view reconstruction, 3d gaussian splatting, novel view synthesis, hierarchical representation
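The parent-child structure can be sketched as a small head that, per pixel, predicts K child-Gaussian offsets from the parent Gaussian's image feature and a target-camera embedding. The dimensions and the 14-parameter Gaussian layout (position, scale, rotation, opacity, color) below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ChildGaussianHead(nn.Module):
    """Predicts K child-Gaussian parameter offsets per pixel from the parent
    Gaussian's projected image feature and a target-camera embedding."""
    def __init__(self, feat_dim=64, cam_dim=16, k_children=3, gauss_dim=14):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + cam_dim, 128), nn.ReLU(),
            nn.Linear(128, k_children * gauss_dim),
        )
        self.k, self.d = k_children, gauss_dim

    def forward(self, parent_feat, cam_emb):
        # parent_feat: (N, feat_dim) per-pixel features; cam_emb: (N, cam_dim)
        out = self.mlp(torch.cat([parent_feat, cam_emb], dim=-1))
        return out.view(-1, self.k, self.d)   # offsets added to the parent parameters
```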
2405.20305 Report Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, Kwonjoon Lee We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real-world. While significant efforts have been made towards anticipating future actions, prior approaches do not take into account the aspect of plausibility in an action sequence. To address this limitation, we explore the generative capability of a large video-language model in our work and further, develop the understanding of plausibility in an action sequence by introducing two objective functions, a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with plausible action sequence learning loss. This loss helps the model to differentiate between plausible and not plausible action sequences and also helps the model to learn implicit temporal cues crucial for the task of action anticipation. The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation. Introduces PlausiVL, a Video-Language Model (VLM) for anticipating plausible future action sequences in videos by incorporating temporal logic and reducing action repetition. Action anticipation is crucial for AI agents to understand and react to their environment, but current methods struggle to generate plausible and diverse sequences of actions. PlausiVL uses a Q-former to embed videos and align them with text embeddings in a large language model. It is trained with two novel losses: (1) Plausible Action Sequence Learning Loss, which uses counterfactuals based on temporal logic and verb-noun constraints to distinguish plausible sequences, and (2) Long-Horizon Action Repetition Loss, which penalizes repeated actions over longer timespans. PlausiVL outperforms existing VLM and other action anticipation methods on Ego4D and EPIC-Kitchens datasets. Ablation studies confirm that both novel losses contribute to the model's improved performance in generating plausible and diverse action sequences. PlausiVL demonstrates robustness to long-tail distributions and generalizability to unseen data. The model may still hallucinate implausible sequences, which warrants further investigation. Future work could explore incorporating additional modalities, such as audio, to enhance the model's understanding of the scene. action anticipation, video-language models, temporal logic, plausibility, action repetition
2405.20283 Report TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes Minghao Guo, Bohan Wang, Kaiming He, Wojciech Matusik We present TetSphere splatting, an explicit, Lagrangian representation for reconstructing 3D shapes with high-quality geometry. In contrast to conventional object reconstruction methods which predominantly use Eulerian representations, including both neural implicit (e.g., NeRF, NeuS) and explicit representations (e.g., DMTet), and often struggle with high computational demands and suboptimal mesh quality, TetSphere splatting utilizes an underused but highly effective geometric primitive -- tetrahedral meshes. This approach directly yields superior mesh quality without relying on neural networks or post-processing. It deforms multiple initial tetrahedral spheres to accurately reconstruct the 3D shape through a combination of differentiable rendering and geometric energy optimization, resulting in significant computational efficiency. Serving as a robust and versatile geometry representation, Tet-Sphere splatting seamlessly integrates into diverse applications, including single-view 3D reconstruction, image-/text-to-3D content generation. Experimental results demonstrate that TetSphere splatting outperforms existing representations, delivering faster optimization speed, enhanced mesh quality, and reliable preservation of thin structures. This paper introduces TetSphere Splatting (Tet-Splatting), a novel geometry representation for reconstructing 3D shapes using an explicit, Lagrangian approach based on deforming tetrahedral meshes. Existing methods for 3D shape reconstruction, including neural implicit representations and Eulerian approaches, often suffer from high computational demands and suboptimal mesh quality. Tet-Splatting aims to address these limitations by providing fast optimization, enhanced mesh quality, and robust handling of thin structures. Tet-Splatting represents 3D shapes using a collection of deformed tetrahedral spheres. It reconstructs the target shape by optimizing the positions of the tetrahedra vertices through differentiable rendering and geometric energy minimization, including bi-harmonic energy for smoothness and local injectivity for element orientation. Tet-Splatting achieves superior mesh quality compared to state-of-the-art methods on the Google Scanned Objects dataset. It demonstrates faster optimization speed and reduced memory usage, particularly beneficial for image-to-3D and text-to-3D generation. Tet-Splatting effectively handles shapes with complex topologies and thin structures. The current implementation does not guarantee topology preservation during the union of tetrahedral spheres. Future work could explore incorporating direct 3D supervision with volumetric data. 3d reconstruction, lagrangian representation, tetrahedral mesh, mesh quality, differentiable rendering
2405.20282 Report SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow Chaoyang Wang, Xiangtai Li, Lu Qi, Henghui Ding, Yunhai Tong, Ming-Hsuan Yang Semantic segmentation and semantic image synthesis are two representative tasks in visual perception and generation. While existing methods consider them as two distinct tasks, we propose a unified diffusion-based framework (SemFlow) and model them as a pair of reverse problems. Specifically, motivated by rectified flow theory, we train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks. As the training object is symmetric, samples belonging to the two distributions, images and semantic masks, can be effortlessly transferred reversibly. For semantic segmentation, our approach solves the contradiction between the randomness of diffusion outputs and the uniqueness of segmentation results. For image synthesis, we propose a finite perturbation approach to enhance the diversity of generated results without changing the semantic categories. Experiments show that our SemFlow achieves competitive results on semantic segmentation and semantic image synthesis tasks. We hope this simple framework will motivate people to rethink the unification of low-level and high-level vision. Project page: https://github.com/wang-chaoyang/SemFlow. This paper proposes SemFlow, a unified diffusion-based framework for semantic segmentation and semantic image synthesis, modeling them as a pair of reverse problems using rectified flow. This work bridges the gap between traditionally distinct methodologies for semantic segmentation (discriminative models) and semantic image synthesis (generative models). SemFlow leverages rectified flow, an ODE framework, to learn the bi-directional mapping between image and semantic mask distributions. It introduces pseudo masks, bi-directional training, and a finite perturbation strategy to enhance synthesis diversity. SemFlow achieves competitive semantic segmentation results compared to discriminative models while using fewer inference steps. It demonstrates promising performance on semantic image synthesis, outperforming some specialist models in FID and LPIPS. The finite perturbation method enables multi-modal image generation from a single semantic layout. There is still a performance gap in semantic segmentation accuracy compared to state-of-the-art discriminative models. Future work could explore incorporating stronger priors or guidance mechanisms within the unified framework. semantic segmentation, semantic image synthesis, diffusion models, rectified flow, deep learning
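Because the learned object is a velocity field of an ODE, the same network transports latents in both directions; a minimal Euler integrator makes this reversibility concrete. The step count and the `v_model(z, t)` signature are assumptions for the sketch.

```python
import torch

@torch.no_grad()
def transport(v_model, z, reverse=False, steps=25):
    """Euler integration of dz/dt = v(z, t): forward (t: 0 -> 1) maps image
    latents toward mask latents (segmentation); reverse (t: 1 -> 0) maps mask
    latents toward image latents (synthesis) with the same model."""
    ts = torch.linspace(0.0, 1.0, steps + 1)
    if reverse:
        ts = ts.flip(0)
    for i in range(steps):
        t0, t1 = ts[i], ts[i + 1]
        t_batch = torch.full((z.shape[0],), float(t0), device=z.device)
        z = z + (t1 - t0) * v_model(z, t_batch)   # step size is negative in reverse
    return z
```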
2405.20279 Report CV-VAE: A Compatible Video VAE for Latent Generative Video Models Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. The temporal compression is simply realized by uniform frame sampling which results in unsmooth motion between consecutive frames. Currently, there is no commonly used continuous video (3D) VAE for latent diffusion-based video models in the research community. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering the compatibility with existing T2I models will result in a latent space gap between them, which takes huge computational resources to bridge during training even with the T2I models as initialization. To address this issue, we propose a method for training a video VAE of latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which involves formulating a regularization loss using the image VAE. Benefiting from the latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than simply sampling video frames at equal intervals. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE. This paper introduces CV-VAE, a novel video Variational Autoencoder (VAE) designed to be compatible with pre-trained image and video models like Stable Diffusion, addressing the lack of a commonly used continuous 3D VAE for latent diffusion-based video models. Current video generation models often rely on uniform frame sampling for temporal compression, leading to unsmooth motion. This work aims to enable the generation of smoother, higher-FPS videos by providing a truly spatio-temporally compressed continuous latent space. The authors propose a novel latent space regularization method to ensure compatibility between the video VAE and pre-trained models, minimizing distribution shifts. They also introduce an efficient 2D+3D architecture for the video VAE, leveraging pre-trained weights and incorporating 3D convolutions for temporal modeling. CV-VAE achieves state-of-the-art image and video reconstruction quality while maintaining compatibility with existing diffusion models. Integrating CV-VAE into pre-trained video models like SVD significantly improves video generation quality, producing smoother motion and higher FPS with minimal finetuning. Ablation studies validate the effectiveness of the proposed latent space regularization and mapping functions in improving video reconstruction and generation. The performance of CV-VAE is limited by the channel dimension of the latent space, which is constrained by the compatibility requirement with existing models. Future work could explore higher-dimensional latent spaces and investigate the impact on reconstruction and generation quality. video generation, variational autoencoder (vae), latent space, stable diffusion, temporal compression
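The compatibility constraint boils down to a latent regularization term that pulls the 3D video VAE's latents toward those of a frozen 2D image VAE on sub-sampled frames. The frame-to-latent alignment and the `image_vae.encode` interface below are simplifications, not the paper's exact mapping functions.

```python
import torch
import torch.nn.functional as F

def latent_regularization(video_latents, frames, image_vae, stride=4):
    """Regularizes 3D video-VAE latents toward a frozen 2D image VAE's latents.
    Assumed shapes: frames (B, T, C, H, W), video_latents (B, T // stride, c, h, w),
    one latent step per sampled frame (a simplified alignment)."""
    b, t, c, h, w = frames.shape
    sampled = frames[:, ::stride].reshape(-1, c, h, w)   # one frame per latent step
    with torch.no_grad():
        ref = image_vae.encode(sampled)                  # frozen image-VAE latents
    ref = ref.view(b, -1, *ref.shape[1:])
    return F.l1_loss(video_latents, ref)
```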
2405.20224 Report EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images Wangbo Yu, Chaoran Feng, Jiye Tang, Xu Jia, Li Yuan, Yonghong Tian 3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in 3D scene reconstruction and novel view synthesis. However, its training heavily depends on high-quality, sharp images and accurate camera poses. Fulfilling these requirements can be challenging in non-ideal real-world scenarios, where motion-blurred images are commonly encountered in high-speed moving cameras or low-light environments that require long exposure times. To address these challenges, we introduce Event Stream Assisted Gaussian Splatting (EvaGaussians), a novel approach that integrates event streams captured by an event camera to assist in reconstructing high-quality 3D-GS from blurry images. Capitalizing on the high temporal resolution and dynamic range offered by the event camera, we leverage the event streams to explicitly model the formation process of motion-blurred images and guide the deblurring reconstruction of 3D-GS. By jointly optimizing the 3D-GS parameters and recovering camera motion trajectories during the exposure time, our method can robustly facilitate the acquisition of high-fidelity novel views with intricate texture details. We comprehensively evaluated our method and compared it with previous state-of-the-art deblurring rendering methods. Both qualitative and quantitative comparisons demonstrate that our method surpasses existing techniques in restoring fine details from blurry images and producing high-fidelity novel views. EvaGaussians is introduced, a novel framework that integrates event streams from an event camera to reconstruct high-quality 3D Gaussian Splats from motion-blurred images, enabling real-time, high-fidelity novel view synthesis. 3D Gaussian Splatting (3D-GS), while efficient in 3D scene reconstruction and novel view synthesis, heavily relies on sharp images and accurate camera poses, which are often absent in real-world scenarios with motion blur. The method leverages event streams to model motion blur and guide deblurring reconstruction. It uses the EDI model for initial camera trajectory and point cloud estimation. Then, it jointly optimizes 3D-GS parameters and camera trajectories during exposure time, guided by blur and event reconstruction losses. Outperforms state-of-the-art deblurring rendering methods on synthetic and real-world datasets. Demonstrates superior performance in recovering intricate details and color accuracy from motion-blurred images. Enables high-fidelity real-time novel view synthesis. May face challenges with extremely intricate textures and severe blur. Potential for misuse in surveillance applications, raising privacy concerns. 3d gaussian splatting, event cameras, motion deblurring, novel view synthesis, 3d reconstruction
2405.20222 Report MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng We present MOFA-Video, an advanced controllable image animation method that generates video from the given image using various additional controllable signals (such as human landmark references, manual trajectories, and even another provided video) or their combinations. This is different from previous methods which can only work on a specific motion domain or show weak control abilities with diffusion prior. To achieve our goal, we design several domain-aware motion field adapters (i.e., MOFA-Adapters) to control the generated motions in the video generation pipeline. For MOFA-Adapters, we consider the temporal motion consistency of the video and generate the dense motion flow from the given sparse control conditions first, and then, the multi-scale features of the given image are warped as a guided feature for stable video diffusion generation. We simply train two motion adapters for the manual trajectories and the human landmarks individually, since they both contain sparse control information. After training, the MOFA-Adapters in different domains can also work together for more controllable video generation. Project Page: https://myniuuu.github.io/MOFA_Video/ Presents MOFA-Video, a controllable image animation method that generates videos from images using various controllable signals (e.g., landmarks, trajectories) or their combinations. Overcomes limitations of previous methods that either focus on specific object categories or exhibit weak control abilities with diffusion priors, enabling controllable animation of in-the-wild images. Designs domain-aware Motion Field Adapters (MOFA-Adapters) for different motion domains, which generate dense motion fields from sparse control signals and warp image features to guide video diffusion generation. Achieves fine-grained control over object and camera motion with handcrafted trajectories, outperforming DragNUWA in controllability and visual quality. Enables portrait animation from audio using facial landmarks, surpassing StyleHEAT and SadTalker in identity preservation, artifact reduction, and motion naturalness. Allows combining multiple MOFA-Adapters for complex animations, such as controlling facial expressions and background motion simultaneously. Limited ability to generate content significantly different from the input image due to the video diffusion model's training data. May produce visual artifacts like blurriness or structure loss under large motion guidance. image animation, video generation, controllable generation, motion field adaptation, video diffusion models
2405.20216 Report Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee The generation of high-quality human images through text-to-image (T2I) methods is a significant yet challenging task. Distinct from general image generation, human image synthesis must satisfy stringent criteria related to human pose, anatomy, and alignment with textual prompts, making it particularly difficult to achieve realistic results. Recent advancements in T2I generation based on diffusion models have shown promise, yet challenges remain in meeting human-specific preferences. In this paper, we introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO). Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback. We also propose a modified loss function that enhances the DPO training process by minimizing artifacts and improving image fidelity. Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation. Through comprehensive evaluations, we show that our approach significantly advances the state of human image generation, achieving superior results in terms of natural anatomies, poses, and text-image alignment. Presents HG-DPO, a novel method to enhance human image generation in text-to-image models by leveraging Direct Preference Optimization (DPO) Addresses the limitations of existing T2I models in generating high-quality human images that meet complex human preferences regarding anatomy, pose, and alignment with text prompts Proposes a two-pronged approach: 1. Constructs a specialized DPO dataset using AI feedback (PickScore metric) to efficiently generate preferred and non-preferred image pairs. 2. Introduces a modified loss function (statistic matching loss) during DPO training to minimize artifacts and improve image fidelity. Generates human images with more natural anatomies and poses compared to baselines. Demonstrates superior alignment with text prompts, effectively capturing user intent. Adaptable to other human-centric applications, such as personalized text-to-image generation (e.g., improving InstantBooth model). Acknowledges a trade-off between increased image quality and potential decrease in diversity. Limited impact on enhancing fine anatomical details (e.g., fingers). text-to-image generation, human image synthesis, direct preference optimization, diffusion models, ai feedback
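The feedback-free dataset construction reduces to generate-and-rank: sample several images per prompt, score them with an AI preference model, and keep the best and worst as the preferred/rejected DPO pair. `generate` and `score_fn` are assumed callables here (e.g. a diffusion sampler and a PickScore-like scorer), not named APIs from the paper.

```python
import torch

def build_dpo_pairs(prompts, generate, score_fn, n_samples=4):
    """For each prompt, generate several candidates, score them with an AI
    preference model, and keep (prompt, preferred, rejected) triples for DPO."""
    pairs = []
    for p in prompts:
        images = [generate(p) for _ in range(n_samples)]
        scores = torch.tensor([score_fn(p, img) for img in images])
        pairs.append((p, images[int(scores.argmax())], images[int(scores.argmin())]))
    return pairs
```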
2405.20204 Report Jina CLIP: Your CLIP Model Is Also Your Text Retriever Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks. The paper proposes Jina CLIP, a novel multi-task contrastive training method and model that achieves state-of-the-art performance on both text-image and text-text retrieval tasks. CLIP models usually underperform in text-only tasks compared to specialized text models, creating inefficiencies for information retrieval systems. Jina CLIP addresses this by enabling a single model to perform well in both modalities. The methodology involves a three-stage training process: (1) aligning image and short text representations with text pair training, (2) introducing longer, synthetic image captions, and (3) fine-tuning with hard negatives for improved text encoding. Jina CLIP achieves comparable performance to EVA-CLIP on the cross-modal CLIP Benchmark. The model's text encoder performs on par with specialized text models on MTEB Benchmark tasks. It significantly outperforms other CLIP models in text-only tasks, demonstrating the effectiveness of the multi-task training. The model is currently limited to English-language texts due to resource constraints. Future work will focus on extending the model to multilingual contexts. clip, embeddings, multimodal, retrieval, contrastive learning
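The multi-task objective can be illustrated with a few lines: the text tower is trained with a text-image InfoNCE term and a text-text term at the same time, so one encoder serves both multimodal and text-only retrieval. Weights, temperature, and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of aligned embedding pairs."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def multitask_loss(txt_emb, img_emb, query_emb, doc_emb, w_text=1.0):
    """Joint objective: align text with images AND text with related text."""
    return info_nce(txt_emb, img_emb) + w_text * info_nce(query_emb, doc_emb)
```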
2405.20155 Report MotionDreamer: Zero-Shot 3D Mesh Animation from Video Diffusion Models Lukas Uzolas, Elmar Eisemann, Petr Kellnhofer Animation techniques bring digital 3D worlds and characters to life. However, manual animation is tedious and automated techniques are often specialized to narrow shape classes. In our work, we propose a technique for automatic re-animation of arbitrary 3D shapes based on a motion prior extracted from a video diffusion model. Unlike existing 4D generation methods, we focus solely on the motion, and we leverage an explicit mesh-based representation compatible with existing computer-graphics pipelines. Furthermore, our utilization of diffusion features enhances accuracy of our motion fitting. We analyze efficacy of these features for animation fitting and we experimentally validate our approach for two different diffusion models and four animation models. Finally, we demonstrate that our time-efficient zero-shot method achieves a superior performance re-animating a diverse set of 3D shapes when compared to existing techniques in a user study. The project website is located at https://lukas.uzolas.com/MotionDreamer. This paper introduces a novel zero-shot method for animating arbitrary 3D meshes using pre-trained video diffusion models (VDMs), leveraging semantic features extracted from the VDMs for accurate motion fitting. Manual animation is time-consuming and existing automated methods are limited to specific shapes. This method offers a fast, class-agnostic approach to re-animate static 3D objects. The method involves automatically texturing the input mesh, generating motion with a VDM conditioned on the rendered mesh image, and optimizing mesh animation parameters to match the semantic features between the animated mesh and the generated video. User study shows a significant preference for the proposed method over existing techniques in terms of motion naturalness, visual quality, and prompt adherence. Quantitative evaluation on a human motion dataset demonstrates superior pose fitting accuracy compared to using RGB features and competitive performance with a state-of-the-art human pose estimator. Ablation study confirms the benefits of single-view texturing, semantic feature utilization, and regularization losses. The method relies on single-view supervision, limiting its ability to accurately resolve motion-in-depth and handle occlusions. The quality of motion generated by current VDMs can affect the final animation output, highlighting the need for improved VDMs and potential rejection heuristics. 3d animation, video diffusion models, motion fitting, semantic features, zero-shot learning
2405.20141 Report OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation Gonca Yilmaz, Songyou Peng, Francis Engelmann, Marc Pollefeys, Hermann Blum The advent of Vision Language Models (VLMs) transformed image understanding from closed-set classifications to dynamic image-language interactions, enabling open-vocabulary segmentation. Despite this flexibility, VLMs often fall behind closed-set classifiers in accuracy due to their reliance on ambiguous image captions and lack of domain-specific knowledge. We, therefore, introduce a new task domain adaptation for open-vocabulary segmentation, enhancing VLMs with domain-specific priors while preserving their open-vocabulary nature. Existing adaptation methods, when applied to segmentation tasks, improve performance on training queries but can reduce VLM performance on zero-shot text inputs. To address this shortcoming, we propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy. This strategy is designed to enhance open-vocabulary generalization while adapting to the visual domain. Our results outperform other parameter-efficient adaptation strategies in open-vocabulary segment classification tasks across indoor and outdoor datasets. Notably, our approach is the only one that consistently surpasses the original VLM on zero-shot queries. Our adapted VLMs can be plug-and-play integrated into existing open-vocabulary segmentation pipelines, improving OV-Seg by +6.0% mIoU on ADE20K, and OpenMask3D by +4.1% AP on ScanNet++ Offices without any changes to the methods. The paper introduces a new task called "domain adaptation for open-vocabulary segmentation," aiming to improve language-queried object segmentation. Current Vision Language Models (VLMs), while enabling open-vocabulary segmentation, lag behind domain-specific models in accuracy. This task is important for applications like robotics, where VLMs need to adapt to specific environments while retaining open-vocabulary understanding. The paper proposes OpenDAS, a method combining parameter-efficient prompt tuning with a triplet-loss-based training strategy to adapt CLIP-based models for better text and image crop matching. OpenDAS outperforms existing parameter-efficient adaptation strategies in open-vocabulary segment classification tasks. OpenDAS consistently surpasses the original VLM (CLIP) on zero-shot queries, unlike other methods. Integration of OpenDAS into existing OVS pipelines improves performance, as demonstrated by a +6.0% mIoU increase on ADE20K and +4.1% AP increase on ScanNet++ Offices. All evaluated methods rely on annotated ground-truth segmentation, which can be expensive to obtain. Prompt tuning, while efficient, shows limitations in generalizing to novel queries compared to robust fine-tuning. open-vocabulary segmentation, domain adaptation, prompt tuning, triplet loss, vision language models
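The triplet-loss component described above can be sketched as follows. This is a generic illustration under stated assumptions: the anchor is a segment-crop embedding, the positive and negative are class-text embeddings, and the margin value and variable names are not taken from the OpenDAS code.

```python
# Illustrative triplet objective for adapting a CLIP-style model: pull a crop
# embedding toward the text embedding of its true class and away from a
# confusable class. Margin and names are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def segment_triplet_loss(crop_emb, pos_text_emb, neg_text_emb, margin: float = 0.2):
    a = F.normalize(crop_emb, dim=-1)
    p = F.normalize(pos_text_emb, dim=-1)
    n = F.normalize(neg_text_emb, dim=-1)
    # Cosine distance = 1 - cosine similarity; positives must be closer than negatives by `margin`.
    d_pos = 1.0 - (a * p).sum(-1)
    d_neg = 1.0 - (a * n).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

loss = segment_triplet_loss(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```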
2405.20084 Report Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach Muhammad Saif Ullah Khan, Dhavalkumar Limbachiya, Didier Stricker, Muhammad Zeshan Afzal Human pose estimation is a key task in computer vision with various applications such as activity recognition and interactive systems. However, the lack of consistency in the annotated skeletons across different datasets poses challenges in developing universally applicable models. To address this challenge, we propose a novel approach integrating multi-teacher knowledge distillation with a unified skeleton representation. Our networks are jointly trained on the COCO and MPII datasets, containing 17 and 16 keypoints, respectively. We demonstrate enhanced adaptability by predicting an extended set of 21 keypoints, 4 (COCO) and 5 (MPII) more than original annotations, improving cross-dataset generalization. Our joint models achieved an average accuracy of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a single dataset and evaluated on both. Moreover, we also evaluate all 21 predicted points by our two models by reporting an AP of 66.84 and 72.75 on the Halpe dataset. This highlights the potential of our technique to address one of the most pressing challenges in pose estimation research and application - the inconsistency in skeletal annotations. This paper proposes a novel framework for unifying human pose estimation across different datasets by integrating multi-teacher knowledge distillation with a unified skeleton representation. This approach addresses the challenge of inconsistent annotated skeletons across datasets, limiting the development of universally applicable pose estimation models. The proposed method utilizes multi-teacher knowledge distillation to train a student network on a unified dataset combining the MPII and COCO datasets. The student network learns to predict a superset of 21 keypoints, encompassing all unique keypoints from both datasets, using a combination of conditional keypoint loss and distillation losses. The unified model demonstrates enhanced cross-dataset generalization compared to models trained on individual datasets. The model successfully predicts an extended set of 21 keypoints, including those not present in the original annotations of each dataset. Evaluation on the Halpe dataset confirms the model's ability to accurately predict the extended keypoint set. Potential performance disparities due to dataset size imbalance and hyperparameter settings. Future work includes exploring techniques to extend ground-truth annotations using the unified model and investigating active learning strategies. human pose estimation, knowledge distillation, cross-dataset learning, unified skeleton representation, keypoint detection
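A compact sketch of the multi-teacher distillation idea onto a unified 21-keypoint student is shown below. The keypoint index lists, loss weights, and the masked supervised term are placeholders; the paper's exact keypoint mapping and conditional keypoint loss are not reproduced here.

```python
# Hedged sketch: a student predicting 21 heatmap channels is supervised on the
# annotated keypoints and distilled from two teachers, each on the channels it covers.
import torch
import torch.nn.functional as F

COCO_IDX = list(range(17))      # assumed mapping: which of the 21 channels COCO covers
MPII_IDX = list(range(5, 21))   # assumed mapping: which of the 21 channels MPII covers

def multi_teacher_distill(student_hm, coco_teacher_hm, mpii_teacher_hm,
                          gt_hm, gt_mask, w_distill: float = 0.5):
    """student_hm: (B, 21, H, W); gt_mask: (B, 21) marks annotated keypoints."""
    sup = (F.mse_loss(student_hm, gt_hm, reduction="none").mean(dim=(2, 3)) * gt_mask).sum() / gt_mask.sum()
    d_coco = F.mse_loss(student_hm[:, COCO_IDX], coco_teacher_hm)
    d_mpii = F.mse_loss(student_hm[:, MPII_IDX], mpii_teacher_hm)
    return sup + w_distill * (d_coco + d_mpii)

B, H, W = 2, 64, 48
loss = multi_teacher_distill(torch.rand(B, 21, H, W), torch.rand(B, 17, H, W),
                             torch.rand(B, 16, H, W), torch.rand(B, 21, H, W),
                             torch.ones(B, 21))
```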
2405.20067 Report N-Dimensional Gaussians for Fitting of High Dimensional Functions Stavros Diolatzis, Tobias Zirr, Alexandr Kuznetsov, Georgios Kopanas, Anton Kaplanyan In the wake of many new ML-inspired approaches for reconstructing and representing high-quality 3D content, recent hybrid and explicitly learned representations exhibit promising performance and quality characteristics. However, their scaling to higher dimensions is challenging, e.g. when accounting for dynamic content with respect to additional parameters such as material properties, illumination, or time. In this paper, we tackle these challenges for an explicit representations based on Gaussian mixture models. With our solutions, we arrive at efficient fitting of compact N-dimensional Gaussian mixtures and enable efficient evaluation at render time: For fast fitting and evaluation, we introduce a high-dimensional culling scheme that efficiently bounds N-D Gaussians, inspired by Locality Sensitive Hashing. For adaptive refinement yet compact representation, we introduce a loss-adaptive density control scheme that incrementally guides the use of additional capacity towards missing details. With these tools we can for the first time represent complex appearance that depends on many input dimensions beyond position or viewing angle within a compact, explicit representation optimized in minutes and rendered in milliseconds. Presents a novel method for fitting and evaluating compact N-dimensional Gaussian Mixture Models (GMMs) to represent high-dimensional functions in computer graphics. Addresses the limitations of existing hybrid and explicit representations in scaling to higher dimensions for complex appearance modeling with many input parameters (e.g., material, lighting, time). Introduces an unconstrained N-dimensional adaptive Gaussian mixture representation. Employs a Locality Sensitive Hashing-inspired culling scheme for fast fitting and evaluation. Develops a loss-adaptive density control scheme for optimizer-controlled refinement. Achieves high-quality global illumination of synthetic scenes with variable lighting and materials in minutes. Successfully captures and reconstructs complex view-dependent effects in novel view synthesis. Outperforms implicit and hybrid neural rendering methods in quality and training time for scenes with high-dimensional anisotropy. Overfitting to sparse viewpoints in real-world captures remains a challenge. Exploring more compact/sparse parameterizations for higher-dimensional data could improve storage efficiency. gaussian mixture models, high-dimensional data, rendering, explicit representations, locality sensitive hashing
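For readers unfamiliar with the underlying primitive, the following NumPy sketch evaluates an N-dimensional Gaussian mixture at query points. It shows only the naive O(K) sum; the paper's culling scheme and loss-adaptive density control are not represented, and all shapes and values are illustrative.

```python
# Minimal sketch of evaluating an N-D Gaussian mixture at query points.
import numpy as np

def eval_gmm(x, means, covs, weights):
    """x: (M, N) query points; means: (K, N); covs: (K, N, N); weights: (K,)."""
    M, N = x.shape
    out = np.zeros(M)
    for mu, cov, w in zip(means, covs, weights):
        diff = x - mu                                   # (M, N)
        prec = np.linalg.inv(cov)
        maha = np.einsum("mi,ij,mj->m", diff, prec, diff)
        norm = np.sqrt(((2 * np.pi) ** N) * np.linalg.det(cov))
        out += w * np.exp(-0.5 * maha) / norm
    return out

K, N = 16, 6                                            # e.g. position plus time/material parameters
values = eval_gmm(np.random.randn(100, N),
                  np.random.randn(K, N),
                  np.stack([np.eye(N)] * K),
                  np.ones(K) / K)
```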
2405.20031 Report Structure Gaussian SLAM with Manhattan World Hypothesis Shuhong Liu, Heng Zhou, Liuzhuozheng Li, Yun Liu, Tianchen Deng, Yiming Zhou, Mingrui Li Gaussian SLAM systems have made significant advancements in improving the efficiency and fidelity of real-time reconstructions. However, these systems often encounter incomplete reconstructions in complex indoor environments, characterized by substantial holes due to unobserved geometry caused by obstacles or limited view angles. To address this challenge, we present Manhattan Gaussian SLAM (MG-SLAM), an RGB-D system that leverages the Manhattan World hypothesis to enhance geometric accuracy and completeness. By seamlessly integrating fused line segments derived from structured scenes, MG-SLAM ensures robust tracking in textureless indoor areas. Moreover, the extracted lines and planar surface assumption allow strategic interpolation of new Gaussians in regions of missing geometry, enabling efficient scene completion. Extensive experiments conducted on both synthetic and real-world scenes demonstrate that these advancements enable our method to achieve state-of-the-art performance, marking a substantial improvement in the capabilities of Gaussian SLAM systems. Presents MG-SLAM, a novel RGB-D Gaussian SLAM system that leverages the Manhattan World hypothesis for enhanced geometric accuracy and completeness in complex indoor environments. Gaussian SLAM systems often struggle with incomplete reconstructions in complex indoor environments due to unobserved geometry. This paper addresses this challenge by incorporating structural information. Integrates fused line segments for robust tracking in textureless areas and utilizes the Manhattan World assumption to interpolate new Gaussians in regions of missing geometry, enabling efficient scene completion. Achieves state-of-the-art tracking accuracy with up to 50% lower ATE compared to Gaussian baselines. Significantly improves scene completeness by effectively filling in gaps and holes in the reconstruction, particularly on structured surfaces like floors and ceilings. Provides high-fidelity reconstruction, achieving 5dB enhancement in PSNR on real-world scenes, surpassing existing Gaussian SLAM methods. Scene completion strategy primarily focuses on large structured surfaces and may not generalize well to complex objects. Future work includes exploring more sophisticated methods for interpolating unobserved geometry in complex indoor environments. slam, gaussian slam, manhattan world assumption, scene completion, line segment features
2405.19996 Report DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild Honghao Fu, Yufei Wang, Wenhan Yang, Bihan Wen Image quality assessment (IQA) plays a critical role in selecting high-quality images and guiding compression and enhancement methods in a series of applications. The blind IQA, which assesses the quality of in-the-wild images containing complex authentic distortions without reference images, poses greater challenges. Existing methods are limited to modeling a uniform distribution with local patches and are bothered by the gap between low and high-level visions (caused by widely adopted pre-trained classification networks). In this paper, we propose a novel IQA method called diffusion priors-based IQA (DP-IQA), which leverages the prior knowledge from the pre-trained diffusion model with its excellent powers to bridge semantic gaps in the perception of the visual quality of images. Specifically, we use pre-trained stable diffusion as the backbone, extract multi-level features from the denoising U-Net during the upsampling process at a specified timestep, and decode them to estimate the image quality score. The text and image adapters are adopted to mitigate the domain gap for downstream tasks and correct the information loss caused by the variational autoencoder bottleneck. Finally, we distill the knowledge in the above model into a CNN-based student model, significantly reducing the parameter to enhance applicability, with the student model performing similarly or even better than the teacher model surprisingly. Experimental results demonstrate that our DP-IQA achieves state-of-the-art results on various in-the-wild datasets with better generalization capability, which shows the superiority of our method in global modeling and utilizing the hierarchical feature clues of diffusion for evaluating image quality. This paper presents DP-IQA, a novel blind image quality assessment method that leverages diffusion model priors for evaluating in-the-wild images, addressing the limitations of patch-based methods and the lack of low-level priors in previous approaches. Blind image quality assessment (BIQA) for in-the-wild images is crucial for various applications but challenging due to the complex and diverse distortions in real-world images. DP-IQA utilizes a pre-trained stable diffusion model as its backbone, extracting multi-level features from the denoising U-Net. It incorporates text prompts, text adapters, and image adapters to enhance feature representation and mitigate domain gaps. Furthermore, a student model based on EfficientNet is trained via knowledge distillation to improve efficiency. DP-IQA achieves state-of-the-art performance on four in-the-wild IQA datasets (CLIVE, KonIQ, LIVEFB, SPAQ). The method exhibits superior generalization capability compared to existing BIQA models, as demonstrated by cross-dataset evaluations. Knowledge distillation from the DP-IQA teacher model to the EfficientNet-based student model effectively reduces parameters while maintaining competitive performance. The performance of DP-IQA may be limited for images with ambiguous scenes or objects due to the potential for insufficient training data. Further investigation is needed to understand the occasional significant deviations between student model predictions and teacher model predictions. image quality assessment, blind image quality assessment, diffusion models, knowledge distillation, in-the-wild images
2405.19957 Report PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting Qiaowei Miao, Yawei Luo, Yi Yang As text-conditioned diffusion models (DMs) achieve breakthroughs in image, video, and 3D generation, the research community's focus has shifted to the more challenging task of text-to-4D synthesis, which introduces a temporal dimension to generate dynamic 3D objects. In this context, we identify Score Distillation Sampling (SDS), a widely used technique for text-to-3D synthesis, as a significant hindrance to text-to-4D performance due to its Janus-faced and texture-unrealistic problems coupled with high computational costs. In this paper, we propose Pixel-Level Alignments for Text-to-4D Gaussian Splatting (PLA4D), a novel method that utilizes text-to-video frames as explicit pixel alignment targets to generate static 3D objects and inject motion into them. Specifically, we introduce Focal Alignment to calibrate camera poses for rendering and GS-Mesh Contrastive Learning to distill geometry priors from rendered image contrasts at the pixel level. Additionally, we develop Motion Alignment using a deformation network to drive changes in Gaussians and implement Reference Refinement for smooth 4D object surfaces. These techniques enable 4D Gaussian Splatting to align geometry, texture, and motion with generated videos at the pixel level. Compared to previous methods, PLA4D produces synthesized outputs with better texture details in less time and effectively mitigates the Janus-faced problem. PLA4D is fully implemented using open-source models, offering an accessible, user-friendly, and promising direction for 4D digital content creation. Our project page: https://github.com/MiaoQiaowei/PLA4D.github.io. This paper introduces PLA4D, a novel text-to-4D generation framework that utilizes text-to-video frames as explicit pixel alignment targets to overcome limitations of Score Distillation Sampling (SDS) in existing methods. Text-to-4D synthesis is a challenging task with significant potential in various applications, but existing methods suffer from issues like the Janus-face problem, unrealistic textures, and high computational costs due to reliance on SDS. PLA4D employs a three-stage pipeline: (1) text-to-video generation using an open-source model, (2) frame-to-3D generation via Focal Alignment and GS-Mesh Contrastive Learning for texture and geometry alignment, and (3) 3D-to-4D generation using Motion Alignment and Reference Refinement for injecting motion while preserving surface quality. PLA4D generates 4D objects with superior texture details, accurate geometry, and coherent motion compared to previous methods. The framework effectively mitigates the Janus-face problem by aligning geometry and texture with generated video frames at the pixel level. PLA4D achieves significantly faster generation times compared to SDS-based methods, reducing training time from hours to around ten minutes. The motion range of generated 4D objects is limited by the capabilities of current open-source text-to-video generation models. The image-to-mesh model used for initialization has limitations in reconstructing certain targets. text-to-4d synthesis, 3d gaussian splatting, pixel-level alignment, video generation, deformation network
2405.19931 Report Exploring Diffusion Models' Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks Xiaoyu Wu, Jiaru Zhang, Yang Hua, Bohan Lyu, Hao Wang, Tao Song, Haibing Guan Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement, significantly reducing training costs and enabling personalized AI applications. However, we explore the training dynamics of DMs and observe an unanticipated phenomenon: during the training process, image fidelity initially improves, then unexpectedly deteriorates with the emergence of noisy patterns, only to recover later with severe overfitting. We term the stage with generated noisy patterns as corruption stage. To understand this corruption stage, we begin by theoretically modeling the one-shot fine-tuning scenario, and then extend this modeling to more general cases. Through this modeling, we identify the primary cause of this corruption stage: a narrowed learning distribution inherent in the nature of few-shot fine-tuning. To tackle this, we apply Bayesian Neural Networks (BNNs) on DMs with variational inference to implicitly broaden the learned distribution, and present that the learning target of the BNNs can be naturally regarded as an expectation of the diffusion loss and a further regularization with the pretrained DMs. This approach is highly compatible with current few-shot fine-tuning methods in DMs and does not introduce any extra inference costs. Experimental results demonstrate that our method significantly mitigates corruption, and improves the fidelity, quality and diversity of the generated images in both object-driven and subject-driven generation tasks. This paper identifies and addresses the "corruption stage" phenomenon in few-shot fine-tuning of Diffusion Models (DMs), where image fidelity deteriorates due to noisy patterns during training. Few-shot fine-tuning is crucial for personalized AI applications, but the corruption stage hinders its effectiveness. This research improves the quality and diversity of generated images in such settings. The paper theoretically models the fine-tuning process, revealing that a narrowed learning distribution causes the corruption. It proposes using Bayesian Neural Networks (BNNs) to implicitly broaden this distribution, enhancing model robustness. BNNs significantly mitigate the corruption stage, improving image fidelity and quality as measured by various metrics. The method enhances generation diversity due to the inherent randomness of BNNs. Applying BNNs generalizes well across different DM architectures, training iterations, and numbers of training images. The added randomness from BNNs might slow down the fine-tuning process. Learning intricate image details could be slightly hampered with limited training iterations. Future work could explore mitigating these limitations. diffusion models, few-shot fine-tuning, bayesian neural networks, image generation, corruption stage
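The core of the proposed fix, as described above, is a variational posterior over weights whose prior is centred on the pretrained model, so training minimizes an expected diffusion loss plus a KL regularizer. The sketch below shows that idea for a single linear layer; the layer choice, prior scale, and KL weight are assumptions, not the paper's implementation.

```python
# Hedged sketch: a Bayesian linear layer with a Gaussian posterior over weights
# and a prior centred on the pretrained weights; the objective becomes
# E[diffusion loss] + beta * KL(posterior || prior).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    def __init__(self, pretrained_weight: torch.Tensor, prior_std: float = 0.05):
        super().__init__()
        self.mu = nn.Parameter(pretrained_weight.clone())
        self.log_sigma = nn.Parameter(torch.full_like(pretrained_weight, -5.0))
        self.register_buffer("prior_mu", pretrained_weight.clone())
        self.prior_std = prior_std

    def forward(self, x):
        sigma = self.log_sigma.exp()
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterization trick
        return F.linear(x, w)

    def kl(self):
        # KL between N(mu, sigma^2) and N(prior_mu, prior_std^2), summed over weights.
        sigma = self.log_sigma.exp()
        return (torch.log(self.prior_std / sigma)
                + (sigma ** 2 + (self.mu - self.prior_mu) ** 2) / (2 * self.prior_std ** 2)
                - 0.5).sum()

layer = BayesianLinear(torch.randn(64, 64))
diffusion_loss = F.mse_loss(layer(torch.randn(8, 64)), torch.randn(8, 64))  # stand-in for the noise-prediction loss
loss = diffusion_loss + 1e-4 * layer.kl()               # the beta weighting is illustrative
```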
2405.19899 Report Open-Set Domain Adaptation for Semantic Segmentation Seun-An Choe, Ah-Hyung Shin, Keon-Hee Park, Jinwoo Choi, Gyeong-Moon Park Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer the pixel-wise knowledge from the labeled source domain to the unlabeled target domain. However, current UDA methods typically assume a shared label space between source and target, limiting their applicability in real-world scenarios where novel categories may emerge in the target domain. In this paper, we introduce Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS) for the first time, where the target domain includes unknown classes. We identify two major problems in the OSDA-SS scenario as follows: 1) the existing UDA methods struggle to predict the exact boundary of the unknown classes, and 2) they fail to accurately predict the shape of the unknown classes. To address these issues, we propose Boundary and Unknown Shape-Aware open-set domain adaptation, coined BUS. Our BUS can accurately discern the boundaries between known and unknown classes in a contrastive manner using a novel dilation-erosion-based contrastive loss. In addition, we propose OpenReMix, a new domain mixing augmentation method that guides our model to effectively learn domain and size-invariant features for improving the shape detection of the known and unknown classes. Through extensive experiments, we demonstrate that our proposed BUS effectively detects unknown classes in the challenging OSDA-SS scenario compared to the previous methods by a large margin. The code is available at https://github.com/KHU-AGI/BUS. This paper introduces Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS), addressing the problem of adapting models to target domains with unknown classes. Current UDA methods assume shared label spaces, limiting real-world applicability where novel categories can emerge in the target domain. The paper proposes BUS, a novel method that utilizes a Dilation-Erosion-based Contrastive (DECON) loss to improve boundary prediction and OpenReMix, a domain mixing augmentation for size-invariant feature learning. BUS significantly outperforms previous UDA and OSDA methods on benchmark datasets (GTA5→Cityscapes and SYNTHIA→Cityscapes). DECON loss effectively distinguishes between known and unknown classes at boundaries, improving private class IoU by ~40.79%. OpenReMix enhances shape prediction for both known and unknown classes, boosting common class mIoU by ~8.65%. Performance reliance on pseudo-labeling, potentially leading to degradation if model calibration is poor. Future work can explore alternative approaches beyond pseudo-labeling to enhance robustness. unsupervised domain adaptation, semantic segmentation, open-set learning, domain mixing augmentation, contrastive learning
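The dilation-erosion idea behind the DECON loss can be illustrated with the mask manipulation alone: dilating and eroding a binary unknown-class mask isolates a band around the known/unknown boundary, from which contrastive samples could then be drawn. This sketch shows only that band extraction, under assumed kernel size and tensor shapes, not the paper's full loss.

```python
# Hedged sketch: extract a boundary band from a binary mask via dilation minus erosion.
import torch
import torch.nn.functional as F

def boundary_band(mask: torch.Tensor, k: int = 5) -> torch.Tensor:
    """mask: (B, 1, H, W) binary float mask. Returns the dilated-minus-eroded band."""
    pad = k // 2
    dilated = F.max_pool2d(mask, k, stride=1, padding=pad)            # morphological dilation
    eroded = 1.0 - F.max_pool2d(1.0 - mask, k, stride=1, padding=pad)  # morphological erosion
    return (dilated - eroded).clamp(0, 1)

band = boundary_band((torch.rand(1, 1, 64, 64) > 0.5).float())
```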
2405.19876 Report IReNe: Instant Recoloring in Neural Radiance Fields Alessio Mazzucchelli, Adrian Garcia-Garcia, Elena Garces, Fernando Rivas-Manzaneque, Francesc Moreno-Noguer, Adrian Penate-Sanchez Advances in NeRFs have allowed for 3D scene reconstructions and novel view synthesis. Yet, efficiently editing these representations while retaining photorealism is an emerging challenge. Recent methods face three primary limitations: they're slow for interactive use, lack precision at object boundaries, and struggle to ensure multi-view consistency. We introduce IReNe to address these limitations, enabling swift, near real-time color editing in NeRF. Leveraging a pre-trained NeRF model and a single training image with user-applied color edits, IReNe swiftly adjusts network parameters in seconds. This adjustment allows the model to generate new scene views, accurately representing the color changes from the training image while also controlling object boundaries and view-specific effects. Object boundary control is achieved by integrating a trainable segmentation module into the model. The process gains efficiency by retraining only the weights of the last network layer. We observed that neurons in this layer can be classified into those responsible for view-dependent appearance and those contributing to diffuse appearance. We introduce an automated classification approach to identify these neuron types and exclusively fine-tune the weights of the diffuse neurons. This further accelerates training and ensures consistent color edits across different views. A thorough validation on a new dataset, with edited object colors, shows significant quantitative and qualitative advancements over competitors, accelerating speeds by 5x to 500x. IReNe presents a novel approach for near real-time color editing of pre-trained NeRFs using a single user-edited image. Existing NeRF color editing techniques are slow, lack precision at object boundaries, and struggle to ensure multi-view consistency, limiting their practical application. IReNe achieves fast editing by: 1) Integrating a trainable segmentation module for object boundary control. 2) Selectively fine-tuning only the last layer of the color MLP. 3) Automatically classifying and exclusively fine-tuning diffuse appearance neurons. Significantly faster editing compared to state-of-the-art methods (5 seconds vs. 1 minute to 2 hours). Improved accuracy, particularly at object boundaries, reducing color bleeding. Enhanced multi-view consistency, ensuring uniform color edits across different viewpoints. Reliance on external editing tools like Photoshop for complete editing. Occasional suboptimal performance of the soft segmentation model. Future work: Explore in-built editing tools and address indirect illumination from edited objects. nerf, color editing, 3d scene editing, neural rendering, interactive editing
2405.19854 Report RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection Fangyi Chen, Han Zhang, Zhantao Yang, Hao Chen, Kai Hu, Marios Savvides Open-vocabulary object detection (OVD) requires solid modeling of the region-semantic relationship, which could be learned from massive region-text pairs. However, such data is limited in practice due to significant annotation costs. In this work, we propose RTGen to generate scalable open-vocabulary region-text pairs and demonstrate its capability to boost the performance of open-vocabulary object detection. RTGen includes both text-to-region and region-to-text generation processes on scalable image-caption data. The text-to-region generation is powered by image inpainting, directed by our proposed scene-aware inpainting guider for overall layout harmony. For region-to-text generation, we perform multiple region-level image captioning with various prompts and select the best matching text according to CLIP similarity. To facilitate detection training on region-text pairs, we also introduce a localization-aware region-text contrastive loss that learns object proposals tailored with different localization qualities. Extensive experiments demonstrate that our RTGen can serve as a scalable, semantically rich, and effective source for open-vocabulary object detection and continue to improve the model performance when more data is utilized, delivering superior performance compared to the existing state-of-the-art methods. This paper proposes RTGen, a novel framework for generating open-vocabulary region-text pairs from image-caption pairs, to enhance open-vocabulary object detection (OVD). Region-text pairs are crucial for training OVD models but are limited and expensive to annotate. RTGen addresses this by providing a scalable method for generating these pairs. RTGen employs two processes: 1) Text-to-region generation using a novel Scene-Aware Inpainting Guider (SAIG) and an inpainting model. 2) Region-to-text generation using a captioning model and CLIP similarity for selection. It further introduces a Localization-Aware Region-Text Contrastive Loss (LART) for effective OVD training. RTGen effectively boosts OVD performance, achieving state-of-the-art results on OV-COCO and OV-LVIS benchmarks. The generated region-text pairs demonstrate scalability, with performance consistently improving as more data is used. SAIG effectively allocates phrases and boxes for inpainting, leading to higher-quality generated data compared to random allocation or grounding methods. The current generation pipeline relies on multiple models and processes, which can be computationally intensive. Future work could explore improving the efficiency of the generation process and applying RTGen to other open-vocabulary tasks. open-vocabulary object detection, region-text generation, scene-aware inpainting, contrastive learning, image captioning
2405.19783 Report Instruction-Guided Visual Masking Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan Instruction following is crucial in contemporary LLM. However, when extended to multimodal setting, it often suffers from misalignment between specific textual instruction and targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMM and robot model. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create an IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL) for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code is available at https://github.com/2toinf/IVM. This paper proposes Instruction-guided Visual Masking (IVM), a plug-and-play visual grounding model to enhance multimodal instruction following by masking out instruction-irrelevant image regions. Existing large multimodal models often struggle to accurately localize targeted image regions relevant to specific textual instructions, leading to misinterpretations and hallucinations. The authors create the IVM-Mix-1M dataset with 1 million image-instruction pairs using an LLM-empowered Mixture of Expert pipeline and manual annotations. They then train the IVM model using a Discriminator-Weighted Supervised Learning (DWSL) framework to prioritize high-quality data samples. IVM significantly improves the performance of both commercial chatbots (e.g., GPT4-V) and open-sourced LMMs (e.g., LLaVA) on challenging multimodal benchmarks like V*Bench, EgoThink, and POPE. IVM-enhanced models outperform baselines on referring expression comprehension benchmarks, demonstrating strong visual grounding capabilities. Real-world robotic control experiments show that IVM enhances the robustness and generalization of language-conditioned behavior cloning agents in the presence of distractions. IVM introduces additional parameters and computational overhead compared to end-to-end training methods focused solely on downstream tasks. The quality of the IVM model depends on the accuracy of the generated labels and the effectiveness of the DWSL framework in filtering out inaccuracies. visual grounding, multimodal instruction following, large multimodal models, robotic control, discriminator-weighted supervised learning
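Discriminator-Weighted Supervised Learning, as described above, re-weights per-sample supervised losses by a discriminator's estimate of label quality so that cleaner samples dominate training. The sketch below illustrates that weighting for a mask-prediction loss; the normalization scheme and tensor shapes are assumptions and may differ from the paper's exact formulation.

```python
# Hedged sketch of a discriminator-weighted supervised loss (DWSL-style weighting).
import torch
import torch.nn.functional as F

def dwsl_loss(pred_mask, target_mask, disc_score):
    """pred_mask/target_mask: (B, 1, H, W); disc_score: (B,) quality estimates in [0, 1]."""
    per_sample = F.binary_cross_entropy_with_logits(
        pred_mask, target_mask, reduction="none").mean(dim=(1, 2, 3))
    weights = disc_score / (disc_score.sum() + 1e-8)     # normalize weights over the batch
    return (weights * per_sample).sum()

loss = dwsl_loss(torch.randn(4, 1, 32, 32), torch.rand(4, 1, 32, 32).round(), torch.rand(4))
```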
2405.19751 Report HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization Wenxuan Liu, Sai Qian Zhang Diffusion Transformers (DiTs) have recently gained substantial attention in both industrial and academic fields for their superior visual generation capabilities, outperforming traditional diffusion models that use U-Net. However, the enhanced performance of DiTs also comes with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones. To address these challenges, we introduce the Hybrid Floating-point Quantization for DiT (HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference. Compared to fixed-point quantization (e.g., INT8), FP quantization, complemented by our proposed clipping range selection mechanism, naturally aligns with the data distribution within DiT, resulting in a minimal quantization error. Furthermore, HQ-DiT also implements a universal identity mathematical transform to mitigate the serious quantization error caused by the outliers. The experimental results demonstrate that DiT can achieve extremely low-precision quantization (i.e., 4 bits) with negligible impact on performance. Our approach marks the first instance where both weights and activations in DiTs are quantized to just 4 bits, with only a 0.12 increase in sFID on ImageNet. This paper introduces HQ-DiT, an efficient post-training quantization method using 4-bit floating-point precision for both weights and activations in Diffusion Transformers (DiTs), enabling deployment on resource-limited devices. DiTs offer superior visual generation but are computationally expensive, hindering their deployment on devices like mobile phones. Model quantization is crucial to reduce these computational demands. The authors study data distribution in DiTs and employ random Hadamard transforms to mitigate outlier impact on quantization. They propose a method to select optimal floating-point formats based on data distribution and utilize GPTQ for weight quantization and MinMax for activation. HQ-DiT achieves comparable performance to full-precision models with 4-bit quantization. The method outperforms other quantization approaches like SmoothQuant and FPQ, especially at lower bitwidths (4-bit). HQ-DiT enables a 5.09x speedup and 2.13x memory saving compared to the full-precision model. The paper primarily focuses on image generation and doesn't explore other DiT applications. Evaluation is limited to ImageNet; further validation on diverse datasets is needed. diffusion models, model quantization, floating-point quantization, diffusion transformers, post-training quantization
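Two ingredients mentioned above can be illustrated in a toy NumPy sketch: a randomized Hadamard transform that spreads outlier energy across channels, followed by nearest-value quantization onto a small FP4-like grid. The E2M1 magnitude grid, random sign flips, and per-tensor scaling shown here are assumptions; this is not the HQ-DiT implementation.

```python
# Hedged toy sketch: randomized Hadamard rotation + FP4-style grid quantization.
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction; n must be a power of two. Returns an orthonormal matrix."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # assumed E2M1 magnitudes

def quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Snap each value to the nearest grid magnitude after per-tensor scaling."""
    scale = np.abs(x).max() / FP4_GRID[-1] + 1e-12
    mags = np.abs(x) / scale
    snapped = FP4_GRID[np.argmin(np.abs(mags[..., None] - FP4_GRID), axis=-1)]
    return np.sign(x) * snapped * scale

n = 64
signs = np.sign(np.random.randn(n))                     # random column sign flips
H = hadamard(n) * signs                                 # randomized Hadamard (still orthonormal)
x = np.random.randn(8, n) * np.array([10.0] + [1.0] * (n - 1))  # one outlier channel
x_rot = x @ H                                           # outlier energy is spread across channels
x_deq = quantize_fp4(x_rot) @ H.T                       # quantize, then rotate back
```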
2405.19745 Report GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis Boming Zhao, Yuan Li, Ziyu Sun, Lin Zeng, Yujun Shen, Rui Ma, Yinda Zhang, Hujun Bao, Zhaopeng Cui Forecasting future scenarios in dynamic environments is essential for intelligent decision-making and navigation, a challenge yet to be fully realized in computer vision and robotics. Traditional approaches like video prediction and novel-view synthesis either lack the ability to forecast from arbitrary viewpoints or to predict temporal dynamics. In this paper, we introduce GaussianPrediction, a novel framework that empowers 3D Gaussian representations with dynamic scene modeling and future scenario synthesis in dynamic environments. GaussianPrediction can forecast future states from any viewpoint, using video observations of dynamic scenes. To this end, we first propose a 3D Gaussian canonical space with deformation modeling to capture the appearance and geometry of dynamic scenes, and integrate the lifecycle property into Gaussians for irreversible deformations. To make the prediction feasible and efficient, a concentric motion distillation approach is developed by distilling the scene motion with key points. Finally, a Graph Convolutional Network is employed to predict the motions of key points, enabling the rendering of photorealistic images of future scenarios. Our framework shows outstanding performance on both synthetic and real-world datasets, demonstrating its efficacy in predicting and rendering future environments. This paper introduces GaussianPrediction, a novel framework that leverages 3D Gaussian representations for modeling dynamic scenes and synthesizing future scenarios from arbitrary viewpoints, using video observations. Predicting future scenarios in dynamic environments, including dense motion forecasting and visualization from any viewpoint, is crucial for intelligent systems in computer vision and robotics. The framework builds a 3D Gaussian canonical space with deformation modeling and lifecycle properties to capture scene dynamics and irreversible deformations. It employs concentric motion distillation with key points to efficiently predict scene motion using a Graph Convolutional Network (GCN). Finally, it renders photorealistic images of future scenarios from novel viewpoints. GaussianPrediction outperforms existing NeRF-based and Gaussian-based methods in novel view synthesis of dynamic scenes. It demonstrates superior performance in short-term future scenario synthesis, showcasing more realistic and coherent predictions. The framework effectively handles complex motions and irreversible deformations, such as cutting or splitting objects. The model's reliance on input observations for motion prediction without pre-training limits its capacity for long-term forecasting. Inaccuracies in camera poses and timestamps in real-world datasets pose challenges for quantitative evaluation of prediction results. novel view synthesis, dynamic scene modeling, motion prediction, 3d gaussian representations, graph convolutional network
2405.19726 Report Streaming Video Diffusion: Online Video Editing with Diffusion Models Feng Chen, Zhen Yang, Bohan Zhuang, Qi Wu We present a novel task called online video editing, which is designed to edit streaming frames while maintaining temporal consistency. Unlike existing offline video editing assuming all frames are pre-established and accessible, online video editing is tailored to real-life applications such as live streaming and online chat, requiring (1) fast continual step inference, (2) long-term temporal modeling, and (3) zero-shot video editing capability. To solve these issues, we propose Streaming Video Diffusion (SVDiff), which incorporates the compact spatial-aware temporal recurrence into off-the-shelf Stable Diffusion and is trained with the segment-level scheme on large-scale long videos. This simple yet effective setup allows us to obtain a single model that is capable of executing a broad range of videos and editing each streaming frame with temporal coherence. Our experiments indicate that our model can edit long, high-quality videos with remarkable results, achieving a real-time inference speed of 15.2 FPS at a resolution of 512x512. This paper proposes Streaming Video Diffusion (SVDiff), an online video editing method that edits streaming video frames with temporal consistency using a novel compact spatial-aware temporal memory. Online video editing, crucial for live streaming and online chat, demands real-time processing of video frames while maintaining temporal coherence. Existing offline editing methods are ill-suited for this task due to their reliance on pre-established frames and limitations in handling long video sequences. SVDiff integrates a compact spatial-aware temporal memory into Stable Diffusion. This memory is recursively updated with each incoming frame to capture both spatial and temporal information. The model is trained on long videos by splitting them into short segments while propagating the temporal memory between them. SVDiff generates high-quality, long videos with strong adherence to edit prompts while preserving temporal consistency. It outperforms baseline models adapted for online video editing in terms of both qualitative results and quantitative metrics (CLIP, user study). The method achieves real-time inference speed (15.2 FPS at 512x512 resolution) due to its efficient memory usage and recurrent design. The current implementation might struggle to accurately detect shot changes in videos exceeding 2 minutes due to training-inference discrepancies. Future work will focus on mitigating the influence of training-inference gap to better handle long videos with complex scene transitions. video editing, streaming processing, diffusion models, temporal consistency, real-time
2405.19712 Report HINT: Learning Complete Human Neural Representations from Limited Viewpoints Alessandro Sanvito, Andrea Ramazzina, Stefanie Walz, Mario Bijelic, Felix Heide No augmented application is possible without animated humanoid avatars. At the same time, generating human replicas from real-world monocular hand-held or robotic sensor setups is challenging due to the limited availability of views. Previous work showed the feasibility of virtual avatars but required the presence of 360 degree views of the targeted subject. To address this issue, we propose HINT, a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles. We achieve this by introducing a symmetry prior, regularization constraints, and training cues from large human datasets. In particular, we introduce a sagittal plane symmetry prior to the appearance of the human, directly supervise the density function of the human model using explicit 3D body modeling, and leverage a co-learned human digitization network as additional supervision for the unseen angles. As a result, our method can reconstruct complete humans even from a few viewing angles, increasing performance by more than 15% PSNR compared to previous state-of-the-art algorithms. HINT, a NeRF-based algorithm that reconstructs a complete, animatable human model from limited viewing angles using symmetry priors, regularization constraints, and cues from large human datasets. Crucial for generating realistic human avatars in augmented applications and for creating counterfactual examples in robotics and autonomous navigation, especially when data from limited viewpoints is common. Combines a NeRF-based background model with an SDF-based human model. Leverages symmetry constraints, direct SDF supervision using a 3D body model, and a co-trained human digitization network to infer information for occluded areas. Reconstructs complete humans even from sparse viewpoints, enabling novel view synthesis and pose generation. Outperforms state-of-the-art methods by more than 15% PSNR and 34% LPIPS. Demonstrates the effectiveness of direct SDF supervision over Eikonal loss in limited viewpoint scenarios. Relies on pre-trained models (SMPL, segmentation, depth estimation) which might impact performance. Limited evaluation on highly dynamic scenes with complex occlusions. human modeling, neural radiance fields, nerf, view synthesis, data augmentation
2405.19708 Report Text Guided Image Editing with Automatic Concept Locating and Forgetting Jia Li, Lijie Hu, Zhixian He, Jingfeng Zhang, Tianhang Zheng, Di Wang With the advancement of image-to-image diffusion models guided by text, significant progress has been made in image editing. However, a persistent challenge remains in seamlessly incorporating objects into images based on textual instructions, without relying on extra user-provided guidance. Text and images are inherently distinct modalities, bringing out difficulties in fully capturing the semantic intent conveyed through language and accurately translating that into the desired visual modifications. Therefore, text-guided image editing models often produce generations with residual object attributes that do not fully align with human expectations. To address this challenge, the models should comprehend the image content effectively away from a disconnect between the provided textual editing prompts and the actual modifications made to the image. In our paper, we propose a novel method called Locate and Forget (LaF), which effectively locates potential target concepts in the image for modification by comparing the syntactic trees of the target prompt and scene descriptions in the input image, intending to forget their existence clues in the generated image. Compared to the baselines, our method demonstrates its superiority in text-guided image editing tasks both qualitatively and quantitatively. This paper presents Locate and Forget (LaF), a novel method for improving text-guided image editing in diffusion models by addressing the challenge of accurately locating and modifying specific concepts within complex image scenes based on textual instructions. Existing text-guided image editing models often struggle to accurately align textual instructions with visual modifications, leading to edits that may not fully reflect user intent. LaF aims to overcome this limitation by leveraging scene descriptions to precisely locate target concepts and guide the diffusion model to selectively forget those concepts during the denoising process. LaF employs a two-step process: 1) Concept Location: The method generates a scene description of the input image and compares its syntactic tree to the input text prompt to identify the specific concepts targeted for editing. 2) Concept Forgetting: During the denoising steps of the diffusion process, LaF utilizes negative guidance based on the identified concepts, enabling the model to selectively forget or remove those concepts from the generated image. LaF demonstrates superior performance in aligning generated images with textual instructions, as evidenced by higher CLIP-T scores compared to baseline methods. The method exhibits a good balance between editing fidelity and visual quality, achieving competitive Inception Scores while effectively modifying target concepts. Human preference studies confirm that LaF produces more desirable editing outcomes, with users rating it higher in terms of alignment, fidelity, consistency, and overall preference. One limitation is the difficulty in precisely controlling numerical attributes, such as object counts or sizes, during the editing process. Further research is needed to extend LaF's capabilities to handle more complex editing scenarios, such as multi-object interactions or edits requiring nuanced spatial reasoning. text-guided image editing, diffusion models, concept forgetting, scene understanding, multi-modal learning
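The "forgetting" step described above builds on negative guidance during denoising: the noise prediction is pushed toward the edit prompt and away from the located concept. The sketch below shows a generic negative-guidance combination; the guidance scales and exact combination rule are assumptions and may differ from the paper's formulation.

```python
# Hedged sketch of a negative-guidance step for concept forgetting in a diffusion model.
import torch

def guided_noise(eps_uncond, eps_edit, eps_concept, s_edit=7.5, s_forget=3.0):
    """All inputs are U-Net noise predictions for the same latent and timestep."""
    return (eps_uncond
            + s_edit * (eps_edit - eps_uncond)        # follow the editing prompt
            - s_forget * (eps_concept - eps_uncond))  # push away from the located concept

eps = guided_noise(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
```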
2405.19707 Report DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, Huaxiong Li Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of distinguishing between fake AI-generated videos and mitigating the potential harm caused by fake information. However, the lack of large-scale datasets from the most advanced video generators poses a barrier to the development of such detectors. To address this gap, we introduce the first AI-generated video detection dataset, GenVideo. It features the following characteristics: (1) a large volume of videos, including over one million AI-generated and real videos collected; (2) a rich diversity of generated content and methodologies, covering a broad spectrum of video categories and generation techniques. We conducted extensive studies of the dataset and proposed two evaluation methods tailored for real-world-like scenarios to assess the detectors' performance: the cross-generator video classification task assesses the generalizability of trained detectors on generators; the degraded video classification task evaluates the robustness of detectors to handle videos that have degraded in quality during dissemination. Moreover, we introduced a plug-and-play module, named Detail Mamba (DeMamba), designed to enhance the detectors by identifying AI-generated videos through the analysis of inconsistencies in temporal and spatial dimensions. Our extensive experiments demonstrate DeMamba's superior generalizability and robustness on GenVideo compared to existing detectors. We believe that the GenVideo dataset and the DeMamba module will significantly advance the field of AI-generated video detection. Our code and dataset will be available at https://github.com/chenhaoxing/DeMamba. This paper introduces GenVideo, the first large-scale dataset for AI-generated video detection, featuring over a million videos and diverse generation techniques. It also proposes DeMamba, a plug-and-play module that enhances video detectors by identifying spatial-temporal inconsistencies. The rapid advancement of video generation techniques raises concerns about the spread of misinformation. Existing datasets lack the scale and diversity to train robust detectors, hindering efforts to mitigate potential harm. GenVideo is built by collecting real videos from established datasets and generating fake videos using various state-of-the-art techniques. DeMamba leverages a structured state-space model to analyze local inconsistencies across video frames. DeMamba significantly improves the performance of existing detectors in cross-generator generalization tasks. The proposed method exhibits strong robustness against video degradation like compression and watermarking. Ablation studies confirm the importance of DeMamba's components, zone size, and scanning order. The training efficiency of DeMamba remains suboptimal, requiring further exploration for lightweight design. Future work includes expanding GenVideo with more diverse and challenging AI-generated video content. ai-generated video detection, misinformation detection, video generation, dataset, spatial-temporal inconsistency
2405.19671 Report GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene Reconstruction Haodong Xiang, Xinghui Li, Xiansong Lai, Wanting Zhang, Zhichao Liao, Kai Cheng, Xueping Liu Recently, 3D Gaussian Splatting(3DGS) has revolutionized neural rendering with its high-quality rendering and real-time speed. However, when it comes to indoor scenes with a significant number of textureless areas, 3DGS yields incomplete and noisy reconstruction results due to the poor initialization of the point cloud and under-constrained optimization. Inspired by the continuity of signed distance field (SDF), which naturally has advantages in modeling surfaces, we present a unified optimizing framework integrating neural SDF with 3DGS. This framework incorporates a learnable neural SDF field to guide the densification and pruning of Gaussians, enabling Gaussians to accurately model scenes even with poor initialized point clouds. At the same time, the geometry represented by Gaussians improves the efficiency of the SDF field by piloting its point sampling. Additionally, we regularize the optimization with normal and edge priors to eliminate geometry ambiguity in textureless areas and improve the details. Extensive experiments in ScanNet and ScanNet++ show that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. This paper introduces GaussianRoom, a novel 3D reconstruction framework that integrates neural Signed Distance Fields (SDF) with 3D Gaussian Splatting (3DGS) to enhance the reconstruction of indoor scenes, particularly in textureless areas. Existing methods like 3DGS struggle with incomplete reconstructions in indoor scenes with vast textureless regions, while SDF-based methods, though accurate, are computationally expensive. GaussianRoom addresses these limitations by leveraging the strengths of both approaches. The framework employs a mutually beneficial learning strategy: SDF guides the distribution of Gaussian primitives to align with the scene surface, while 3DGS aids in efficient point sampling for the SDF. Additionally, it incorporates monocular normal priors and edge priors to improve geometry reconstruction in textureless areas and enhance detail rendering. GaussianRoom outperforms state-of-the-art methods in geometry reconstruction metrics like accuracy, completion, and F-score on both ScanNet and ScanNet++ datasets. The method exhibits superior rendering quality compared to existing Gaussian-based methods, evident from improvements in SSIM, PSNR, and LPIPS metrics. Ablation studies confirm the effectiveness of each individual module, particularly the SDF guidance for Gaussian distribution, Gaussian-guided sampling for SDF, and the use of normal and edge priors. The neural SDF optimization, although more efficient than some NeRF-based methods, is still computationally more demanding than 3DGS, presenting a bottleneck for training time. Future work could focus on improving the efficiency of MLP-based neural SDF to accelerate the overall training process. 3d reconstruction, 3d gaussian splatting, neural signed distance fields, indoor scenes, textureless areas
2405.19657 Report Uncertainty-guided Optimal Transport in Depth Supervised Sparse-View 3D Gaussian Wei Sun, Qi Zhang, Yanzhao Zhou, Qixiang Ye, Jianbin Jiao, Yuan Li 3D Gaussian splatting has demonstrated impressive performance in real-time novel view synthesis. However, achieving successful reconstruction from RGB images generally requires multiple input views captured under static conditions. To address the challenge of sparse input views, previous approaches have incorporated depth supervision into the training of 3D Gaussians to mitigate overfitting, using dense predictions from pretrained depth networks as pseudo-ground truth. Nevertheless, depth predictions from monocular depth estimation models inherently exhibit significant uncertainty in specific areas. Relying solely on pixel-wise L2 loss may inadvertently incorporate detrimental noise from these uncertain areas. In this work, we introduce a novel method to supervise the depth distribution of 3D Gaussians, utilizing depth priors with integrated uncertainty estimates. To address these localized errors in depth predictions, we integrate a patch-wise optimal transport strategy to complement traditional L2 loss in depth supervision. Extensive experiments conducted on the LLFF, DTU, and Blender datasets demonstrate that our approach, UGOT, achieves superior novel view synthesis and consistently outperforms state-of-the-art methods. Introduces UGOT, an Uncertainty-guided Optimal Transport approach for depth supervision in sparse-view 3D Gaussian splatting for novel view synthesis. Addresses the challenge of overfitting and geometric inaccuracies in 3D Gaussian splatting with sparse input views, particularly focusing on the limitations of traditional pixel-wise depth supervision. Leverages depth priors with integrated uncertainty estimates from generative diffusion models to guide depth optimization and employs a patch-wise optimal transport strategy to align the depth distribution of Gaussian splats with the depth prior. Achieves state-of-the-art results on LLFF, DTU, and Blender datasets, demonstrating superior novel view synthesis quality compared to existing methods. Effectively mitigates the impact of noisy or uncertain depth estimations, leading to more accurate and robust 3D scene reconstruction from sparse views. Maintains the real-time rendering capabilities of 3D Gaussian splatting while significantly improving the quality of reconstruction in sparse-view scenarios. Limited performance improvement in reconstructing untextured backgrounds and voids due to inherent limitations of 3D Gaussian splatting. Reliance on pre-trained monocular depth estimation models, which may introduce biases or inaccuracies depending on the training data and domain. novel view synthesis, 3d gaussian splatting, depth supervision, optimal transport, uncertainty estimation
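The patch-wise optimal-transport idea described above has a convenient 1-D special case: for equally weighted samples, the Wasserstein distance between rendered depths and the monocular depth prior inside a patch reduces to comparing the two sorted value lists. The sketch below uses that simplification with an assumed uncertainty weighting; it is not the UGOT implementation.

```python
# Hedged sketch of a patch-wise 1-D optimal-transport depth loss with uncertainty weighting.
import torch

def patch_ot_depth_loss(rendered, prior, uncertainty, patch: int = 16):
    """All inputs: (B, H, W), with H and W divisible by `patch`; lower uncertainty -> larger weight."""
    B, H, W = rendered.shape

    def unfold(d):
        blocks = d.reshape(B, H // patch, patch, W // patch, patch)
        return blocks.permute(0, 1, 3, 2, 4).reshape(B, -1, patch * patch)

    r, p, u = unfold(rendered), unfold(prior), unfold(uncertainty)
    r_sorted, _ = r.sort(dim=-1)                         # 1-D OT = match sorted values
    p_sorted, _ = p.sort(dim=-1)
    w = 1.0 / (1.0 + u.mean(dim=-1, keepdim=True))       # down-weight uncertain patches
    return (w * (r_sorted - p_sorted).abs()).mean()

loss = patch_ot_depth_loss(torch.rand(1, 64, 64), torch.rand(1, 64, 64), torch.rand(1, 64, 64))
```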
2405.19614 Report TAMBRIDGE: Bridging Frame-Centered Tracking and 3D Gaussian Splatting for Enhanced SLAM Peifeng Jiang, Hong Liu, Xia Li, Ti Wang, Fabian Zhang, Joachim M. Buhmann The limited robustness of 3D Gaussian Splatting (3DGS) to motion blur and camera noise, along with its poor real-time performance, restricts its application in robotic SLAM tasks. Upon analysis, the primary causes of these issues are the density of views with motion blur and the cumulative errors in dense pose estimation from calculating losses based on noisy original images and rendering results, which increase the difficulty of 3DGS rendering convergence. Thus, a cutting-edge 3DGS-based SLAM system is introduced, leveraging the efficiency and flexibility of 3DGS to achieve real-time performance while remaining robust against sensor noise, motion blur, and the challenges posed by long-session SLAM. Central to this approach is the Fusion Bridge module, which seamlessly integrates tracking-centered ORB Visual Odometry with mapping-centered online 3DGS. Precise pose initialization is enabled by this module through joint optimization of re-projection and rendering loss, as well as strategic view selection, enhancing rendering convergence in large-scale scenes. Extensive experiments demonstrate state-of-the-art rendering quality and localization accuracy, positioning this system as a promising solution for real-world robotics applications that require stable, near-real-time performance. Our project is available at https://ZeldaFromHeaven.github.io/TAMBRIDGE/ This paper introduces a novel 3DGS-based SLAM system called TAMBRIDGE that enhances the convergence of online 3DGS by incorporating a plug-and-play Fusion Bridge module. This module integrates tracking-centered ORB Visual Odometry with mapping-centered online 3DGS, enabling precise pose initialization and strategic viewpoint selection. This addresses the limitations of existing 3DGS-based SLAM systems, which struggle with real-time performance and robustness against sensor noise and motion blur, especially in long-duration robotic tasks. The system employs a four-module structure: a Tracking-centered Frontend Module, a Tracking-centered Global Optimization Module, a Plug and Play Fusion Bridge Module, and an Online 3DGS Backend Module. The Fusion Bridge module is crucial, filtering keyframes, jointly optimizing rendering poses with border masks, and minimizing reprojection and rendering errors. TAMBRIDGE achieves state-of-the-art rendering quality and localization accuracy, comparable to SplaTAM but significantly faster. The system consistently maintains near real-time performance (>5 FPS) even in long-session robotic tasks, outperforming existing NeRF-based and 3DGS-based methods. Ablation studies highlight the importance of the Fusion Bridge module in bridging the gap between the tracking and mapping paradigms, thereby improving the accuracy and quality of the reconstruction. The Viewpoint Selection in the Fusion Bridge relies on manual thresholds and lacks self-learning, potentially limiting its adaptability. The evaluation primarily focuses on the TUM RGB-D dataset. Expanding to more datasets and exploring alternative SLAM frontends could further validate its generalizability. slam, 3d gaussian splatting, robotics perception, real-time performance, sensor noise robustness
2405.19609 Report SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations Yujiao Jiang, Qingmin Liao, Zhaolong Wang, Xiangru Lin, Zongqing Lu, Yuxi Zhao, Hanqing Wei, Jingrui Ye, Yu Zhang, Zhijing Shao Recovering photorealistic and drivable full-body avatars is crucial for numerous applications, including virtual reality, 3D games, and tele-presence. Most methods, whether reconstruction or generation, require large numbers of human motion sequences and corresponding textured meshes. To easily learn a drivable avatar, a reasonable parametric body model with unified topology is paramount. However, existing human body datasets either have images or textured models and lack parametric models which fit clothes well. We propose a new parametric model SMPLX-Lite-D, which can fit detailed geometry of the scanned mesh while maintaining stable geometry in the face, hand and foot regions. We present SMPLX-Lite dataset, the most comprehensive clothing avatar dataset with multi-view RGB sequences, keypoints annotations, textured scanned meshes, and textured SMPLX-Lite-D models. With the SMPLX-Lite dataset, we train a conditional variational autoencoder model that takes human pose and facial keypoints as input, and generates a photorealistic drivable human avatar. This paper introduces SMPLX-Lite, a comprehensive dataset for photorealistic and drivable avatar research, and proposes SMPLX-Lite-D, a new parametric model optimized for fitting detailed clothing geometry while maintaining facial and hand fidelity. Creating realistic, animatable avatars is crucial for various applications, but existing datasets often lack detailed clothing models or parametric representations suitable for driving. The authors capture a multi-view dataset of 5 subjects performing 15 actions, reconstruct textured meshes, fit SMPLX-Lite-D models, and train a conditional variational autoencoder (CVAE) to generate avatars from pose and facial keypoints. SMPLX-Lite dataset offers multi-view images, 3D keypoints, textured scanned meshes, and fitted SMPLX-Lite-D models with textures, enabling comprehensive research in drivable avatars. SMPLX-Lite-D model, derived from SMPL-X, simplifies vertex fitting for clothing while retaining high-fidelity facial and hand representation. The trained CVAE model effectively generates photorealistic avatars driven by pose and facial keypoints, outperforming baselines in novel view and pose synthesis. The current dataset, while extensive, could benefit from further diversity in action sequences and clothing styles. The proposed driving algorithm, while effective, can be improved to enhance generalization capabilities and facial expression control. drivable avatars, dataset, 3d human reconstruction, parametric models, conditional variational autoencoder
2405.19450 Report FourierMamba: Fourier Learning Integration with State Space Models for Image Deraining Dong Li, Yidi Liu, Xueyang Fu, Senyan Xu, Zheng-Jun Zha Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Some recent research employing the Fourier transform has proved effective for image deraining, since it acts as an effective frequency prior for capturing rain streaks. However, although low and high frequencies in images are interdependent, these Fourier-based methods rarely exploit the correlation between different frequencies to couple their learning procedures, limiting the full utilization of frequency information for image deraining. Alternatively, the recently emerged Mamba technique has demonstrated its effectiveness and efficiency for modeling correlation in various domains (e.g., spatial, temporal), and we argue that introducing Mamba into the unexplored Fourier space to correlate different frequencies would help improve image deraining. This motivates us to propose a new framework termed FourierMamba, which performs image deraining with Mamba in the Fourier space. Owing to the unique arrangement of frequency orders in Fourier space, the core of FourierMamba lies in the scanning encoding of different frequencies, where low-to-high frequency orderings appear differently in the spatial dimension (not arranged along an axis) and the channel dimension (arranged along an axis). Therefore, we design FourierMamba to correlate Fourier space information in the spatial and channel dimensions with distinct designs. Specifically, in the spatial-dimension Fourier space, we introduce zigzag coding to scan the frequencies and rearrange them from low to high, thereby orderly correlating the connections between frequencies; in the channel-dimension Fourier space, where frequencies are already ordered along the axis, we can directly use Mamba to perform frequency correlation and improve the channel information representation. Proposes FourierMamba, a novel image deraining framework that leverages Mamba, a type of State Space Model, to correlate different frequencies within the Fourier domain, improving rain streak removal and image restoration. Previous Fourier-based deraining methods fail to fully utilize frequency information by neglecting correlations between different frequencies, limiting their effectiveness. FourierMamba employs a multi-scale U-Net architecture with Fourier Residual State-Space Blocks (FRSSB). These blocks implement (1) a Fourier Spatial Interaction SSM, which uses zigzag-based scanning to correlate frequencies in the spatial dimension of Fourier space, addressing directional sensitivity limitations, and (2) a Fourier Channel Evolution SSM, which directly applies Mamba to correlate ordered frequencies in the channel dimension of Fourier space, improving global feature representation. FourierMamba achieves state-of-the-art performance on benchmark datasets (Rain100H, Rain100L, Test2800, Test1200), outperforming existing methods in both PSNR and SSIM metrics. Ablation studies demonstrate the effectiveness of key components like the Fourier Spatial/Channel SSMs, Fourier priors, and the proposed zigzag scanning methods. Qualitative analysis highlights FourierMamba's superior performance in removing rain streaks and restoring image details, particularly in complex and severe rain conditions. The reliance on fixed scanning patterns may limit adaptability to varying rain characteristics. Exploring adaptive scanning strategies based on rain density and direction could further enhance deraining performance. image deraining, fourier transform, state space models, mamba, frequency correlation
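A small sketch of the low-to-high frequency scanning idea (an assumption-laden approximation, not the paper's implementation): after centering the spectrum, coefficients are ordered by a JPEG-style zigzag criterion so that a 1D sequence model such as Mamba can scan them from low to high frequency.

```python
import numpy as np

def low_to_high_frequency_sequence(x):
    """x: (H, W) real array -> (H*W,) complex sequence ordered by frequency."""
    spec = np.fft.fftshift(np.fft.fft2(x))              # put the DC term at the center
    H, W = spec.shape
    u = np.abs(np.arange(H) - H // 2)[:, None]
    v = np.abs(np.arange(W) - W // 2)[None, :]
    order = np.argsort((u + v).ravel(), kind="stable")  # zigzag-like low-to-high order
    return spec.ravel()[order], order

seq, order = low_to_high_frequency_sequence(np.random.rand(16, 16))
# `seq` would feed a sequence model; `order` allows scattering outputs back
# into the 2D spectrum before the inverse FFT.
```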
2405.19335 Report X-VILA: Cross-Modality Alignment for Large Language Model Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, that exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source. Introduces X-VILA, an omni-modality LLM that integrates image, video, and audio modalities to achieve cross-modality understanding, reasoning, and generation. Extends LLM capabilities beyond text, enabling multi-modal conversations and content generation. Aligns modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs. Utilizes a novel two-phase alignment mechanism (textual and visual) and introduces a visual embedding highway (VEH) to preserve visual details. Achieves any-to-any modality (X-to-X) conversation, surpassing previous approaches. Demonstrates emergent properties, like long-context cross-modality generation and unseen cross-modality abilities (e.g., image-to-audio). Shows significant improvements in visual correspondence on X-to-X alignment benchmarks compared to state-of-the-art methods. Further performance improvement is possible across VLM benchmarks. Exploring other techniques beyond VEH to further improve visual alignment. multi-modality, large language model, cross-modality alignment, visual embedding highway, x-to-x generation
2405.19331 Report NPGA: Neural Parametric Gaussian Avatars Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner The creation of high-fidelity, digital versions of human heads is an important stepping stone in the process of further integrating virtual components into our everyday lives. Constructing such avatars is a challenging research problem, due to a high demand for photo-realism and real-time rendering performance. In this work, we propose Neural Parametric Gaussian Avatars (NPGA), a data-driven approach to create high-fidelity, controllable avatars from multi-view video recordings. We build our method around 3D Gaussian Splatting for its highly efficient rendering and to inherit the topological flexibility of point clouds. In contrast to previous work, we condition our avatars' dynamics on the rich expression space of neural parametric head models (NPHM), instead of mesh-based 3DMMs. To this end, we distill the backward deformation field of our underlying NPHM into forward deformations which are compatible with rasterization-based rendering. All remaining fine-scale, expression-dependent details are learned from the multi-view videos. To increase the representational capacity of our avatars, we augment the canonical Gaussian point cloud using per-primitive latent features which govern its dynamic behavior. To regularize this increased dynamic expressivity, we propose Laplacian terms on the latent features and predicted dynamics. We evaluate our method on the public NeRSemble dataset, demonstrating that NPGA significantly outperforms the previous state-of-the-art avatars on the self-reenactment task by 2.6 PSNR. Furthermore, we demonstrate accurate animation capabilities from real-world monocular videos. This paper proposes Neural Parametric Gaussian Avatars (NPGA), a method for creating high-fidelity, controllable avatars from multi-view videos by leveraging the expressive power of Neural Parametric Head Models (NPHMs). Creating realistic and controllable digital avatars is crucial for various applications like VR, AR, and the metaverse. The method distills the backward deformation field of a pre-trained NPHM into a forward deformation field compatible with 3D Gaussian Splatting (3DGS). This prior guides the avatar's motion, while per-Gaussian latent features and a detail network capture fine-grained details and appearance changes. The approach uses a cycle-consistency loss for distillation and optimizes the avatar using a photometric loss with Laplacian regularization. NPGA outperforms previous state-of-the-art methods on self-reenactment by a significant margin (2.6 PSNR improvement). The method enables accurate cross-reenactment, transferring expressions from one person to the avatar. The avatars can be animated from monocular RGB videos, demonstrating applicability outside controlled environments. The controllability is limited by the underlying NPHM, restricting animation of regions like the neck and torso. The data-driven nature limits the avatar to expressions observed in the training data. avatar creation, 3d gaussian splatting, neural parametric head model, facial reenactment, multi-view video
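A minimal sketch of cycle-consistency distillation of a backward deformation field into a forward one, as described above (both fields are stand-in MLPs here, and the 16-dimensional expression code is an assumed size, not the NPHM interface):

```python
import torch
import torch.nn as nn

def make_mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

backward_field = make_mlp(3 + 16, 3)    # pre-trained prior, kept frozen
forward_field = make_mlp(3 + 16, 3)     # distilled field usable with rasterization
for p in backward_field.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(forward_field.parameters(), lr=1e-3)
for step in range(100):
    x_canonical = torch.rand(4096, 3)                    # canonical points
    expr = torch.randn(1, 16).expand(4096, -1)           # expression code
    x_deformed = x_canonical + forward_field(torch.cat([x_canonical, expr], dim=-1))
    x_cycled = x_deformed + backward_field(torch.cat([x_deformed, expr], dim=-1))
    loss = ((x_cycled - x_canonical) ** 2).mean()        # cycle-consistency objective
    opt.zero_grad(); loss.backward(); opt.step()
```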
2405.19326 Report Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models Tianrun Chen, Chunan Yu, Jing Li, Jianqi Zhang, Lanyun Zhu, Deyi Ji, Yong Zhang, Ying Zang, Zejian Li, Lingyun Sun In this paper, we introduce a new task: Zero-Shot 3D Reasoning Segmentation for searching and localizing parts of objects, a new paradigm for 3D segmentation that transcends the limitations of previous category-specific 3D semantic segmentation, 3D instance segmentation, and open-vocabulary 3D segmentation. We design a simple baseline method, Reasoning3D, capable of understanding and executing complex commands for (fine-grained) segmentation of specific parts of 3D meshes, with contextual awareness and reasoned answers for interactive segmentation. Specifically, Reasoning3D leverages an off-the-shelf pre-trained 2D segmentation network, powered by Large Language Models (LLMs), to interpret user input queries in a zero-shot manner. Previous research has shown that extensive pre-training endows foundation models with prior world knowledge, enabling them to comprehend complex commands, a capability we can harness to "segment anything" in 3D with limited 3D datasets (source efficient). Experimentation reveals that our approach is generalizable and can effectively localize and highlight parts of 3D objects (in 3D mesh) based on implicit textual queries, including articulated 3D objects and real-world scanned data. Our method can also generate natural language explanations corresponding to these 3D models and the decomposition. Moreover, our training-free approach allows rapid deployment and serves as a viable universal baseline for future research on part-level 3D (semantic) object understanding in various fields including robotics, object manipulation, part assembly, autonomous driving applications, augmented reality and virtual reality (AR/VR), and medical applications. The code, model weights, deployment guide, and evaluation protocol are available at: http://tianrun-chen.github.io/Reason3D/ Introduces Zero-Shot 3D Reasoning Segmentation for part localization in 3D objects using natural language, going beyond traditional 3D segmentation limitations. Enables flexible and intuitive interaction with 3D objects using natural language, beneficial for robotics, AR/VR, and other fields. Leverages pre-trained 2D reasoning segmentation networks and LLMs to interpret user queries, rendering 3D objects from multiple viewpoints for 2D processing and fusing the results back into 3D. Achieves competitive performance on open-vocabulary 3D segmentation benchmarks. Successfully segments parts of 3D objects based on implicit textual queries. Provides natural language explanations for segmentation results. Comprehensive benchmarking and user studies are needed for further evaluation. Optimizing viewpoint selection could improve performance. 3d segmentation, reasoning segmentation, 3d part understanding, large language models, large vision-language models
2405.19315 Report Matryoshka Query Transformer for Large Vision-Language Models Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8x less TFLOPs) only sacrifices the performance by 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6% each. Our exploration of the trade-off between the accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds. Introduces Matryoshka Query Transformer (MQT), enabling flexible selection of visual token numbers in Large Vision-Language Models (LVLMs) at inference time, adapting to computational constraints. Existing LVLMs use a fixed number of visual tokens, posing challenges for tasks with varying computational resources or requiring different levels of visual granularity. Trains a query transformer with a Matryoshka structure, randomly dropping tail tokens during training to enable inference with any number of tokens up to a predefined maximum. Achieves comparable or superior performance to fixed-token models. Matches LLaVA-1.5 performance on 11 benchmarks with less than half the tokens. Exhibits different token sensitivity across tasks, with some remaining robust even with drastically reduced tokens. Current maximum token number at inference limited to 256. Future work to explore exceeding training token limits during inference. vision-language models, matryoshka representation learning, query transformer, elastic inference, computational efficiency
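A minimal sketch of the Matryoshka-style query truncation (a generic cross-attention pooler stands in for the model's actual query transformer; all sizes are illustrative): during each training step only the first m of M latent queries are used, so any prefix length works at inference.

```python
import torch
import torch.nn as nn

class MatryoshkaQueryPooler(nn.Module):
    def __init__(self, dim=768, max_queries=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens, m=None):
        M = self.queries.shape[0]
        if m is None:                          # training: sample a prefix length
            m = int(torch.randint(1, M + 1, (1,)))
        q = self.queries[:m].unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        out, _ = self.attn(q, visual_tokens, visual_tokens)
        return out                             # (B, m, dim) compressed visual tokens

pooler = MatryoshkaQueryPooler()
compressed = pooler(torch.randn(2, 576, 768), m=16)   # keep 16 visual tokens at inference
```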
2405.19237 Report ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning Ruchika Chavhan, Da Li, Timothy Hospedales While large-scale text-to-image diffusion models have demonstrated impressive image-generation capabilities, there are significant concerns about their potential misuse for generating unsafe content, violating copyright, and perpetuating societal biases. Recently, the text-to-image generation community has begun addressing these concerns by editing or unlearning undesired concepts from pre-trained models. However, these methods often involve data-intensive and inefficient fine-tuning or utilize various forms of token remapping, rendering them susceptible to adversarial jailbreaks. In this paper, we present a simple and effective training-free approach, ConceptPrune, wherein we first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning. Experiments across a range of concepts including artistic styles, nudity, object erasure, and gender debiasing demonstrate that target concepts can be efficiently erased by pruning a tiny fraction, approximately 0.12% of total weights, enabling multi-concept erasure and robustness against various white-box and black-box adversarial attacks. This paper introduces ConceptPrune, a training-free method for concept editing in pre-trained diffusion models by identifying and pruning skilled neurons responsible for generating undesired concepts. This method addresses the risks of large-scale text-to-image models generating unsafe content, violating copyright, and perpetuating societal biases by providing a more efficient and robust alternative to existing concept editing and unlearning techniques. ConceptPrune identifies skilled neurons in Feed-Forward Networks (FFNs) of diffusion models by comparing the importance scores of neuron activations for target and reference concepts using a pruning strategy inspired by Wanda. These skilled neurons are then pruned to eliminate the undesired concept. ConceptPrune effectively erases undesired concepts like artistic styles, nudity, and objects, as demonstrated by quantitative metrics and qualitative examples. ConceptPrune exhibits strong robustness to both white-box and black-box adversarial attacks aimed at circumventing concept erasure. Concept-generating neurons are localized to a very compact subspace, suggesting efficient concept editing with minimal impact on overall model performance. There might be some degree of interference when erasing fine-grained classes or concepts. Erasing a large number of objects simultaneously may degrade the overall image generation quality. concept editing, diffusion models, model pruning, adversarial robustness, text-to-image generation
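A minimal sketch of Wanda-style skilled-neuron scoring as described above (the tensors are placeholders rather than Stable Diffusion FFN weights, and the pruning ratio is illustrative): neurons whose importance for the target concept exceeds their importance for reference prompts are zeroed out.

```python
import torch

def skilled_neuron_mask(weight, acts_target, acts_reference, ratio=0.0012):
    """weight: (out, in); acts_*: (num_tokens, in) FFN inputs collected while
    generating target / reference prompts (hypothetical collection step)."""
    score_t = weight.abs() * acts_target.norm(dim=0)      # (out, in), Wanda-style score
    score_r = weight.abs() * acts_reference.norm(dim=0)
    gap = (score_t - score_r).clamp(min=0).sum(dim=1)     # concept skill per output neuron
    k = max(1, int(ratio * gap.numel()))
    prune_idx = gap.topk(k).indices                       # most concept-skilled neurons
    mask = torch.ones(weight.shape[0], dtype=weight.dtype)
    mask[prune_idx] = 0.0
    return mask                                           # multiplies rows of the weight

W = torch.randn(3072, 768)
mask = skilled_neuron_mask(W, torch.randn(512, 768), torch.randn(512, 768))
W_pruned = W * mask[:, None]
```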
2405.19209 Report VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal Video-language understanding tasks have focused on short video clips, often struggling with long-form video understanding tasks. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video QA, transforming videos into densely sampled frame captions, and asking LLMs to respond to text queries over captions. However, the frames used for captioning are often redundant and contain irrelevant information, making dense sampling inefficient, and ignoring the fact that video QA requires varying levels of granularity, with some video segments being highly relevant to the question (needing more fine-grained detail) while others being less relevant. Thus, these LLM-based approaches are prone to missing information and operate on large numbers of irrelevant captions, lowering both performance and efficiency. To address these issues, we introduce VideoTree, a query-adaptive and hierarchical framework for long-video understanding with LLMs. VideoTree dynamically extracts query-related information from a video and builds a tree-based representation for LLM reasoning. First, VideoTree adaptively selects frames for captioning by iteratively clustering frames based on their visual features and scoring clusters using their relevance to the query. Second, it organizes visual clusters into a query-adaptive and hierarchical tree structure; the tree encodes varying levels of granularity, with higher resolution on relevant segments. Finally, VideoTree produces an answer by traversing the tree's keyframes and passing their captions to an LLM answerer. Our method improves both reasoning accuracy and efficiency compared to existing methods: VideoTree achieves a 7.0%, 2.2%, and 2.7% accuracy gain over baselines on the EgoSchema, NExT-QA, and IntentQA benchmarks, respectively, while reducing inference time by 40%. This paper introduces VideoTree, a query-adaptive and hierarchical framework that enhances long-video understanding with Large Language Models (LLMs) by dynamically building a tree-based video representation. Existing LLM-based long-video understanding methods suffer from inefficiencies and inaccuracies due to redundant frame information, lack of query adaptation, and inability to capture hierarchical video structure. VideoTree uses a three-step process: (1) Adaptive Breadth Expansion: Clusters frames based on visual features and relevance to the query, (2) Relevance-Guided Depth Expansion: Explores relevant clusters in a coarse-to-fine manner to extract detailed information, (3) LLM-based Reasoning: Leverages captions from selected keyframes for question answering. VideoTree achieves state-of-the-art accuracy on EgoSchema, NExT-QA, and IntentQA benchmarks, outperforming previous methods by significant margins. The method demonstrates both improved accuracy and efficiency, requiring fewer captions than uniform sampling baselines to achieve comparable or better performance. Qualitative analysis reveals VideoTree's ability to effectively identify and focus on query-relevant video segments while filtering out irrelevant information. The effectiveness of VideoTree relies on the accuracy of the VLM captioner used. While training-free, VideoTree includes hyperparameters, though experiments demonstrate its robustness even with suboptimal settings. long-form video understanding, large language models, query-adaptive representation, hierarchical video representation, video question answering
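A minimal sketch of query-adaptive keyframe selection in the spirit of the first step above (assumes precomputed CLIP-style frame and query embeddings; cluster counts are illustrative): frames are clustered, clusters are ranked by query relevance, and the frame closest to each relevant centroid is kept for captioning.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(frame_feats, query_feat, n_clusters=8, top_clusters=3):
    frame_feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    query_feat = query_feat / np.linalg.norm(query_feat)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(frame_feats)
    relevance = km.cluster_centers_ @ query_feat          # cosine-like cluster relevance
    keyframes = []
    for c in np.argsort(relevance)[::-1][:top_clusters]:
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(frame_feats[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))  # frame closest to the centroid
    return sorted(keyframes)

frames = np.random.randn(300, 512).astype(np.float32)     # stand-in frame embeddings
print(select_keyframes(frames, np.random.randn(512).astype(np.float32)))
```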
2405.19035 Report A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation Niclas Vödisch, Kürsat Petek, Markus Käppeler, Abhinav Valada, Wolfram Burgard A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork paved by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by employing PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at http://pastel.cs.uni-freiburg.de. This paper introduces PASTEL, a novel approach for label-efficient panoptic segmentation that leverages the descriptive image features from the DINOv2 foundation model. Reducing the dependency on large, densely annotated datasets for training segmentation models is crucial for lowering operational costs and speeding up deployment, particularly in robotics. PASTEL employs a frozen DINOv2 backbone for feature extraction and trains lightweight heads for semantic segmentation and object boundary detection using very few annotated images. It then merges the predictions through a novel fusion module based on normalized cut and refines performance via self-training on unlabeled, visually similar images. PASTEL achieves state-of-the-art performance on Cityscapes, Pascal VOC, and PhenoBench datasets using significantly fewer labeled images than previous methods (as few as 10 annotated images). The method effectively leverages self-training on unlabeled data to further improve segmentation accuracy. PASTEL can be used as a plugin to generate pseudo-labels, rendering conventional densely supervised models label-efficient. The current method struggles to assign the same instance ID to different parts of the same object when occlusion is present, leading to over-segmentation. All semantic classes must be present in the few labeled training images, limiting applicability in some scenarios. panoptic segmentation, label-efficient learning, foundation models, dinov2, self-training
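A minimal sketch of the label-efficient part of the recipe (random tensors stand in for frozen DINOv2 patch features; the boundary head, normalized-cut fusion, and self-training are omitted): only a lightweight head is trained on very few labeled samples.

```python
import torch
import torch.nn as nn

num_classes, feat_dim = 19, 384                        # assumed ViT-S/14-style features
head = nn.Conv2d(feat_dim, num_classes, kernel_size=1) # lightweight semantic head
optim = torch.optim.AdamW(head.parameters(), lr=1e-3)

for step in range(10):                                 # few labeled samples
    feats = torch.randn(2, feat_dim, 32, 32)           # frozen-backbone patch features
    labels = torch.randint(0, num_classes, (2, 32, 32))
    logits = head(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optim.zero_grad(); loss.backward(); optim.step()
```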
2405.18991 Report EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang This paper presents EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes. We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block. It is used to capture temporal dynamics, thereby ensuring the production of consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate video with different styles. It can also generate videos with different frame rates and resolutions during both training and inference phases, suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long duration videos. Currently, EasyAnimate exhibits the proficiency to generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training (both the baseline model and LoRA model), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method. Introduces EasyAnimate, a high-performance AI video generation pipeline based on transformer architecture, featuring a motion module for smooth transitions and adaptable frame/resolution settings. Addresses limitations in existing video generation models like poor quality, limited length, and unnatural movement. Expands the DiT framework with a motion module, slice VAE for long video generation, and a three-stage training process using image and video data. Achieves high-performance video generation with consistent frames and smooth motion. Generates videos with different frame rates and resolutions, suitable for both images and videos. Enables long-duration video generation (up to 144 frames currently). Video quality still being improved. Further exploration on motion module design for enhanced motion generation. video generation, transformer, motion module, slice vae, dit
2405.18937 Report Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding Junjie Fei, Mahmoud Ahmed, Jian Ding, Eslam Mohamed Bakr, Mohamed Elhoseiny While 3D MLLMs have achieved significant progress, they are restricted to object and scene understanding and struggle to understand 3D spatial structures at the part level. In this paper, we introduce Kestrel, representing a novel approach that empowers 3D MLLMs with part-aware understanding, enabling better interpretation and segmentation grounding of 3D objects at the part level. Despite its significance, the current landscape lacks tasks and datasets that endow and assess this capability. Therefore, we propose two novel tasks: (1) Part-Aware Point Grounding, in which the model is tasked with directly predicting a part-level segmentation mask based on user instructions, and (2) Part-Aware Point Grounded Captioning, in which the model provides a detailed caption that includes part-level descriptions and their corresponding masks. To support learning and evaluation for these tasks, we introduce the 3DCoMPaT Grounded Instructions Dataset (3DCoMPaT-GRIN). 3DCoMPaT-GRIN Vanilla, comprising 789k part-aware point cloud-instruction-segmentation mask triplets, is used to evaluate MLLMs' ability to perform part-aware segmentation grounding. 3DCoMPaT-GRIN Grounded Caption, containing 107k part-aware point cloud-instruction-grounded caption triplets, assesses both MLLMs' part-aware language comprehension and segmentation grounding capabilities. Our introduced tasks, dataset, and Kestrel represent a preliminary effort to bridge the gap between human cognition and 3D MLLMs, i.e., the ability to perceive and engage with the environment at both global and part levels. Extensive experiments on 3DCoMPaT-GRIN show that Kestrel can generate user-specified segmentation masks, a capability not present in any existing 3D MLLM. Kestrel thus establishes a benchmark for evaluating the part-aware language comprehension and segmentation grounding of 3D objects. Project page at https://feielysia.github.io/Kestrel.github.io/ This paper introduces Kestrel, a 3D Multimodal Large Language Model (MLLM) that understands and grounds objects at the part level. Existing 3D MLLMs struggle to understand 3D structures at the part level, limiting their ability to interact with the environment in a nuanced way, like humans. The paper introduces two new tasks: (1) part-aware point grounding - predicting part-level segmentation masks based on user instructions, and (2) part-aware point grounded captioning - generating detailed captions with part-level descriptions and corresponding segmentation masks. A new dataset, 3DCoMPaT-GRIN, is created for these tasks. Kestrel incorporates a 3D segmentation grounding module to enable part-level understanding. Kestrel significantly outperforms baseline models in part-aware point grounding, demonstrating accurate part and material localization. Kestrel excels in part-aware point grounded captioning, generating detailed descriptions and accurately grounding mentioned parts. Ablation studies show the importance of LoRA rank and the choice of projection layer in Kestrel's performance. Current annotation in 3DCoMPaT-GRIN is limited to part and material masks and could be extended to include more part-level attributes. Future work aims to extend the part-aware segmentation grounding capability beyond single objects to enhance interaction with the 3D world. 3d vision-language models, part-aware understanding, segmentation grounding, 3d point cloud understanding, multimodal learning
2405.18897 Report MLAE: Masked LoRA Experts for Parameter-Efficient Fine-Tuning Junjie Wang, Guangjing Yang, Wentao Chen, Huahui Yi, Xiaohu Wu, Qicheng Lao In response to the challenges posed by the extensive parameter updates required for full fine-tuning of large-scale pre-trained models, parameter-efficient fine-tuning (PEFT) methods, exemplified by Low-Rank Adaptation (LoRA), have emerged. LoRA simplifies the fine-tuning process but may still struggle with a certain level of redundancy in low-rank matrices and limited effectiveness from merely increasing their rank. To address these issues, a natural idea is to enhance the independence and diversity of the learning process for the low-rank matrices. Therefore, we propose Masked LoRA Experts (MLAE), an innovative approach that applies the concept of masking to PEFT. Our method incorporates a cellular decomposition strategy that transforms a low-rank matrix into independent rank-1 submatrices, or ``experts'', thus enhancing independence. Additionally, we introduce a binary mask matrix that selectively activates these experts during training to promote more diverse and anisotropic learning, based on expert-level dropout strategies. Our investigations reveal that this selective activation not only enhances performance but also fosters a more diverse acquisition of knowledge with a marked decrease in parameter similarity among MLAE, significantly boosting the quality of the model while barely increasing the parameter count. Remarkably, MLAE achieves new SOTA performance with an average accuracy score of 78.8% on the VTAB-1k benchmark and 90.9% on the FGVC benchmark, demonstrating superior performance. Our code is available at https://github.com/jie040109/MLAE. The paper proposes Masked LoRA Experts (MLAE), a novel parameter-efficient fine-tuning method that applies masking to enhance the independence and diversity of learning in low-rank matrices. Existing parameter-efficient fine-tuning methods, particularly those based on Low-Rank Adaptation (LoRA), struggle with redundancy and limited effectiveness in improving model quality. MLAE addresses these limitations by promoting diverse and independent learning in low-rank matrices. MLAE decomposes low-rank matrices into rank-1 submatrices, treating them as independent experts. It then introduces a mask matrix with adaptive coefficients, applying it to the decomposed matrix to selectively activate experts during training. This selective activation, implemented through expert-level dropout, enhances diversity and reduces redundancy. MLAE achieves state-of-the-art performance on the VTAB-1k benchmark with an average accuracy of 78.8% and on the FGVC benchmark with 90.9% accuracy. The method demonstrates significantly reduced parameter similarity compared to vanilla LoRA, indicating enhanced independence among learned experts. Feature attention map visualizations reveal that different MLAE experts focus on distinct feature areas within the same block, highlighting the diversity and complementarity of their representations. The optimal probability of stochastic masking varies across datasets, necessitating dataset-specific tuning. Future work could explore metrics to determine optimal masking probabilities based on dataset characteristics or training performance, and investigate layer-wise optimal probabilities. parameter-efficient fine-tuning, low-rank adaptation (lora), masking strategies, vision transformers (vit), transfer learning
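A minimal sketch of masked rank-1 LoRA experts as described above (layer sizes, rank, and the keep probability are illustrative assumptions): the low-rank update is decomposed into rank-1 experts and a Bernoulli mask randomly deactivates experts during training.

```python
import torch
import torch.nn as nn

class MaskedLoRAExperts(nn.Module):
    def __init__(self, in_dim=768, out_dim=768, rank=8, keep_prob=0.7):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # rank-1 factors
        self.B = nn.Parameter(torch.zeros(out_dim, rank))
        self.coef = nn.Parameter(torch.ones(rank))               # adaptive per-expert scales
        self.keep_prob = keep_prob

    def forward(self, x):
        if self.training:
            mask = torch.bernoulli(torch.full_like(self.coef, self.keep_prob))
        else:
            mask = torch.full_like(self.coef, self.keep_prob)    # expectation at inference
        # Sum of masked rank-1 experts: B diag(coef * mask) A.
        delta_w = self.B @ torch.diag(self.coef * mask) @ self.A
        return x @ delta_w.T                                     # LoRA-style update only

layer = MaskedLoRAExperts()
y = layer(torch.randn(4, 768))
```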
2405.18852 Report LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping Nikhil Gosala, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Paulo Drews-Jr, Wolfram Burgard, Abhinav Valada Semantic Bird's Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner and then finetunes it for the task of semantic BEV mapping using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach performs on par with the existing state-of-the-art approaches while using only 1% of BEV labels and no additional labeled data. LetsMap is the first unsupervised representation learning framework to predict semantic Bird's Eye View (BEV) maps from monocular front-view images in a label-efficient manner. Semantic BEV maps are essential for autonomous driving but most current approaches rely heavily on large, annotated datasets which are difficult and time-consuming to create. The framework uses two disentangled neural pathways: one for scene geometry modeling using implicit fields and another for scene representation learning using a novel temporal masked autoencoder. These pathways are pretrained in an unsupervised manner and then fine-tuned for semantic BEV mapping using a small fraction of labeled data. LetsMap outperforms most existing fully-supervised and self-supervised methods on KITTI-360 using only 1% of BEV labels. On the nuScenes dataset, LetsMap achieves comparable performance to most fully-supervised baselines despite the challenge of dynamic scenes. Ablation studies demonstrate the contributions of individual components, including the importance of pretraining and the effectiveness of the temporal masked autoencoder. The implicit field formulation assumes a static scene, limiting performance in highly dynamic environments. The reliance on photometric loss for supervision makes the model sensitive to varying lighting and occlusions. unsupervised representation learning, semantic bev mapping, scene understanding, autonomous driving, label-efficient learning
2405.18842 Report Descriptive Image Quality Assessment in the Wild Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Tianfan Xue, Chao Dong With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that is helpful to filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications including assessing the web-downloaded images and ranking model-processed images. Datasets and codes will be released in https://depictqa.github.io/depictqa-wild/. This paper introduces DepictQA-Wild, a multi-functional VLM-based Image Quality Assessment (IQA) model that handles a wide range of IQA tasks and overcomes limitations of previous models in functionality and performance. Existing VLM-based IQA models are limited to specific sub-tasks and exhibit sub-optimal performance due to limitations in dataset coverage, scale, and quality. DepictQA-Wild addresses these limitations to provide a more practical and versatile IQA solution. The authors define a multi-functional IQA task paradigm encompassing assessment and comparison, brief and detailed responses, and full-reference and non-reference scenarios. They construct a large-scale, high-quality dataset, DQ-495K, using a ground-truth-informed generation approach. The model is trained while retaining image resolution and incorporates confidence estimation. DepictQA-Wild significantly outperforms traditional score-based IQA methods, prior VLM-based IQA models, and GPT-4V in various tasks, including distortion identification, instant rating, and reasoning. The model shows strong generalization ability, achieving high accuracy even in out-of-distribution settings. DepictQA-Wild demonstrates its practicality in real-world applications, such as assessing web-downloaded images and ranking model-processed images. The model's fine-grained abilities requiring high-level perception skills need further improvement. The task paradigm can be extended to include comparisons among images with different contents and incorporate image aesthetics. image quality assessment, vision language models, multi-functional iqa, large-scale dataset, deep learning
2405.18840 Report Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yaoming Wang, Lingxi Xie, Qi Tian, Wei Shen Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1) high computational cost, 2) misalignment between the two inherent modalities of CLIP, and 3) degraded generalization ability on unseen categories. To address these issues, we propose H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both of the two CLIP modalities. Specifically, the PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Since the PEFT strategy is conducted symmetrically to the two CLIP modalities, the misalignment between them is mitigated. Furthermore, we apply an additional constraint to PEFT on the CLIP text encoder according to the hyperspherical energy principle, i.e., minimizing hyperspherical energy during fine-tuning preserves the intrinsic structure of the original parameter space, to prevent the destruction of the generalization ability offered by the CLIP text encoder. Extensive evaluations across various benchmarks show that H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while requiring updates to only approximately 4% of the total parameters of CLIP. This paper introduces H-CLIP, a symmetric parameter-efficient fine-tuning (PEFT) strategy for CLIP, enhancing open-vocabulary semantic segmentation by addressing limitations of existing methods. Fine-tuning CLIP for pixel-level prediction often leads to high computational costs, misalignment between CLIP's modalities, and reduced generalization ability on unseen categories. H-CLIP aims to tackle these challenges. H-CLIP utilizes a partial orthogonal fine-tuning strategy in hyperspherical space, employing block-diagonal learnable transformation matrices. Orthogonal constraints are applied to CLIP's text encoder to preserve generalization. A dual cross-relation communication module facilitates alignment between modalities and layers. H-CLIP achieves state-of-the-art open-vocabulary semantic segmentation results on multiple benchmarks. It achieves this while only updating approximately 4% of CLIP's total parameters, demonstrating its efficiency. Ablation studies confirm the individual contributions of partial orthogonal fine-tuning and dual cross-relation communication. The performance of H-CLIP is still dependent on the design of the block dimension. Further exploration of more effective communication mechanisms within H-CLIP is a potential avenue for improvement. open-vocabulary semantic segmentation, clip, parameter-efficient fine-tuning, hyperspherical energy, dual cross-relation communication
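A minimal sketch of a hyperspherical-energy term used as a regularization target (a toy computation under assumed shapes, not the full fine-tuning scheme): energy here is the averaged inverse pairwise distance between weight vectors projected onto the unit sphere, and keeping it close to the pre-trained value preserves the original parameter structure.

```python
import torch

def hyperspherical_energy(weight, eps=1e-6):
    """weight: (num_neurons, dim). Lower energy = more uniformly spread directions."""
    w = torch.nn.functional.normalize(weight, dim=1)
    dist = torch.cdist(w, w) + torch.eye(w.size(0)) * 1e6   # suppress self-pairs
    return (1.0 / (dist + eps)).sum() / (w.size(0) * (w.size(0) - 1))

w_pretrained = torch.randn(512, 768)
w_finetuned = w_pretrained + 0.01 * torch.randn(512, 768)
# Penalize drift of the fine-tuned energy away from the pre-trained energy.
reg = (hyperspherical_energy(w_finetuned) - hyperspherical_energy(w_pretrained)).abs()
```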
2405.18831 Report Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks Simranjit Singh, Georgios Pavlakos, Dimitrios Stamoulis As interest in "reformulating" the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundational models (GPT-4 Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and ScanQA. We provide an investigation to contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents without any fine-tuning perform on par with the closed-vocabulary approaches. Our findings corroborate recent results that "blind" models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary comparison with previous baselines, we hope to inform the community's ongoing efforts to refine multi-modal 3D benchmarks. This paper presents a case study evaluating the zero-shot performance of GPT-4 Vision and GPT-4 on established 3D VQA benchmarks (3D-VQA and ScanQA) to understand how these foundational models impact existing closed-vocabulary datasets. With growing interest in adapting 3D VQA for foundation models, it's crucial to understand how these models perform on existing benchmarks and how they compare to traditional approaches. The study uses GPT-4V for captioning scene meshes, GPT-4 Turbo to answer questions based on these captions, and compares their performance to existing baselines on ScanQA and 3D-VQA. They investigate different captioning schemes (open-vocabulary and vocabulary-grounded) and analyze the impact of different parameters like frame sample rate and batch size. Finetuning-free GPT agents perform surprisingly well, achieving scores within 10% of meticulously crafted DNN-based baselines on ScanQA. Blind GPT agents (without visual input) demonstrate surprisingly robust performance, highlighting the power of language priors and 'common sense'. GPT-4V benefits significantly from scene-specific vocabulary during captioning, indicating the importance of grounded language descriptions. The study primarily focuses on zero-shot performance and doesn't explore finetuning GPT models on these specific datasets. While the study analyzes the impact of several parameters, more comprehensive exploration of prompt engineering and visual grounding techniques could further improve results. 3d visual question answering, gpt-4, gpt-4 vision, foundation models, zero-shot learning
2405.18801 Report SketchTriplet: Self-Supervised Scenarized Sketch-Text-Image Triplet Generation Zhenbei Wu, Qiang Wang, Jie Yang The scarcity of free-hand sketch presents a challenging problem. Despite the emergence of some large-scale sketch datasets, these datasets primarily consist of sketches at the single-object level. There continues to be a lack of large-scale paired datasets for scene sketches. In this paper, we propose a self-supervised method for scene sketch generation that does not rely on any existing scene sketch, enabling the transformation of single-object sketches into scene sketches. To accomplish this, we introduce a method for vector sketch captioning and sketch semantic expansion. Additionally, we design a sketch generation network that incorporates a fusion of multi-modal perceptual constraints, suitable for application in zero-shot image-to-sketch downstream task, demonstrating state-of-the-art performance through experimental validation. Finally, leveraging our proposed sketch-to-sketch generation method, we contribute a large-scale dataset centered around scene sketches, comprising highly semantically consistent "text-sketch-image" triplets. Our research confirms that this dataset can significantly enhance the capabilities of existing models in sketch-based image retrieval and sketch-controlled image synthesis tasks. We will make our dataset and code publicly available. This paper proposes a self-supervised method for generating scene sketches from single-object sketches, without relying on existing scene sketch datasets. Scene sketches are crucial for understanding human visual comprehension and fine-grained design, but current datasets are limited. This method addresses the scarcity of scene sketch data. The method uses vector sketch captioning to extract semantic information from single-object sketches, expands it using a large image description dataset, and then generates scene sketches using a multi-modal fusion approach with text, image, and sketch constraints. The method successfully generates scene sketches from single-object sketches, outperforming existing methods in zero-shot image-to-sketch generation. A large-scale dataset "SketchTriplet" is created, containing 1,000,000 "text-sketch-image" triplets with high semantic consistency. Retraining existing models with SketchTriplet significantly improves performance in sketch-based image retrieval and sketch-controlled image synthesis tasks. The current method doesn't offer control over transparency in the generated sketches. The generated sketches are limited to a single style. scene sketch generation, sketch-to-sketch, self-supervised learning, multi-modal fusion, dataset creation
2405.18784 Report LP-3DGS: Learning to Prune 3D Gaussian Splatting Zhaoliang Zhang, Tianchen Song, Yongjae Lee, Li Yang, Cheng Peng, Rama Chellappa, Deliang Fan Recently, 3D Gaussian Splatting (3DGS) has become one of the mainstream methodologies for novel view synthesis (NVS) due to its high quality and fast rendering speed. However, as a point-based scene representation, 3DGS potentially generates a large number of Gaussians to fit the scene, leading to high memory usage. Improvements that have been proposed require either an empirical and preset pruning ratio or importance score threshold to prune the point cloud. Such a hyperparameter requires multiple rounds of training to optimize, in order to achieve the maximum pruning ratio while maintaining rendering quality for each scene. In this work, we propose learning-to-prune 3DGS (LP-3DGS), where a trainable binary mask is applied to the importance score to find the optimal pruning ratio automatically. Instead of using the traditional straight-through estimator (STE) method to approximate the binary mask gradient, we redesign the masking function to leverage the Gumbel-Sigmoid method, making it differentiable and compatible with the existing training process of 3DGS. Extensive experiments have shown that LP-3DGS consistently produces a good balance between efficiency and rendering quality. This paper proposes LP-3DGS, a method for learning to prune Gaussian points in 3D Gaussian Splatting (3DGS) for novel view synthesis. Existing 3DGS pruning techniques require manual tuning of the pruning ratio, which is time-consuming and potentially suboptimal. LP-3DGS aims to automate this process and find the optimal pruning ratio for each scene. LP-3DGS utilizes a trainable binary mask, activated by the Gumbel-Sigmoid function, to determine which Gaussians to prune. This mask is applied to existing importance scores or directly to Gaussian parameters. The method integrates this mask learning into the 3DGS training process. LP-3DGS automatically finds optimal pruning ratios for various scenes, eliminating the need for manual parameter sweeping. The method achieves comparable or better rendering quality with significantly smaller model sizes compared to baselines. LP-3DGS, using a Gumbel-Sigmoid activated mask, outperforms STE-based mask techniques in terms of pruning ratio and rendering quality. The final rendering quality depends on the effectiveness of the chosen importance score. Future work could explore alternative importance metrics or combine multiple metrics for better pruning. novel view synthesis, 3d gaussian splatting, model compression, pruning, gumbel-sigmoid
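A minimal sketch of a Gumbel-Sigmoid mask over per-Gaussian importance scores (shapes, temperature, and the 0.5 threshold are illustrative assumptions): the mask stays differentiable during training and is binarized afterwards to drop low-importance Gaussians.

```python
import torch

def gumbel_sigmoid(logits, tau=0.5):
    g1 = -torch.log(-torch.log(torch.rand_like(logits)))   # Gumbel(0, 1) noise
    g2 = -torch.log(-torch.log(torch.rand_like(logits)))
    return torch.sigmoid((logits + g1 - g2) / tau)          # differentiable values in (0, 1)

num_gaussians = 100_000
mask_logits = torch.zeros(num_gaussians, requires_grad=True)   # learnable mask parameters
importance = torch.rand(num_gaussians)                         # precomputed importance scores
soft_mask = gumbel_sigmoid(mask_logits)
weighted = soft_mask * importance            # would enter the training objective
keep = soft_mask > 0.5                       # binarize after training to prune
print(f"pruning ratio: {(1 - keep.float().mean()).item():.2%}")
```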
2405.18762 Report Inpaint Biases: A Pathway to Accurate and Unbiased Image Generation Jiyoon Myung, Jihyeon Park This paper examines the limitations of advanced text-to-image models in accurately rendering unconventional concepts which are scarcely represented or absent in their training datasets. We identify how these limitations not only confine the creative potential of these models but also pose risks of reinforcing stereotypes. To address these challenges, we introduce the Inpaint Biases framework, which employs user-defined masks and inpainting techniques to enhance the accuracy of image generation, particularly for novel or inaccurately rendered objects. Through experimental validation, we demonstrate how this framework significantly improves the fidelity of generated images to the user's intent, thereby expanding the models' creative capabilities and mitigating the risk of perpetuating biases. Our study contributes to the advancement of text-to-image models as unbiased, versatile tools for creative expression. This paper introduces the Inpaint Biases framework to improve the accuracy of text-to-image models in rendering unconventional concepts. Current text-to-image models struggle to depict concepts not well-represented in their training data, limiting creativity and potentially reinforcing stereotypes. The framework utilizes user-defined masks, the Segment Anything Model (SAM) for segmentation, Large Language Models (LLMs) for prompt refinement, and inpainting techniques to correct specific areas of generated images. The framework successfully rendered unconventional concepts like a chocolate river and a polka-dotted cat. Quantitative analysis using CLIP scores confirmed improved alignment between inpainted images and the desired prompts. The framework demonstrates potential in mitigating bias and enhancing the creative capacity of text-to-image models. The framework currently requires user intervention for mask generation, limiting its autonomy. Future research could explore automated bias detection and correction by the model itself. text-to-image synthesis, bias mitigation, inpainting, generative ai, segment anything model (sam)
2405.18750 Report T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve $\textbf{both fast and high-quality video generation}$. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika. We further conduct human evaluations to corroborate the results, validating that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models, representing more than a tenfold acceleration while improving video generation quality. The paper introduces T2V-Turbo, a text-to-video model that integrates reward feedback from a mixture of differentiable reward models, including a video-text model, during consistency distillation to achieve both fast and high-quality video generation. This work addresses the limitations of existing diffusion-based text-to-video models, which are often slow and struggle to align with human preferences. T2V-Turbo leverages reward feedback from an image-text reward model and a video-text reward model during the consistency distillation process, optimizing single-step generations to improve visual quality and text-video alignment. Achieves state-of-the-art results on the VBench benchmark with only 4 inference steps, surpassing even proprietary models like Gen-2 and Pika. Human evaluations show a preference for 4-step T2V-Turbo generations over 50-step samples from teacher models, indicating significant acceleration and quality improvement. Ablation studies demonstrate the importance of both image-text and video-text reward models in enhancing video generation. Limited availability of open-sourced video-text reward models specifically trained to reflect human preferences. Potential for misuse of realistic synthetic videos, requiring safeguards and ethical guidelines for responsible development and deployment. text-to-video generation, diffusion models, consistency distillation, reward models, human evaluation
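To make the mixed-reward objective concrete, here is a minimal sketch that adds image-text and video-text reward terms (to be maximized, hence subtracted) to a consistency-distillation loss computed on single-step generations; the reward models are passed in as callables, decoding from latents is abstracted away, and the weights and names are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def t2v_turbo_loss(x0_pred, x0_target, frames_pred, text_emb,
                       image_reward, video_reward, w_img=1.0, w_vid=1.0):
        cd_loss = F.mse_loss(x0_pred, x0_target)              # consistency distillation term
        r_img = image_reward(frames_pred, text_emb).mean()    # frame-level image-text reward
        r_vid = video_reward(frames_pred, text_emb).mean()    # clip-level video-text reward
        return cd_loss - w_img * r_img - w_vid * r_vid        # rewards enter with a negative sign

    # Toy usage with dummy differentiable "reward models" standing in for real ones.
    frames = torch.randn(2, 8, 3, 32, 32, requires_grad=True)   # (batch, T, C, H, W)
    text = torch.randn(2, 512)
    dummy_reward = lambda f, t: -((f.mean(dim=(1, 2, 3, 4)) - t.mean(dim=1)) ** 2)
    loss = t2v_turbo_loss(frames, frames.detach(), frames, text, dummy_reward, dummy_reward)
    loss.backward()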
2405.18715 Report NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild Weining Ren, Zihan Zhu, Boyang Sun, Jiaqi Chen, Marc Pollefeys, Songyou Peng Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing photorealistic views from multi-view images of static scenes, but face challenges in dynamic, real-world environments with distractors like moving objects, shadows, and lighting changes. Existing methods manage controlled environments and low occlusion ratios but fall short in render quality, especially under high occlusion scenarios. In this paper, we introduce NeRF On-the-go, a simple yet effective approach that enables the robust synthesis of novel views in complex, in-the-wild scenes from only casually captured image sequences. Delving into uncertainty, our method not only efficiently eliminates distractors, even when they are predominant in captures, but also achieves a notably faster convergence speed. Through comprehensive experiments on various scenes, our method demonstrates a significant improvement over state-of-the-art techniques. This advancement opens new avenues for NeRF in diverse and dynamic real-world applications. This paper introduces NeRF On-the-go, a method for robustly synthesizing novel views from casually captured images in dynamic scenes by effectively removing distractors. Existing NeRF methods struggle with dynamic, real-world environments containing distractors (moving objects, changing lighting, etc.), limiting their practical applications. The method leverages pre-trained DINOv2 features for uncertainty prediction, utilizes a structural similarity loss to enhance uncertainty optimization, and incorporates the predicted uncertainty into a decoupled NeRF training strategy. NeRF On-the-go achieves high-fidelity novel view synthesis even in complex, in-the-wild scenes with varying distractor ratios. The method significantly outperforms state-of-the-art techniques on both synthetic and real-world datasets. NeRF On-the-go demonstrates significantly faster convergence speed compared to prior art. The method faces challenges in predicting accurate uncertainty for regions with strong view-dependent effects. The performance degrades with sparse training views. neural radiance fields, novel view synthesis, distractor removal, uncertainty estimation, dinov2 features
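As one concrete form the uncertainty-weighted training could take, the sketch below predicts a per-ray uncertainty from DINOv2 features with a small MLP and uses a standard heteroscedastic weighting of the photometric loss, so that likely-distractor rays contribute less; the MLP shape and the exact loss form are assumptions, and the paper's structural-similarity-based uncertainty loss and decoupled training strategy are not reproduced here.

    import torch
    import torch.nn as nn

    class UncertaintyHead(nn.Module):
        def __init__(self, feat_dim=384):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 1), nn.Softplus())

        def forward(self, dino_feats):                  # (N_rays, feat_dim)
            return self.mlp(dino_feats) + 1e-3          # beta > 0

    def uncertainty_weighted_rgb_loss(rgb_pred, rgb_gt, beta):
        # Down-weight uncertain (distractor) rays; the log term keeps beta from exploding.
        sq_err = ((rgb_pred - rgb_gt) ** 2).sum(-1, keepdim=True)
        return (sq_err / (2 * beta ** 2) + torch.log(beta)).mean()

    head = UncertaintyHead()
    feats = torch.randn(1024, 384)                      # DINOv2 features sampled at ray pixels
    beta = head(feats)
    loss = uncertainty_weighted_rgb_loss(torch.rand(1024, 3), torch.rand(1024, 3), beta)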
2405.18679 Report Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain Juntao Zhang, Kun Bian, Peng Cheng, Wenbo An, Jianning Liu, Jun Zhou In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences such as language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model's ability to interpret spatial relationships from a global perspective. We use Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation in both frequency and spatial domains. The introduction of frequency domain information enables ViM to have a global receptive field during scanning. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it accordingly in Vim-F, which helps to fully utilize the efficient long-sequence modeling capability of ViM. Finally, we redesign a patch embedding for Vim-F, leveraging a convolutional stem to capture more local correlations, further improving the performance of Vim-F. Code is available at: \url{https://github.com/yws-wxs/Vim-F}. This paper proposes Vim-F(H), a novel visual backbone based on State Space Models (SSMs) that incorporates frequency domain scanning and a hybrid patch embedding to enhance the model's ability to capture global spatial relationships and local dependencies. Vision Mamba (ViM) methods, while promising for modeling long sequences, are not yet fully competitive with traditional CNNs and ViTs due to their limitations in processing 2D image data and capturing global spatial relationships. The authors introduce frequency domain scanning using Fast Fourier Transform (FFT) to provide a global receptive field during scanning. Additionally, they design a hybrid patch embedding with overlapping and non-overlapping convolutions for better capturing local correlations. These improvements are implemented based on the Vim model, resulting in Vim-F(H). Vim-F(H) significantly outperforms the baseline Vim model on ImageNet-1K classification, achieving 1.3% and 0.8% higher accuracy for Vim-Ti-F(H) and Vim-S-F(H) respectively. The frequency domain scanning effectively reduces the model's reliance on positional embeddings while maintaining a global receptive field. Vim-F(H) achieves competitive results compared to advanced CNNs, ViTs, and ViMs on object detection and instance segmentation tasks using Mask R-CNN on the COCO dataset. The effectiveness of the proposed method for ViMs with hybrid encoders has not been fully studied. Further investigation is needed to explore more complex spatial relationships in the frequency domain. vision mamba, state space models, frequency domain scanning, patch embedding, computer vision
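The frequency-domain branch described above is simple enough to sketch directly: take the 2D FFT of a feature map, keep the magnitude spectrum, and add it back onto the spatial features before flattening them into a 1D token sequence; the absence of any normalization or learned fusion here is a simplifying assumption.

    import torch

    def add_frequency_branch(x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, H, W) feature map -> same-shape features fusing spatial and spectral information."""
        spectrum = torch.fft.fft2(x, norm="ortho")       # complex spectrum per channel
        amplitude = spectrum.abs()                       # real-valued magnitude
        return x + amplitude                             # inject global frequency context into the feature map

    feat = torch.randn(2, 96, 14, 14)
    fused = add_frequency_branch(feat)                   # ready to flatten into a 1D sequence for the Mamba encoder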
2405.18677 Report Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering Ido Sobol, Chenfeng Xu, Or Litany Generating realistic images from arbitrary views based on a single source image remains a significant challenge in computer vision, with broad applications ranging from e-commerce to immersive virtual experiences. Recent advancements in diffusion models, particularly the Zero-1-to-3 model, have been widely adopted for generating plausible views, videos, and 3D models. However, these models still struggle with inconsistencies and implausibility in new views generation, especially for challenging changes in viewpoint. In this work, we propose Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps during the denoising process of Zero-1-to-3. By drawing an analogy between the denoising process and stochastic gradient descent (SGD), we implement a filtering mechanism that aggregates attention maps, enhancing generation reliability and authenticity. This process improves geometric consistency without requiring retraining or significant computational resources. Additionally, we modify the self-attention mechanism to integrate information from the source view, reducing shape distortions. These processes are further supported by a specialized sampling schedule. Experimental results demonstrate substantial improvements in fidelity and consistency, validated on a diverse set of out-of-distribution objects. Zero-to-Hero, a test-time technique to address view synthesis artifacts in Zero-1-to-3 through attention map manipulation, enhancing realism and consistency. Generating realistic images from single source images at arbitrary views is challenging, and existing diffusion models like Zero-1-to-3 have limitations in generating plausible and consistent novel views. Draws an analogy between denoising and SGD, implementing an attention map filtering mechanism (iterative aggregation and averaging) for robust view generation, enhanced by mutual self-attention for shape guidance and a specialized sampling schedule. Substantial improvement in fidelity and consistency of generated novel views. Significant improvement across appearance and shape evaluation metrics (PSNR, SSIM, LPIPS, IoU). Robustness to random noise and ability to mitigate artifacts observed in the baseline model. Performance limited by the pre-trained model's capabilities. Attention filtering, while enhancing realism, may limit generation diversity. novel view synthesis, diffusion models, attention mechanism, test-time refinement, computer vision
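One plausible reading of the attention-map filtering described above is averaging the maps produced by several re-noised runs of the same denoising step, in line with the stated SGD analogy; the sketch below abstracts the Zero-1-to-3 UNet behind a placeholder callable, and the noise scale and sample count are arbitrary illustrative choices rather than the paper's procedure.

    import torch

    def filtered_attention(step_fn, latent, t, n_samples=4, noise_scale=0.1):
        """Average attention maps over several re-noised runs of one denoising step."""
        maps = []
        for _ in range(n_samples):
            perturbed = latent + noise_scale * torch.randn_like(latent)   # analogous to SGD mini-batch noise
            attn, _ = step_fn(perturbed, t)
            maps.append(attn)
        return torch.stack(maps).mean(dim=0)            # aggregated, lower-variance attention map

    # Dummy stand-in for a UNet step that also exposes an attention map.
    dummy_step = lambda z, t: (torch.softmax(z.flatten(1), dim=-1), z)
    avg_attn = filtered_attention(dummy_step, torch.randn(1, 4, 8, 8), t=500)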
2405.18654 Report Mitigating Object Hallucination via Data Augmented Contrastive Tuning Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö. Arık, Tomas Pfister Despite their remarkable progress, Multimodal Large Language Models (MLLMs) tend to hallucinate factually inaccurate information. In this work, we address object hallucinations in MLLMs, where information is offered about an object that is not present in the model input. We introduce a contrastive tuning method that can be applied to a pretrained off-the-shelf MLLM for mitigating hallucinations while preserving its general vision-language capabilities. For a given factual token, we create a hallucinated token through generative data augmentation by selectively altering the ground-truth information. The proposed contrastive tuning is applied at the token level to improve the relative likelihood of the factual token compared to the hallucinated one. Our thorough evaluation confirms the effectiveness of contrastive tuning in mitigating hallucination. Moreover, the proposed contrastive tuning is simple, fast, and requires minimal training with no additional overhead at inference. Introduces a contrastive tuning method for mitigating object hallucinations in Multimodal Large Language Models (MLLMs) while preserving their general vision-language capabilities. Object hallucination, where MLLMs generate descriptions of objects not present in the input, hinders their reliability and widespread use. Generative data augmentation is used to create hallucinated responses by altering ground-truth objects. Contrastive tuning is then applied at the token level to improve the likelihood of factual tokens compared to hallucinated ones. A KL-divergence constraint ensures the MLLM retains its original performance in general vision-language tasks. HALVA substantially reduces hallucination in image descriptions compared to the base LLaVA model, matching or exceeding the performance of other methods. HALVA significantly improves performance on discriminative tasks related to object attributes, presence, and relations, surpassing existing methods. Contrastive tuning retains or improves the performance of the base LLaVA model on standard vision-language benchmarks, unlike other methods that degrade general task ability. The current work primarily focuses on mitigating object hallucinations. More research is needed to address other forms of hallucinations in MLLMs. Future work includes generalizing the proposed generative data augmentation and contrastive tuning to other foundation models with accessible weights. multimodal large language models, hallucination mitigation, contrastive tuning, generative data augmentation, vision-language tasks
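A minimal sketch of token-level contrastive tuning in this spirit is given below: it raises the log-likelihood of a factual token relative to its hallucinated counterpart and adds a KL term to a frozen reference model; the exact objective, the pairing of tokens, and the KL weight are assumptions for illustration, not the paper's released loss.

    import torch
    import torch.nn.functional as F

    def contrastive_tuning_loss(logits, ref_logits, factual_ids, hallucinated_ids, kl_weight=0.1):
        """logits, ref_logits: (N, vocab); factual_ids, hallucinated_ids: (N,) token indices."""
        logp = F.log_softmax(logits, dim=-1)
        lp_fact = logp.gather(1, factual_ids[:, None]).squeeze(1)
        lp_hall = logp.gather(1, hallucinated_ids[:, None]).squeeze(1)
        contrast = -F.logsigmoid(lp_fact - lp_hall).mean()          # prefer factual over hallucinated token
        kl = F.kl_div(logp, F.log_softmax(ref_logits, dim=-1),
                      log_target=True, reduction="batchmean")        # stay close to the base MLLM
        return contrast + kl_weight * kl

    loss = contrastive_tuning_loss(torch.randn(8, 32000), torch.randn(8, 32000),
                                   torch.randint(0, 32000, (8,)), torch.randint(0, 32000, (8,)))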
2405.18616 Report Wavelet-Based Image Tokenizer for Vision Transformers Zhenhai Zhu, Radu Soricut Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision Transformer (ViT) models. Even though many ViT variants have been proposed to improve its efficiency and accuracy, little research on improving the image tokenizer itself has been reported in the literature. In this paper, we propose a new image tokenizer based on wavelet transformation. We show that ViT models with the new tokenizer achieve both higher training throughput and better top-1 precision for the ImageNet validation set. We present a theoretical analysis on why the proposed tokenizer improves the training throughput without any change to ViT model architecture. Our analysis suggests that the new tokenizer can effectively handle high-resolution images and is naturally resistant to adversarial attack. Furthermore, the proposed image tokenizer offers a fresh perspective on important new research directions for ViT-based model design, such as image tokens on a non-uniform grid for image understanding. This paper proposes a novel image tokenizer for Vision Transformer (ViT) models based on wavelet transformation, replacing the conventional patch-wise convolution. This is crucial as it addresses the limitations of existing patch-convolution tokenizers in handling high-resolution images and their vulnerability to adversarial attacks. The proposed method offers higher efficiency and improved accuracy. The method leverages the wavelet transformation's ability to compress redundant image information. It introduces pixel-space token embedding using wavelet coefficients and utilizes block sparse projection to map them to semantically meaningful lower-dimensional embeddings. ViT models with the wavelet-based tokenizer achieve higher training throughput due to reduced embedding size and efficient handling of high-resolution images. The models demonstrate better top-1 precision on the ImageNet validation set compared to those using patch-convolution tokenizers. The inherent properties of wavelet transformation make the tokenizer naturally resistant to adversarial attacks. The paper primarily focuses on image classification, and further investigation is needed to evaluate the tokenizer's performance on other vision tasks like object detection and semantic segmentation. Future work includes exploring the use of non-uniform image partitioning guided by the sparsity of wavelet coefficients to further enhance the tokenizer's efficiency. vision transformer, image tokenizer, wavelet transformation, image compression, adversarial robustness
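To ground the idea of replacing patch convolution with wavelet coefficients, here is a toy sketch using PyWavelets that packs a two-level Haar decomposition into per-block pixel-space token vectors and projects them with a plain linear layer standing in for the paper's block sparse projection; the fixed-block grouping of the packed coefficient array is a simplification for illustration only.

    import numpy as np
    import pywt
    import torch

    def wavelet_tokens(image: np.ndarray, levels: int = 2) -> torch.Tensor:
        """image: (H, W) grayscale; returns one token per 2^levels x 2^levels block of the packed coefficient array."""
        coeffs = pywt.wavedec2(image, wavelet="haar", level=levels)
        arr, _ = pywt.coeffs_to_array(coeffs)            # pack all sub-bands into one array
        h, w = arr.shape
        p = 2 ** levels                                  # token grid coarser by the decomposition depth
        tokens = arr.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3).reshape(-1, p * p)
        return torch.from_numpy(tokens).float()          # (num_tokens, p*p) pixel-space embeddings

    img = np.random.rand(256, 256)
    tok = wavelet_tokens(img)                            # 4096 tokens of dimension 16 for this toy setting
    proj = torch.nn.Linear(tok.shape[1], 768)            # plain linear stand-in for the block sparse projection
    embeddings = proj(tok)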
2405.18525 Report REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment Haonan Han, Rui Yang, Huan Liao, Jiankai Xing, Zunnan Xu, Xiaoming Yu, Junwei Zha, Xiu Li, Wanhua Li Traditional image-to-3D models often struggle with scenes containing multiple objects due to biases and occlusion complexities. To address this challenge, we present REPARO, a novel approach for compositional 3D asset generation from single images. REPARO employs a two-step process: first, it extracts individual objects from the scene and reconstructs their 3D meshes using off-the-shelf image-to-3D models; then, it optimizes the layout of these meshes through differentiable rendering techniques, ensuring coherent scene composition. By integrating optimal transport-based long-range appearance loss term and high-level semantic loss term in the differentiable rendering, REPARO can effectively recover the layout of 3D assets. The proposed method can significantly enhance object independence, detail accuracy, and overall scene coherence. Extensive evaluation of multi-object scenes demonstrates that our REPARO offers a comprehensive approach to address the complexities of multi-object 3D scene generation from single images. REPARO is a novel two-step approach for generating compositional 3D assets from single images by first reconstructing individual objects and then refining their layout through differentiable rendering with a long-range appearance loss and a high-level semantic loss. Existing image-to-3D models struggle to accurately reconstruct multi-object scenes due to center bias and occlusion complexities, making it challenging to generate interactive and realistic multi-object environments. REPARO first extracts and reconstructs 3D meshes for individual objects from a single image. Then, it uses differentiable rendering with optimal transport-based long-range appearance loss and high-level semantic loss to optimize the layout of these meshes, ensuring a coherent scene composition. REPARO significantly outperforms existing image-to-3D models in generating compositional 3D scenes, as demonstrated by quantitative metrics (CLIP score, PSNR, SSIM, LPIPS) on the GSO dataset. The use of long-range appearance loss with optimal transport enables REPARO to effectively align the layout of reconstructed objects with the reference image. A user study confirms that REPARO generates more realistic multi-object 3D assets compared to baseline models, as evidenced by its higher preference score. The evaluation of multi-object 3D assets reveals a discrepancy between quantitative and qualitative results, suggesting the need for improved evaluation methods in future research. The current implementation of REPARO relies on pre-trained 2D foundation models for segmentation, inpainting, and depth estimation, which could potentially limit its generalization ability. 3d scene generation, compositional 3d assets, differentiable rendering, layout alignment, optimal transport
2405.18524 Report Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures Hongjun Wu, Li Xiao, Xingkuo Zhang, Yining Miao Knowledge distillation is commonly employed to compress neural networks, reducing the inference costs and memory footprint. In the scenario of homogeneous architectures, feature-based methods have been widely validated for their effectiveness. However, in scenarios where the teacher and student models are of heterogeneous architectures, the inherent differences in feature representation significantly degrade the performance of these methods. Recent studies have highlighted that low-frequency components constitute the majority of image features. Motivated by this, we propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation between heterogeneous architectures. Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models, aligning them in a compact space to overcome architectural disparities. Moreover, leveraging the intrinsic pairing characteristic of the teacher-student framework, we design an innovative sample-level contrastive learning framework that adeptly restructures the constraints of within-sample feature similarity and between-sample feature divergence into a contrastive learning task. This strategy enables the student model to capitalize on intra-sample feature congruence while simultaneously enhancing the discrimination of features among disparate samples. Consequently, our LFCC framework accurately captures the commonalities in feature representation across heterogeneous architectures. Extensive evaluations and empirical analyses across three architectures (CNNs, Transformers, and MLPs) demonstrate that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100. All codes will be publicly available. This paper proposes LFCC, a Low-Frequency Components-based Contrastive Knowledge Distillation framework to improve feature-based knowledge distillation in heterogeneous architectures. Feature-based distillation often underperforms in heterogeneous settings due to significant differences in feature representations between architectures. This limits the potential teacher-student pairings and hinders knowledge transfer. LFCC uses multi-scale low-pass filters to extract and align low-frequency components of teacher and student features in a compact space. It also employs sample-level contrastive learning to enhance feature discrimination between different samples. LFCC outperforms state-of-the-art methods on ImageNet-1K and CIFAR-100 for most teacher-student pairings. The method effectively identifies commonalities in feature representations across diverse architectures. Ablation studies confirm the contribution of each component in LFCC. Logit-based methods still outperform feature-based methods on small datasets like CIFAR-100, especially with Transformer or MLP students. Future work could explore alternative low-pass filter designs and contrastive learning strategies. knowledge distillation, heterogeneous architectures, low-frequency components, contrastive learning, feature alignment
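As a concrete reading of the two ingredients named above, the sketch below low-pass filters teacher and student features with average pooling (one simple low-pass filter), projects them into a shared space, and applies a sample-level InfoNCE loss; the pooling kernel, projection dimensions, and temperature are illustrative assumptions, not the paper's multi-scale filter design.

    import torch
    import torch.nn.functional as F

    def low_freq(feat: torch.Tensor, k: int = 4) -> torch.Tensor:
        """(B, C, H, W) -> low-frequency map via average pooling, a simple low-pass filter."""
        return F.avg_pool2d(feat, kernel_size=k)

    def lfcc_contrastive(student_feat, teacher_feat, proj_s, proj_t, tau=0.1):
        zs = F.normalize(proj_s(low_freq(student_feat).flatten(1)), dim=-1)   # (B, D) compact space
        zt = F.normalize(proj_t(low_freq(teacher_feat).flatten(1)), dim=-1)   # (B, D) compact space
        logits = zs @ zt.t() / tau                   # positives on the diagonal, negatives elsewhere
        targets = torch.arange(zs.size(0))
        return F.cross_entropy(logits, targets)

    B = 8
    s, t = torch.randn(B, 64, 16, 16), torch.randn(B, 256, 16, 16)
    proj_s = torch.nn.Linear(64 * 4 * 4, 128)
    proj_t = torch.nn.Linear(256 * 4 * 4, 128)
    loss = lfcc_contrastive(s, t, proj_s, proj_t)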
2405.18515 Report Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication Yunuo Chen, Tianyi Xie, Zeshun Zong, Xuan Li, Feng Gao, Yin Yang, Ying Nian Wu, Chenfanfu Jiang Existing diffusion-based text-to-3D generation methods primarily focus on producing visually realistic shapes and appearances, often neglecting the physical constraints necessary for downstream tasks. Generated models frequently fail to maintain balance when placed in physics-based simulations or 3D printed. This balance is crucial for satisfying user design intentions in interactive gaming, embodied AI, and robotics, where stable models are needed for reliable interaction. Additionally, stable models ensure that 3D-printed objects, such as figurines for home decoration, can stand on their own without requiring additional supports. To fill this gap, we introduce Atlas3D, an automatic and easy-to-implement method that enhances existing Score Distillation Sampling (SDS)-based text-to-3D tools. Atlas3D ensures the generation of self-supporting 3D models that adhere to physical laws of stability under gravity, contact, and friction. Our approach combines a novel differentiable simulation-based loss function with physically inspired regularization, serving as either a refinement or a post-processing module for existing frameworks. We verify Atlas3D's efficacy through extensive generation tasks and validate the resulting 3D models in both simulated and real-world environments. Atlas3D is a novel method that integrates physics-based constraints into existing text-to-3D generation models, enabling the generation of self-supporting 3D models suitable for simulation and 3D printing. Existing text-to-3D methods often neglect physical plausibility, resulting in models that lack standability. This is crucial for applications like gaming, robotics, and 3D printing where stability is essential. Atlas3D incorporates differentiable physics simulations and physically-inspired regularizations into the generation process. It leverages standability and stable equilibrium loss functions during training to ensure generated models are self-supporting. This method can be integrated into existing text-to-3D frameworks as a refinement or post-processing step. Atlas3D generates self-supporting 3D models that remain stable in physics simulations, outperforming baseline models in stability tests. The generated models exhibit robustness to perturbations, successfully standing even with small initial rotations. The method's effectiveness is validated through real-world 3D printing, with printed models demonstrating superior standability compared to those generated without physics constraints. The optimization process currently allows for a large degree of freedom in mesh vertex adjustments, potentially leading to undesirable distortions. The current framework focuses on SDS-based methods. Future work could explore generalizing to non-SDS or non-diffusion based methods. text-to-3d generation, physics-based simulation, 3d printing, stable equilibrium, differentiable rendering
2405.18428 Report DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with scalability and quadratic complexity efficiency. In this paper, we aim to leverage the long sequence modeling capability of Gated Linear Attention (GLA) Transformers, expanding its applicability to diffusion models. We introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead, following the DiT design, but offering superior efficiency and effectiveness. In addition to better performance than DiT, DiG-S/2 exhibits $2.5\times$ higher training speed than DiT-S/2 and saves $75.7\%$ GPU memory at a resolution of $1792 \times 1792$. Moreover, we analyze the scalability of DiG across a variety of computational complexity. DiG models, with increased depth/width or augmentation of input tokens, consistently exhibit decreasing FID. We further compare DiG with other subquadratic-time diffusion models. With the same model size, DiG-XL/2 is $4.2\times$ faster than the recent Mamba-based diffusion model at a $1024$ resolution, and is $1.8\times$ faster than DiT with CUDA-optimized FlashAttention-2 under the $2048$ resolution. All these results demonstrate its superior efficiency among the latest diffusion models. Code is released at https://github.com/hustvl/DiG. Introduces Diffusion Gated Linear Attention Transformers (DiG), a more efficient and effective alternative to Diffusion Transformers (DiT) for visual content generation. Addresses the scalability and quadratic complexity limitations of existing Diffusion Transformer (DiT) models in image generation. Leverages the long sequence modeling capability of Gated Linear Attention (GLA) Transformers within a diffusion model framework, closely following the design of DiT. Achieves better performance than DiT with significantly faster training (2.5x) and reduced memory footprint (75.7% reduction). Demonstrates strong scalability with consistent FID improvement as model depth/width or input tokens increase. Outperforms other subquadratic-time diffusion models in terms of speed, being 4.2x faster than Mamba-based models and 1.8x faster than DiT with FlashAttention-2. Exploration of alternative linear attention mechanisms for further efficiency gains. Investigating the application of DiG to other generative modeling tasks beyond image generation. diffusion models, image generation, gated linear attention, transformers, scalability
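Because the row above leans on gated linear attention without spelling it out, here is a simplified recurrent reference implementation of a GLA-style update (a running key-value state decayed by a learned gate), assuming PyTorch; head splitting, the exact gating parameterization, and the hardware-aware chunked kernel that gives DiG its speed are all omitted.

    import torch

    def gated_linear_attention(q, k, v, gate):
        """q, k, v, gate: (B, T, D); returns (B, T, D). gate in (0, 1) decays the running state."""
        B, T, D = q.shape
        state = torch.zeros(B, D, D, device=q.device)            # running K^T V memory
        outputs = []
        for t in range(T):
            g = gate[:, t].unsqueeze(-1)                          # (B, D, 1) per-key decay
            state = g * state + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)   # outer-product update
            outputs.append(torch.einsum("bd,bde->be", q[:, t], state))
        return torch.stack(outputs, dim=1)

    B, T, D = 2, 16, 32
    q, k, v = torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D)
    gate = torch.sigmoid(torch.randn(B, T, D))
    out = gated_linear_attention(q, k, v, gate)                   # cost grows linearly in sequence length T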
2405.18425 Report ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention Bencheng Liao, Xinggang Wang, Lianghui Zhu, Qian Zhang, Chang Huang Recently, linear complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory. However, their advantage in terms of actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling and a 2D gating locality injection to adaptively inject 2D local details into 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2$\times$ faster on $224\times224$ images. At $1024\times1024$ resolution, ViG-T uses 5.2$\times$ fewer FLOPs, saves 90% GPU memory, runs 4.8$\times$ faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. These results position ViG as an efficient and scalable solution for visual representation learning. Code is available at \url{https://github.com/hustvl/ViG}. This paper introduces ViG, a novel vision backbone network leveraging Gated Linear Attention (GLA) for efficient and accurate visual representation learning. Existing methods like Vision Transformers, while effective, suffer from quadratic complexity, hindering their application to high-resolution images. Linear complexity alternatives often lack global context or face practical efficiency challenges. ViG addresses these limitations by combining the efficiency of linear complexity with global receptive field capture. ViG introduces three key innovations: a) a Bidirectional Gated Linear Attention (BiGLA) layer for capturing global 1D context, b) a direction-wise gating mechanism within BiGLA to select context from different directions, and c) a 2D gating locality injection mechanism to integrate 2D local information. A hardware-aware implementation further boosts efficiency. ViG achieves superior accuracy and parameter efficiency compared to non-hierarchical and hierarchical models on ImageNet. In downstream tasks like object detection and semantic segmentation, ViG consistently outperforms ViT and VRWKV with lower computational cost. ViG exhibits superior resolution extrapolation capability, outperforming ViT, Vim, VRWKV, and ResNet50 in accuracy as image resolution increases. While ViG with hardware-aware implementation demonstrates efficiency improvements, it remains marginally slower than DeiT for small 224x224 images, requiring further optimization. Future work will explore adapting and extending ViG for other vision tasks beyond classification, detection, and segmentation. vision transformer, gated linear attention, linear complexity, global receptive field, visual representation learning
2405.18424 Report 3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting Qihang Zhang, Yinghao Xu, Chaoyang Wang, Hsin-Ying Lee, Gordon Wetzstein, Bolei Zhou, Ceyuan Yang Scene image editing is crucial for entertainment, photography, and advertising design. Existing methods solely focus on either 2D individual object or 3D global scene editing. This results in a lack of a unified approach to effectively control and manipulate scenes at the 3D level with different levels of granularity. In this work, we propose 3DitScene, a novel and unified scene editing framework leveraging language-guided disentangled Gaussian Splatting that enables seamless editing from 2D to 3D, allowing precise control over scene composition and individual objects. We first incorporate 3D Gaussians that are refined through generative priors and optimization techniques. Language features from CLIP then introduce semantics into 3D geometry for object disentanglement. With the disentangled Gaussians, 3DitScene allows for manipulation at both the global and individual levels, revolutionizing creative expression and empowering control over scenes and objects. Experimental results demonstrate the effectiveness and versatility of 3DitScene in scene image editing. Code and online demo can be found at our project homepage: https://zqh0253.github.io/3DitScene/. 3DitScene is a novel scene editing framework leveraging language-guided disentangled Gaussian Splatting, enabling seamless 2D-to-3D editing and granular control over scene composition and objects. Existing scene editing methods are limited to either 2D object or 3D global scene manipulation, lacking a unified approach for precise control at different levels. 3DitScene refines 3D Gaussians projected from a single image using generative priors and optimization, then distills CLIP language features for object disentanglement. Enables simultaneous 2D and 3D editing, including object manipulation and novel view synthesis. Outperforms baselines in user studies and qualitative comparisons regarding editing flexibility and consistency. Disentangled scene representation improves optimization by allowing object-level layout augmentation. Object manipulation evaluation is challenging due to varying coordinate systems across methods. Further exploration of user interaction methods for intuitive scene manipulation. image editing, 3d scene generation, gaussian splatting, language guidance, scene disentanglement
2405.18416 Report 3D StreetUnveiler with Semantic-Aware 2DGS Jingwei Xu, Yikai Wang, Yiqun Zhao, Yanwei Fu, Shenghua Gao Unveiling an empty street from crowded observations captured by in-car cameras is crucial for autonomous driving. However, removing all temporary static objects, such as stopped vehicles and standing pedestrians, presents a significant challenge. Unlike object-centric 3D inpainting, which relies on thorough observation in a small scene, street scenes involve long trajectories that differ from previous 3D inpainting tasks. The camera-centric moving environment of captured videos further complicates the task due to the limited degree and time duration of object observation. To address these obstacles, we introduce StreetUnveiler to reconstruct an empty street. StreetUnveiler learns a 3D representation of the empty street from crowded observations. Our representation is based on the hard-label semantic 2D Gaussian Splatting (2DGS) for its scalability and ability to identify Gaussians to be removed. We inpaint rendered image after removing unwanted Gaussians to provide pseudo-labels and subsequently re-optimize the 2DGS. Given its temporal continuous movement, we divide the empty street scene into observed, partial-observed, and unobserved regions, which we propose to locate through a rendered alpha map. This decomposition helps us to minimize the regions that need to be inpainted. To enhance the temporal consistency of the inpainting, we introduce a novel time-reversal framework to inpaint frames in reverse order and use later frames as references for earlier frames to fully utilize the long-trajectory observations. Our experiments conducted on the street scene dataset successfully reconstructed a 3D representation of the empty street. The mesh representation of the empty street can be extracted for further applications. Project page and more visualizations can be found at: https://streetunveiler.github.io StreetUnveiler, a novel method for reconstructing an empty 3D street scene from in-car camera videos by removing temporary static objects like cars and pedestrians. Crucial for autonomous driving by providing realistic simulations of empty street environments, which is seldom studied due to the challenges in handling long camera trajectories and the lack of ground-truth data for training. Uses hard-label semantic 2D Gaussian Splatting (2DGS) for scene representation and proposes a time-reversal inpainting framework to maintain consistency across different viewpoints in long video sequences. Achieves accurate removal of static objects from street scenes, reconstructing empty environments with high fidelity. Outperforms existing 3D inpainting methods in terms of appearance quality, as measured by LPIPS and FID scores. Successfully extracts a clean and realistic mesh of the empty street scene using TSDF fusion. Relies on the accuracy of the 2D semantic segmentation model for reliable object removal. Computational cost grows linearly with the number of video frames due to the per-frame inpainting process. 3d scene reconstruction, street scene understanding, object removal, gaussian splatting, time-reversal inpainting
2405.18415 Report Why are Visually-Grounded Language Models Bad at Image Classification? Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training and instruction-tuning and the VLM's performance in those classes; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance of the VLM transfers to its general capabilities, resulting in an improvement of 11.8% on the newly collected ImageWikiQA dataset. This paper investigates the use of visually-grounded language models (VLMs) for image classification and finds that they significantly underperform compared to specialized image classifiers like CLIP, despite often using CLIP as a vision encoder. Image classification is a fundamental aspect of machine vision, and understanding why VLMs struggle with this task is crucial for improving their overall visual intelligence and enabling them to tackle more complex visual tasks like visual question answering. The authors evaluate various VLMs on standard image classification benchmarks and analyze their performance in different settings, exploring hypotheses related to inference algorithms, training objectives, and data used during VLM training. VLMs exhibit significantly lower accuracy in image classification compared to CLIP models, even when provided with class names as context. The primary cause of poor classification performance in VLMs is attributed to the training data, specifically the insufficient exposure to diverse classes and lack of classification-focused data. Integrating classification-focused datasets into VLM training enhances both their classification accuracy and general capabilities, leading to improved performance on tasks like visual question answering. The study is limited by the computational cost of evaluating all possible VLMs and datasets, focusing on two representative VLM architectures and four datasets. Future work could explore zero-shot methods to decode the classification information encoded in the VLM's latent space without extensive fine-tuning. visually-grounded language models, image classification, vlm analysis, data-centric ai, visual question answering
2405.18407 Report Phased Consistency Model Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Hongsheng Li, Xiaogang Wang The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a., LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1--16 step generation settings. While PCM is specifically designed for multi-step refinement, it achieves even superior or comparable 1-step generation results to previously state-of-the-art specifically designed 1-step methods. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/. This paper proposes Phased Consistency Model (PCM), generalizing the design of Latent Consistency Models (LCM) to accelerate high-resolution text-conditioned image and video generation in latent diffusion models. LCMs, aimed at accelerating diffusion model generation, are limited in quality and efficiency for high-resolution, text-conditioned synthesis. PCM tackles these limitations. PCM phases the ODE trajectory into sub-trajectories, enforcing self-consistency within each, allowing for deterministic sampling. It removes CFG from distillation to improve controllability and introduces an adversarial loss to enhance low-step generation. PCM significantly outperforms LCM across 1-16 step generation settings, achieving state-of-the-art few-step generation. PCM achieves superior or comparable 1-step generation quality compared to specialized 1-step methods. PCM's methodology is successfully applied to video generation, resulting in state-of-the-art few-step text-to-video synthesis. While improved, generation quality can be unstable at very low step counts (e.g., one-step). Future work includes exploring architectural improvements for enhanced efficiency and control. generative models, diffusion models, consistency models, text-to-image synthesis, text-to-video synthesis
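To illustrate what phasing the trajectory means in practice, the toy sketch below splits the diffusion timestep range into sub-trajectories and maps each timestep to its phase boundary, which in this reading is the target of the per-phase self-consistency constraint; the number of phases and timesteps are illustrative values, not the paper's settings.

    import torch

    def phase_boundaries(num_train_timesteps=1000, num_phases=4):
        edges = torch.linspace(0, num_train_timesteps, num_phases + 1).long()
        return edges                                      # e.g. tensor([0, 250, 500, 750, 1000])

    def target_timestep(t, edges):
        """Map timestep t to the lower edge of its sub-trajectory (its phase boundary)."""
        idx = torch.bucketize(t, edges, right=True) - 1
        return edges[idx.clamp(min=0)]

    edges = phase_boundaries()
    print(target_timestep(torch.tensor([10, 300, 999]), edges))   # tensor([  0, 250, 750])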
2405.18406 Report RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives Jaehong Yoon, Shoubin Yu, Mohit Bansal Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing, changing subjects, and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN suggests a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, simplifying precise video content editing based on text for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. It supports the addition of video objects, inpainting, and attribute modification within a unified framework, surpassing existing video editing and inpainting benchmarks. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement. Presents RACCOON, a user-friendly video-to-paragraph-to-video framework enabling video content editing (removal, addition, modification) via auto-generated narratives, eliminating the need for manual text prompts. Existing video editing models require labor-intensive textual descriptions of videos, limiting flexibility and user-friendliness for personal video editing. Two stages: (1) Video-to-Paragraph (V2P): Uses a multimodal LLM with multi-granular spatiotemporal pooling to generate detailed video descriptions capturing holistic context and object details. (2) Paragraph-to-Video (P2V): Leverages user-modified auto-generated descriptions to guide a video diffusion model for editing (adding, removing, changing objects) via inpainting. Achieves up to 9.4% improvement in human evaluations for V2P compared to baselines, demonstrating superior video description quality. Outperforms previous video editing methods with a relative 49.7% improvement in FVD, indicating better video quality and adherence to textual instructions. Demonstrates the ability to enhance state-of-the-art video generation models by providing detailed auto-generated prompts. Performance depends on the quality of employed pre-trained backbones (LLM, inpainting model, video diffusion model). Potential for inaccuracies or hallucinations in generated text outputs, inheriting biases from training data. video editing, video generation, video captioning, multimodal learning, large language models
2405.18361 Report Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang Rapid advancements in Autonomous Driving (AD) have driven a significant shift toward end-to-end approaches, particularly in the utilization of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car planning, which lack 3D geometric priors as a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental captioning suggests that the answer is, unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which connect to the LLM through a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released. This paper introduces Atlas, a 3D-tokenized Large Language Model (LLM) framework for reliable autonomous driving, which addresses the limitations of 2D-tokenized LLMs in accurately perceiving 3D environments. Accurately perceiving the 3D environment is crucial for reliable autonomous driving planning. Existing VLM-based approaches often rely on 2D tokenizers, which lack the inherent 3D geometric priors. The authors replace 2D vision tokenizers with DETR-style 3D perception models (StreamPETR and TopoMLP) as 3D tokenizers, connecting them to an LLM (Vicuna) via a linear projector. The model is evaluated on the nuScenes dataset for various tasks like 3D detection, lane detection, and planning. 2D-tokenized LLMs show significantly lower performance than task-specific models in 3D perception tasks like object detection and lane detection. Atlas, with 3D tokenizers, achieves superior performance in both 3D perception and open-loop planning, surpassing state-of-the-art BEV-based methods. The study highlights the importance of 3D priors, resolution, and temporal modeling in autonomous driving, demonstrating the effectiveness of using 3D tokenizers in VLM-based approaches. The model is only evaluated on the open-loop nuScenes dataset and needs further testing in closed-loop environments. The paper lacks direct performance comparison with other VLM-based AD methods due to the unavailability of their code. autonomous driving, vision-language models, 3d perception, motion planning, large language models
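Since the summary hinges on a single linear projector bridging the 3D perceptron and the LLM, here is a minimal sketch of that glue layer, assuming PyTorch; the query count, feature dimensions, and the StreamPETR/Vicuna stand-ins are illustrative placeholders rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    num_queries, perception_dim, llm_dim = 900, 256, 4096
    projector = nn.Linear(perception_dim, llm_dim)                # the one-layer linear "glue"

    query_feats = torch.randn(1, num_queries, perception_dim)     # outputs of a StreamPETR-style 3D perceptron (placeholder)
    vision_tokens = projector(query_feats)                        # (1, 900, 4096) 3D tokens in LLM embedding space
    text_tokens = torch.randn(1, 32, llm_dim)                     # embedded prompt tokens (placeholder)
    llm_input = torch.cat([vision_tokens, text_tokens], dim=1)    # sequence fed to the LLM backbone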
2405.18326 Report VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, Xiaodan Liang Video try-on stands as a promising area for its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, while underperforming on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity preservation ControlNet. To faithfully recover the clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Unlike existing attempts that require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability, VITON-DiT alleviates this by relying solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporal consistent try-on results for in-the-wild videos with complicated human poses. This paper proposes VITON-DiT, the first Diffusion Transformer (DiT)-based video try-on network capable of generating temporally consistent try-on videos in real-world scenarios with complex poses and backgrounds. Existing video try-on methods are limited to product images, short video generation, and struggle with complex scenes. This work leverages the power of DiT for realistic and scalable video try-on. VITON-DiT integrates a spatio-temporal denoising DiT, a garment extractor, and an ID ControlNet connected by an attention fusion mechanism. It also employs a random selection training strategy and an Interpolated Auto-Regressive (IAR) technique for long video generation. VITON-DiT outperforms previous state-of-the-art methods in terms of spatio-temporal consistency on a challenging benchmark dataset. The proposed attention fusion mechanism effectively preserves garment details during video generation. The model demonstrates strong data scalability, with performance improving as the quantity and quality of unpaired training data increases. While demonstrating strong performance on complex scenes, VITON-DiT's quantitative scores on product clothing images are slightly lower than some baselines trained on similar paired datasets. The computational cost of DiT-based models remains high, presenting challenges for real-time applications. video try-on, diffusion models, diffusion transformers, unpaired learning, computer vision
2405.18304 Report Multi-modal Generation via Cross-Modal In-Context Learning Amandeep Kumar, Muzammal Naseer, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal In this work, we study the problem of generating novel images from complex multimodal prompt sequences. While existing methods achieve promising results for text-to-image generation, they often struggle to capture fine-grained details from lengthy prompts and maintain contextual coherence within prompt sequences. Moreover, they often result in misaligned image generation for prompt sequences featuring multiple objects. To address this, we propose a Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that generates novel images from complex multimodal prompt sequences by leveraging the combined capabilities of large language models (LLMs) and diffusion models. Our MGCC comprises a novel Cross-Modal Refinement module to explicitly learn cross-modal dependencies between the text and image in the LLM embedding space, and a contextual object grounding module to generate object bounding boxes specifically targeting scenes with multiple objects. Our MGCC demonstrates a diverse range of multimodal capabilities, such as novel image generation, the facilitation of multimodal dialogue, and text generation. Experimental evaluations on two benchmark datasets demonstrate the effectiveness of our method. On the Visual Story Generation (VIST) dataset with multimodal inputs, our MGCC achieves a CLIP Similarity score of $0.652$ compared to the SOTA GILL score of $0.641$. Similarly, on the Visual Dialogue Context (VisDial) dataset, which features lengthy dialogue sequences, our MGCC achieves an impressive CLIP score of $0.660$, substantially outperforming the existing SOTA method's score of $0.645$. Code: https://github.com/VIROBO-15/MGCC This paper introduces MGCC, a novel method for generating images from complex multimodal prompt sequences, addressing limitations of existing text-to-image models in capturing fine-grained details and maintaining contextual coherence. Existing methods struggle to generate accurate images from lengthy prompts or sequences, particularly in capturing fine-grained details and maintaining context, especially with multiple objects. MGCC aims to overcome these limitations. MGCC leverages LLMs and diffusion models with two key components: a Cross-Modal Refinement Module to learn cross-modal dependencies in LLM embedding space, and a contextual object grounding module to generate bounding boxes for precise object control in generated images. MGCC achieves state-of-the-art performance on VIST and VisDial datasets, demonstrating its ability to handle lengthy, complex multimodal prompts. The Cross-Modal Refinement Module significantly improves image quality and alignment with prompts by learning cross-modal dependencies. The contextual object grounding module enhances object details and count accuracy in generated images. The model's performance with very short dialogues needs improvement. Future work includes exploring alternative prompting strategies for contextual object grounding to further enhance control and flexibility. multimodal generation, cross-modal learning, in-context learning, object grounding, text-to-image synthesis
2405.18295 Report Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object detection employing RGB-D, based on human intention, such as "I want something to support my back". Closely related, 3D visual grounding focuses on understanding human reference. To achieve detection based on human intention, it relies on humans to observe the scene, reason out the target that aligns with their intention ("pillow" in this case), and finally provide a reference to the AI system, such as "A pillow on the couch". Instead, 3D intention grounding challenges AI agents to automatically observe, reason and detect the desired target solely based on human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also establish several baselines based on different language-based 3D object detection models on our benchmark. Finally, we propose IntentNet, our unique approach, designed to tackle this intention-based detection problem. It focuses on three key aspects: intention understanding, reasoning to identify object candidates, and cascaded adaptive learning that leverages the intrinsic priority logic of different losses for multiple objective optimization. This paper introduces 3D Intention Grounding (3D-IG), a new task for detecting desired objects in 3D scenes using human intention expressed in free-form text, moving beyond traditional 3D visual grounding reliant on specific object references. 3D-IG addresses the limitations of existing 3D object detection methods that rely on explicit object references, aiming to enable AI agents to automatically reason and detect targets based solely on human intention, crucial for real-world scenarios where providing specific instructions might be challenging. The authors create a new dataset, Intent3D, containing 44,990 intention texts linked to 209 object classes from 1,042 ScanNet scenes. They establish baselines using existing language-based 3D object detection methods and propose a novel method, IntentNet, incorporating candidate box matching, verb-object alignment, and cascaded adaptive learning for improved intention understanding and object detection. IntentNet significantly outperforms all baseline methods on the Intent3D benchmark, demonstrating the effectiveness of its components in understanding intention and detecting targets. Existing expert models, primarily designed for referential language, struggle with the nuanced nature of intention language. LLM-based models, while showing potential, currently face challenges in 3D visual grounding and exhibit limitations due to hallucinations and data scarcity. The reliance on GPT-4 for intention text generation introduces potential subjectivity based on its training data and limits the scalability of dataset creation. The current work focuses on single-intention scenarios. Future work could explore grounding multiple intentions within a single scene, increasing task complexity. 3d object detection, intention grounding, visual grounding, 3d vision, language and vision
2405.18172 Report AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, Bingbing Ni While image-based virtual try-on has made significant strides, emerging approaches still fall short of delivering high-fidelity and robust fitting images across various scenarios, as their models suffer from issues of ill-fitted garment styles and quality degradation during the training process, not to mention the lack of support for various combinations of attire. Therefore, we first propose a lightweight, scalable operator known as Hydra Block for attire combinations. This is achieved through a parallel attention mechanism that facilitates the feature injection of multiple garments from conditionally encoded branches into the main network. Secondly, to significantly enhance the model's robustness and expressiveness in real-world scenarios, we evolve its potential across diverse settings by synthesizing the residuals of multiple models, as well as implementing a mask region boost strategy to overcome the instability caused by information leakage in existing models. Equipped with the above design, AnyFit surpasses all baselines on high-resolution benchmarks and real-world data by a large gap, excelling in producing well-fitting garments replete with photorealistic and rich details. Furthermore, AnyFit's impressive performance on high-fidelity virtual try-ons in any scenario from any image paves a new path for future research within the fashion community. AnyFit, a novel image-based virtual try-on method that excels in generating high-fidelity, robust outfit combinations across diverse scenarios. Existing VTON methods fall short in producing realistic and detailed try-on images, especially for multiple garments and complex real-world scenes. AnyFit introduces HydraNet with parallelized attention for multi-garment encoding and employs Prior Model Evolution (merging weights of multiple pre-trained models) and Adaptive Mask Boost (mask augmentation and adaptive elongation) for enhanced robustness. AnyFit significantly outperforms previous state-of-the-art methods on benchmarks like VITON-HD and DressCode, as well as on a challenging proprietary dataset. HydraNet enables accurate and scalable multi-garment try-ons, effectively handling transitions between garments. Prior Model Evolution and Adaptive Mask Boost significantly improve the robustness of the generated try-on images, particularly in complex real-world scenarios. AnyFit may exhibit instability in generating complex hand structures, reflecting limitations of the underlying text-to-image model. Text-based control of try-on style, while showing promise, remains an area for further development. virtual try-on, vton, diffusion models, image generation, multi-condition generation
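A rough sketch of what the "Prior Model Evolution" style of weight merging could look like: summing the residuals of several fine-tuned checkpoints on top of a shared base model. The function name, merge weights, and the decision to skip non-float buffers are illustrative assumptions, not AnyFit's released code.

import copy
import torch

def merge_model_residuals(base_model, finetuned_models, weights):
    """Merge several fine-tuned checkpoints by summing their weight residuals
    relative to a shared base model (illustrative sketch of a prior-evolution
    style merge; not the authors' released code)."""
    assert len(finetuned_models) == len(weights)
    base_state = base_model.state_dict()
    merged_state = copy.deepcopy(base_state)
    for model, w in zip(finetuned_models, weights):
        state = model.state_dict()
        for name, base_param in base_state.items():
            if base_param.dtype.is_floating_point:
                # Add a weighted residual for every floating-point parameter.
                merged_state[name] += w * (state[name] - base_param)
    merged = copy.deepcopy(base_model)
    merged.load_state_dict(merged_state)
    return merged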
2405.18163 Report NegGS: Negative Gaussian Splatting Artur Kasymov, Bartosz Czekaj, Marcin Mazur, Jacek Tabor, Przemysław Spurek One of the key advantages of 3D rendering is its ability to simulate intricate scenes accurately. One of the most widely used methods for this purpose is Gaussian Splatting, a novel approach that is known for its rapid training and inference capabilities. In essence, Gaussian Splatting involves incorporating data about the 3D objects of interest into a series of Gaussian distributions, each of which can then be depicted in 3D in a manner analogous to traditional meshes. It is regrettable that the use of Gaussians in Gaussian Splatting is currently somewhat restrictive due to their perceived linear nature. In practice, 3D objects are often composed of complex curves and highly nonlinear structures. This issue can to some extent be alleviated by employing a multitude of Gaussian components to reflect the complex, nonlinear structures accurately. However, this approach results in a considerable increase in time complexity. This paper introduces the concept of negative Gaussians, which are interpreted as items with negative colors. The rationale behind this approach is based on the density distribution created by dividing the probability density functions (PDFs) of two Gaussians, which we refer to as Diff-Gaussian. Such a distribution can be used to approximate structures such as donut and moon-shaped datasets. Experimental findings indicate that the application of these techniques enhances the modeling of high-frequency elements with rapid color transitions. Additionally, it improves the representation of shadows. To the best of our knowledge, this is the first paper to extend the simple ellipsoid shapes of Gaussian Splatting to more complex nonlinear structures. Introduces Negative Gaussian Splatting (NegGS), using "negative Gaussians" with negative colors to represent complex 3D scenes, enhancing detail and shadow representation in Gaussian Splatting. Addresses limitations of Gaussian Splatting in modeling intricate curves and non-linear structures present in real-world 3D objects, enhancing rendering quality, particularly in scenes with small elements and varying lighting. Extends the Gaussian Splatting algorithm by incorporating negative Gaussians into the optimization process, allowing for more complex shapes by strategically canceling out portions of positive Gaussians. NegGS achieves superior rendering quality compared to existing methods on datasets with complex lighting and small details (e.g., Tanks and Temples). It effectively models high-frequency elements with rapid color and light transitions, as seen in results on synthetic datasets. The method accurately approximates shadows, particularly for smaller elements, thanks to the use of negative Gaussian components. While effective for specific regions with complex details, NegGS yields comparable results to Gaussian Splatting on simpler shapes. The study doesn't directly employ Diff-Gaussian distributions, instead integrating negative Gaussians separately, leaving room for further exploration of direct Diff-Gaussian implementation. 3d rendering, gaussian splatting, negative gaussians, diff-gaussian distribution, shadow rendering
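To make the negative-color idea concrete, here is a minimal NumPy illustration (not the paper's renderer and not the Diff-Gaussian division itself): subtracting a scaled, narrower "negative" Gaussian from a wider positive one yields a donut-like density that no single positive Gaussian can express. Grid size and weights are arbitrary choices.

import numpy as np

def gauss2d(xy, mean, sigma):
    """Isotropic 2D Gaussian density."""
    d2 = np.sum((xy - mean) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / sigma**2) / (2 * np.pi * sigma**2)

# Toy illustration: a wide positive Gaussian minus a narrow negative one
# carves out the center, approximating a donut-shaped density.
grid = np.stack(np.meshgrid(np.linspace(-3, 3, 200),
                            np.linspace(-3, 3, 200)), axis=-1)
positive = gauss2d(grid, mean=np.zeros(2), sigma=1.0)
negative = gauss2d(grid, mean=np.zeros(2), sigma=0.5)
donut = np.clip(positive - 0.8 * negative, 0.0, None)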
2405.18156 Report VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation Qilin Wang, Zhengkai Jiang, Chengming Xu, Jiangning Zhang, Yabiao Wang, Xinyi Zhang, Yun Cao, Weijian Cao, Chengjie Wang, Yanwei Fu Human image animation involves generating a video from a static image by following a specified pose sequence. Current approaches typically adopt a multi-stage pipeline that separately learns appearance and motion, which often leads to appearance degradation and temporal inconsistencies. To address these issues, we propose VividPose, an innovative end-to-end pipeline based on Stable Video Diffusion (SVD) that ensures superior temporal stability. To enhance the retention of human identity, we propose an identity-aware appearance controller that integrates additional facial information without compromising other appearance details such as clothing texture and background. This approach ensures that the generated videos maintain high fidelity to the identity of human subject, preserving key facial features across various poses. To accommodate diverse human body shapes and hand movements, we introduce a geometry-aware pose controller that utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and shape in the generated videos, providing a robust framework capable of handling a wide range of body shapes and dynamic hand movements. Extensive qualitative and quantitative experiments on the UBCFashion and TikTok benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset. Codes and models will be available. VividPose, a novel end-to-end human image animation pipeline based on Stable Video Diffusion (SVD), that enhances temporal consistency and handles diverse body shapes and hand movements. Existing methods often lead to appearance degradation, temporal inconsistencies, and shape misalignment in generated videos. VividPose leverages SVD with an identity-aware appearance controller (integrating facial information for identity retention) and a geometry-aware pose controller (using dense rendering maps from SMPL-X and sparse skeleton maps for accurate pose and shape alignment). VividPose achieves state-of-the-art results in temporal consistency, visual fidelity, and generalization ability on UBCFashion and TikTok benchmarks. The identity-aware appearance controller significantly improves facial identity retention during animation. The geometry-aware pose controller ensures accurate body shape generation and effectively handles complex hand movements. The reliance on pretrained models like SVD and SMPL-X may limit the flexibility in handling novel or highly stylized human appearances. Future work includes exploring more efficient training and inference strategies to enhance the practical applicability of VividPose. human image animation, stable video diffusion, identity-aware appearance control, geometry-aware pose control, smpl-x
2405.18132 Report EG4D: Explicit Generation of 4D Object without Score Distillation Qi Sun, Zhiyang Guo, Ziyu Wan, Jing Nathan Yan, Shengming Yin, Wengang Zhou, Jing Liao, Houqiang Li In recent years, the increasing demand for dynamic 3D assets in design and gaming applications has given rise to powerful generative pipelines capable of synthesizing high-quality 4D objects. Previous methods generally rely on score distillation sampling (SDS) algorithm to infer the unseen views and motion of 4D objects, thus leading to unsatisfactory results with defects like over-saturation and Janus problem. Therefore, inspired by recent progress of video diffusion models, we propose to optimize a 4D representation by explicitly generating multi-view videos from one input image. However, it is far from trivial to handle practical challenges faced by such a pipeline, including dramatic temporal inconsistency, inter-frame geometry and texture diversity, and semantic defects brought by video generation results. To address these issues, we propose EG4D, a novel multi-stage framework that generates high-quality and consistent 4D assets without score distillation. Specifically, collaborative techniques and solutions are developed, including an attention injection strategy to synthesize temporal-consistent multi-view videos, a robust and efficient dynamic reconstruction method based on Gaussian Splatting, and a refinement stage with diffusion prior for semantic restoration. The qualitative results and user preference study demonstrate that our framework outperforms the baselines in generation quality by a considerable margin. Code will be released at \url{https://github.com/jasongzy/EG4D}. This paper proposes EG4D, a novel multi-stage framework that explicitly generates 4D videos from a single image and then reconstructs consistent and high-quality 4D assets without relying on score distillation sampling. Previous 4D generation methods suffer from issues like over-saturation and Janus problem due to their reliance on score distillation sampling. EG4D overcomes these limitations by leveraging the power of video diffusion models for explicit 4D video generation and reconstruction. EG4D employs a three-stage pipeline: 1) View and Dynamic Generation: utilizes Stable Video Diffusion (SVD) and SV3D with an attention injection mechanism to generate temporally consistent multi-view videos. 2) Coarse Reconstruction: optimizes a 4D Gaussian Splatting (4D-GS) representation with color transformation to address texture inconsistencies. 3) Diffusion Refinement: leverages image-to-image diffusion models to enhance semantic details and refine the 4D representation. EG4D generates 4D assets with superior image-4D alignment and more realistic 3D appearance compared to baselines. Quantitative results demonstrate that EG4D achieves the highest CLIP-I score, indicating higher semantic similarity between rendered images and the reference image. User study confirms an overwhelming preference for 4D objects generated by EG4D, highlighting its advantage in overall quality, view consistency, 3D appearance, and motion realism. Limited capability of base image-to-video models and the consistency-motion trade-off in attention injection restrict the generation of high-dynamic motions. Inaccurate camera pose conditioning in the multi-view diffusion model impacts reconstruction quality. Future work can explore advanced video diffusion models and adaptive camera pose techniques. 
4d generation, video diffusion models, gaussian splatting, attention injection, diffusion refinement
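A generic sketch of one common form of attention injection for temporal consistency: letting every frame's self-attention also attend to keys and values taken from a shared reference frame. Whether EG4D uses exactly this variant is an assumption; the helper name and shapes are illustrative.

import torch
import torch.nn.functional as F

def attention_with_reference_injection(q, k, v, k_ref, v_ref):
    """Self-attention where each frame additionally attends to a reference
    frame's keys/values. Shapes: q, k, v are (B, heads, N, d); k_ref, v_ref
    are (B, heads, M, d) features precomputed from the reference frame."""
    k_all = torch.cat([k, k_ref], dim=2)   # append reference keys
    v_all = torch.cat([v, v_ref], dim=2)   # append reference values
    return F.scaled_dot_product_attention(q, k_all, v_all)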
2405.18029 Report Are Image Distributions Indistinguishable to Humans Indistinguishable to Classifiers? Zebin You, Xinyu Zhang, Hanzhong Guo, Jingdong Wang, Chongxuan Li The ultimate goal of generative models is to characterize the data distribution perfectly. For image generation, common metrics of visual quality (e.g., FID), and the truthlikeness of generated images to the human eyes seem to suggest that we are close to achieving it. However, through distribution classification tasks, we find that, in the eyes of classifiers parameterized by neural networks, the strongest diffusion models are still far from this goal. Specifically, classifiers consistently and effortlessly distinguish between real and generated images in various settings. Further, we observe an intriguing discrepancy: classifiers can identify differences between diffusion models with similar performance (e.g., U-ViT-H vs. DiT-XL), but struggle to differentiate between the smallest and largest models in the same family (e.g., EDM2-XS vs. EDM2-XXL), whereas humans exhibit the opposite tendency. As an explanation, our comprehensive empirical study suggests that, unlike humans, classifiers tend to classify images through edge and high-frequency components. We believe that our methodology can serve as a probe to understand how generative models work and inspire further thought on how existing models can be improved and how the abuse of such models can be prevented. This paper investigates the discrepancy between the perceived high quality of images generated by diffusion models and their actual distribution mismatch with real images, as revealed by neural network classifiers. This work is important because it challenges the assumption that low FID scores and visually appealing results equate to accurate distribution learning in generative models. The authors propose "distribution classification tasks" where classifiers are trained to distinguish between real images and those generated by various diffusion models. They analyze classification accuracy across different model architectures, dataset combinations, cropping strategies, and frequency components. Classifiers consistently achieve high accuracy in distinguishing real from generated images across various settings, even with limited training data and when using self-supervised features. Classifiers are more sensitive to inductive biases of different diffusion models than humans, excelling at distinguishing models with similar FID scores but different architectures, while struggling with models within the same family. Classifiers primarily rely on edge information and high-frequency components for classification, maintaining high accuracy even when only a small portion of the image or specific frequency bands are available. Findings are based on specific datasets and model architectures, potentially limiting generalizability. The paper might unintentionally encourage the development of more sophisticated image generation techniques that could be misused for creating harder-to-detect deepfakes. diffusion models, generative models, image generation, distribution classification, frequency analysis
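A minimal sketch of the kind of "distribution classification" probe described above: train a small binary classifier to separate real from generated images. The architecture, optimizer, and hyper-parameters here are illustrative placeholders, not the authors' setup.

import torch
import torch.nn as nn

def train_distribution_classifier(real_images, fake_images, epochs=5, lr=1e-3):
    """Train a tiny CNN to tell real from generated images.
    Both inputs are float tensors of shape (N, 3, H, W)."""
    x = torch.cat([real_images, fake_images], dim=0)
    y = torch.cat([torch.ones(len(real_images)), torch.zeros(len(fake_images))])
    model = nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        perm = torch.randperm(len(x))
        for i in range(0, len(x), 64):
            idx = perm[i:i + 64]
            logits = model(x[idx]).squeeze(1)
            loss = loss_fn(logits, y[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model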
2405.18025 Report Unveiling the Power of Diffusion Features For Personalized Segmentation and Retrieval Dvir Samuel, Rami Ben-Ari, Matan Levy, Nir Darshan, Gal Chechik Personalized retrieval and segmentation aim to locate specific instances within a dataset based on an input image and a short description of the reference instance. While supervised methods are effective, they require extensive labeled data for training. Recently, self-supervised foundation models have been introduced to these tasks showing comparable results to supervised methods. However, a significant flaw in these models is evident: they struggle to locate a desired instance when other instances within the same class are presented. In this paper, we explore text-to-image diffusion models for these tasks. Specifically, we propose a novel approach called PDM for Personalized Features Diffusion Matching, which leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. PDM demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods. We also highlight notable shortcomings in current instance and segmentation datasets and propose new benchmarks for these tasks. This paper presents PDM, a novel zero-shot approach leveraging pre-trained Stable Diffusion features for personalized image retrieval and segmentation. Personalized retrieval and segmentation are important for various applications, but existing methods struggle to differentiate instances within the same class. This work explores the untapped potential of text-to-image diffusion models for these tasks. PDM extracts both appearance and semantic features from a specific layer and block within Stable Diffusion. Appearance similarity is calculated using a dot product between masked reference and target feature maps, while semantic similarity utilizes a score map between the class name token and target semantic features. These similarities are combined for retrieval ranking and segmentation. PDM outperforms state-of-the-art self-supervised and supervised methods on personalized image segmentation benchmarks, demonstrating its ability to accurately segment specific instances. For personalized retrieval, PDM surpasses existing self-supervised and weakly-supervised techniques, achieving comparable results to supervised approaches, even on challenging benchmarks with multiple instances per class. The authors introduce new benchmarks (PerMIR and PerMIS) for personalized retrieval and segmentation with multiple instances from the same object class, addressing limitations in current datasets. PDM relies on image inversion for feature extraction, making its performance dependent on the quality of image reconstruction. Future work can explore optimizing the speed and efficiency of the feature extraction process. personalized image retrieval, personalized image segmentation, text-to-image diffusion models, stable diffusion, zero-shot learning
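A hedged sketch of the appearance-matching step: average the reference diffusion features inside the instance mask and correlate that vector with every location of the target feature map. The specific Stable Diffusion layer/block and the semantic branch are omitted; names and shapes are assumptions.

import torch
import torch.nn.functional as F

def personalized_similarity(ref_feats, ref_mask, tgt_feats):
    """Appearance-similarity map between a masked reference and a target.
    Shapes: ref_feats (C, H, W); ref_mask (H, W) in {0, 1}; tgt_feats (C, H, W)."""
    # Mean feature of the reference instance inside its mask.
    ref_vec = (ref_feats * ref_mask).flatten(1).sum(dim=1) / ref_mask.sum().clamp(min=1)
    ref_vec = F.normalize(ref_vec, dim=0)
    # L2-normalize every spatial location of the target feature map.
    tgt = F.normalize(tgt_feats.flatten(1), dim=0)           # (C, H*W)
    sim = (ref_vec.unsqueeze(1) * tgt).sum(dim=0)            # (H*W,) cosine similarities
    return sim.view(*tgt_feats.shape[1:])                    # (H, W) similarity map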
2405.17991 Report VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so we find that the intermediate activations used to implement backpropagation can be excessively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens up into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset. This paper proposes VeLoRA, a novel memory-efficient training and fine-tuning algorithm for large neural networks, especially LLMs, by compressing intermediate activations using fixed rank-1 projections of sub-tokens. Training and fine-tuning large language models (LLMs) demand significant computational and memory resources, hindering broader accessibility and research. This work addresses this bottleneck by compressing intermediate activations for efficient gradient computation, enabling training with limited memory. VeLoRA divides input tokens into smaller sub-tokens and projects them onto a fixed one-dimensional subspace during the forward pass using a single, cheaply initialized projection vector. During backpropagation, a coarse reconstruction is performed for gradient calculation, significantly reducing the memory footprint. VeLoRA improves performance on VTAB-1k by 1.5 percentage points while lowering memory requirements compared to full fine-tuning and outperforms existing PEFT methods in terms of memory efficiency and/or accuracy. On the GLUE benchmark using RoBERTa-Base, VeLoRA achieves the best overall results with significant memory improvements, outperforming both LoRA and GaLore. VeLoRA demonstrates superior performance compared to QLoRA when fine-tuning LLaMA models on the Alpaca dataset, achieving higher accuracy while further reducing the memory footprint. The current study primarily focuses on Transformer models. Further research is needed to assess VeLoRA's applicability and effectiveness on other deep learning architectures, such as CNNs, RNNs, and SSMs. While VeLoRA significantly reduces the memory footprint, the training time remains a challenge. Future work could explore techniques to further accelerate the training process without compromising accuracy. large language models, memory-efficient training, parameter-efficient fine-tuning (peft), activation compression, gradient sparsification
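A sketch of the core VeLoRA idea for a single linear layer: during the forward pass store only a rank-1, per-sub-token compression of the input, and coarsely reconstruct it in the backward pass to compute the weight gradient. The sub-token size and the choice of fixed unit projection vector v are assumptions, not the paper's exact configuration.

import torch

class VeLoRALinearFn(torch.autograd.Function):
    """Linear layer that saves a rank-1 compression of its input instead of
    the full activation (illustrative sketch of the VeLoRA mechanism)."""

    @staticmethod
    def forward(ctx, x, weight, v):
        # x: (B, N, D), weight: (Dout, D), v: (d_sub,) fixed unit vector.
        # D must be divisible by d_sub.
        d_sub = v.numel()
        coeffs = x.reshape(*x.shape[:-1], -1, d_sub) @ v   # one scalar per sub-token
        ctx.save_for_backward(coeffs, weight, v)
        ctx.x_shape = x.shape
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        # Coarse rank-1 reconstruction of the input from the saved coefficients.
        x_hat = (coeffs.unsqueeze(-1) * v).reshape(ctx.x_shape)
        grad_x = grad_out @ weight
        grad_w = grad_out.reshape(-1, grad_out.shape[-1]).t() @ x_hat.reshape(-1, x_hat.shape[-1])
        return grad_x, grad_w, None

Usage would look like y = VeLoRALinearFn.apply(x, weight, v), with the last dimension of x divisible by v.numel().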
2405.17965 Report AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization Junjie Shentu, Matthew Watson, Noura Al Moubayed With the unprecedented performance being achieved by text-to-image (T2I) diffusion models, T2I customization further empowers users to tailor the diffusion model to new concepts absent in the pre-training dataset, termed subject-driven generation. Moreover, extracting several new concepts from a single image enables the model to learn multiple concepts, and simultaneously reduces the difficulty of training data preparation, making the disentanglement of multiple concepts a new challenge. However, existing models for disentanglement commonly require pre-determined masks or retain background elements. To this end, we propose an attention-guided method, AttenCraft, for multiple concept disentanglement. In particular, our method leverages self-attention and cross-attention maps to create accurate masks for each concept within a single initialization step, omitting any required mask preparation by humans or other models. The created masks are then applied to guide the cross-attention activation of each target concept during training and achieve concept disentanglement. Additionally, we introduce Uniform sampling and Reweighted sampling schemes to alleviate the non-synchronicity of feature acquisition from different concepts, and improve generation quality. Our method outperforms baseline models in terms of image-alignment, and behaves comparably on text-alignment. Finally, we showcase the applicability of AttenCraft to more complicated settings, such as an input image containing three concepts. The project is available at https://github.com/junjie-shentu/AttenCraft. This paper introduces AttenCraft, a novel method for disentangling multiple concepts from a single image in text-to-image customization, enabling subject-driven generation with multiple concepts learned from a single image. Current subject-driven text-to-image models primarily focus on images with a single new concept, neglecting the efficiency offered by extracting multiple concepts from a single image. AttenCraft addresses this limitation, facilitating customization with reduced data preparation demands. AttenCraft utilizes self-attention and cross-attention maps to generate accurate masks for each concept within a single initialization step, eliminating the need for manual labeling or specialized segmentation models. These masks guide cross-attention during training to disentangle concepts. The paper further introduces Uniform and Reweighted sampling schemes to enhance feature learning synchronicity across concepts. AttenCraft achieves superior image-alignment scores compared to baseline models, demonstrating effective concept disentanglement. The method maintains comparable text-alignment scores with other disentangling models, indicating its ability to balance image reconstruction and editability. AttenCraft's applicability extends to more complex scenarios, effectively disentangling up to three concepts from a single input image. The reliance on attention maps for mask creation makes AttenCraft susceptible to feature omission, especially if the pre-trained model struggles to differentiate visually similar concepts. Future work could explore incorporating techniques to refine mask creation, minimizing the risk of feature omission and further enhancing the disentanglement capability. 
text-to-image generation, subject-driven generation, concept disentanglement, attention mechanism, diffusion models
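A simplified illustration of deriving per-concept masks from cross-attention maps: normalize each concept's map, assign each pixel to the concept that attends to it most, and threshold. AttenCraft additionally uses self-attention refinement within its initialization step, which is omitted here; the threshold value is an arbitrary choice.

import torch

def concept_masks_from_attention(cross_attn, threshold=0.3):
    """Derive one binary mask per concept from averaged cross-attention maps.
    cross_attn: (K, H, W) attention of K concept tokens over image locations."""
    attn = cross_attn / cross_attn.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)
    owner = attn.argmax(dim=0)                    # (H, W) winning concept per pixel
    masks = torch.stack([(owner == k) & (attn[k] > threshold)
                         for k in range(cross_attn.shape[0])])
    return masks                                  # (K, H, W) boolean masks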
2405.17958 Report FreeSplat: Generalizable 3D Gaussian Splatting Towards Free-View Synthesis of Indoor Scenes Yunsong Wang, Tianxin Huang, Hanlin Chen, Gim Hee Lee Empowering 3D Gaussian Splatting with generalization ability is appealing. However, existing generalizable 3D Gaussian Splatting methods are largely confined to narrow-range interpolation between stereo images due to their heavy backbones, thus lacking the ability to accurately localize 3D Gaussians and support free-view synthesis across a wide view range. In this paper, we present a novel framework FreeSplat that is capable of reconstructing geometrically consistent 3D scenes from long sequence input towards free-view synthesis. Specifically, we firstly introduce Low-cost Cross-View Aggregation achieved by constructing adaptive cost volumes among nearby views and aggregating features using a multi-scale structure. Subsequently, we present the Pixel-wise Triplet Fusion to eliminate redundancy of 3D Gaussians in overlapping view regions and to aggregate features observed across multiple views. Additionally, we propose a simple but effective free-view training strategy that ensures robust view synthesis across a broader view range regardless of the number of views. Our empirical results demonstrate state-of-the-art novel view synthesis performance in both novel-view rendered color map quality and depth map accuracy across different numbers of input views. We also show that FreeSplat performs inference more efficiently and can effectively reduce redundant Gaussians, offering the possibility of feed-forward large scene reconstruction without depth priors. Presents FreeSplat, a novel framework for generalizable 3D Gaussian splatting that reconstructs geometrically consistent 3D scenes from long image sequences, enabling free view synthesis. Existing generalizable 3D Gaussian splatting methods are limited to narrow-range interpolation between stereo images, lacking the ability to accurately localize 3D Gaussians and support free view synthesis across wide view ranges. Introduces Low-cost Cross-View Aggregation for efficient feature extraction and matching using CNNs and adaptive cost volumes. Employs Pixel-wise Triplet Fusion to eliminate redundant 3D Gaussians and aggregate multi-view features. Proposes a Free-View Training strategy for robust view synthesis across broader view ranges. Achieves state-of-the-art novel view synthesis performance on ScanNet, outperforming previous methods in color image quality and depth map accuracy. Demonstrates efficient inference and significant reduction in redundant Gaussians, enabling large scene reconstruction. Shows superior zero-shot transfer results on Replica for view interpolation and depth estimation. GPU requirements become expensive when inputting extremely long image sequences. Unsupervised depth estimation scheme leads to a gap in 3D reconstruction accuracy compared to methods with 3D supervision or RGB-D input. 3d gaussian splatting, novel view synthesis, free view synthesis, indoor scene reconstruction, unsupervised depth estimation
2405.17933 Report ToonCrafter: Generative Cartoon Interpolation Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, Tien-Tsin Wong We introduce ToonCrafter, a novel approach that transcends traditional correspondence-based cartoon video interpolation, paving the way for generative interpolation. Traditional methods, that implicitly assume linear motion and the absence of complicated phenomena like dis-occlusion, often struggle with the exaggerated non-linear and large motions with occlusion commonly found in cartoons, resulting in implausible or even failed interpolation results. To overcome these limitations, we explore the potential of adapting live-action video priors to better suit cartoon interpolation within a generative framework. ToonCrafter effectively addresses the challenges faced when applying live-action video motion priors to generative cartoon interpolation. First, we design a toon rectification learning strategy that seamlessly adapts live-action video priors to the cartoon domain, resolving the domain gap and content leakage issues. Next, we introduce a dual-reference-based 3D decoder to compensate for lost details due to the highly compressed latent prior spaces, ensuring the preservation of fine details in interpolation results. Finally, we design a flexible sketch encoder that empowers users with interactive control over the interpolation results. Experimental results demonstrate that our proposed method not only produces visually convincing and more natural dynamics, but also effectively handles dis-occlusion. The comparative evaluation demonstrates the notable superiority of our approach over existing competitors. ToonCrafter, a novel generative cartoon interpolation framework that leverages live-action video priors to overcome limitations of traditional correspondence-based methods. Traditional methods struggle with exaggerated, non-linear motions and dis-occlusion common in cartoons, resulting in implausible or inaccurate interpolation. The framework adapts a pre-trained image-conditioned video diffusion model using: (1) toon rectification learning to bridge the domain gap, (2) a dual-reference 3D decoder to enhance detail preservation, and (3) a sketch encoder for user control. Significantly outperforms state-of-the-art cartoon interpolation methods in quantitative and qualitative comparisons. Effectively handles challenging cases with large non-linear motions and dis-occlusions. Allows for user control over interpolation through sparse sketch input. Reliance on a pre-trained video diffusion model limits flexibility. Future work includes exploring higher-resolution generation and more sophisticated user control mechanisms. cartoon animation, video interpolation, generative models, diffusion models, motion synthesis
2405.17927 Report The Evolution of Multimodal Model Architectures Shakti N. Wadekar, Abhishek Chaurasia, Aman Chadha, Eugenio Culurciello This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensive exploration of architectural details and identifies four specific architectural types. The types are distinguished by their respective methodologies for integrating multimodal inputs into the deep neural network model. The first two types (Type A and B) deeply fuse multimodal inputs within the internal layers of the model, whereas the following two types (Type C and D) facilitate early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B utilizes custom-designed layers for modality fusion within the internal layers. On the other hand, Type-C utilizes modality-specific encoders, while Type-D leverages tokenizers to process the modalities at the model's input stage. The identified architecture types aid the monitoring of any-to-any multimodal model development. Notably, Type-C and Type-D are currently favored in the construction of any-to-any multimodal models. Type-C, distinguished by its non-tokenizing multimodal model architecture, is emerging as a viable alternative to Type-D, which utilizes input-tokenizing techniques. To assist in model selection, this work highlights the advantages and disadvantages of each architecture type based on data and compute requirements, architecture complexity, scalability, simplification of adding modalities, training objectives, and any-to-any multimodal generation capability. This paper identifies and characterizes four prevalent multimodal model architectural patterns (Type A, B, C, and D) in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain and aids in model selection for various tasks. The authors conduct a comprehensive exploration of architectural details in existing multimodal models, focusing on their methodologies for integrating multimodal inputs into deep neural networks. They categorize these methods into four distinct types based on the fusion strategy (deep or early) and the specific mechanisms employed. Type-C and Type-D are currently favored in the construction of any-to-any multimodal models. Type-C, distinguished by its non-tokenizing approach, is emerging as a viable alternative to Type-D, which relies on input tokenization. The choice between different types depends on factors like data and compute requirements, architecture complexity, scalability, and any-to-any modality generation capability. The list of models provided, while comprehensive, is not exhaustive. Future work can investigate the potential of State Space Models (SSMs) for any-to-any multimodal tasks. multimodal learning, model architectures, deep fusion, early fusion, any-to-any modality
2405.17913 Report OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision Junjie Wang, Bin Chen, Bin Kang, Yulin Li, YiChi Chen, Weizhi Xian, Huifeng Chang Open-Vocabulary Detection (OVD) aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on known category data tend to assign higher confidence to trained categories and confuse novel categories with background. To resolve this, we propose OV-DQUO, an \textbf{O}pen-\textbf{V}ocabulary DETR with \textbf{D}enoising text \textbf{Q}uery training and open-world \textbf{U}nknown \textbf{O}bjects supervision. Specifically, we introduce a wildcard matching method that enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy that synthesizes additional noisy query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the challenging OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories respectively, without the need for additional training data. Models and code are released at https://github.com/xiaomoguhz/OV-DQUO This paper presents OV-DQUO, an open-vocabulary object detection framework that leverages open-world unknown object supervision and denoising text query training to address the confidence bias issue in detecting novel categories. Existing open-vocabulary detectors, while performing well on known categories, exhibit lower confidence when detecting novel categories, often confusing them with background. This significantly limits their ability to generalize to unseen objects. OV-DQUO uses an open-world detector to generate proposals for potential unknown objects. It then leverages wildcard matching to associate these proposals with general semantic embeddings, enabling the detector to learn from them. Further, it employs a denoising text query training strategy with synthesized noisy data to improve distinguishing novel objects from the background. Lastly, it introduces a region of query interest selection mechanism that combines objectness and region-text similarity for improved proposal selection. OV-DQUO achieves state-of-the-art results on the OV-COCO and OV-LVIS benchmarks, surpassing existing methods by a significant margin. The framework effectively mitigates the confidence bias issue, demonstrating a more balanced confidence distribution between base and novel categories. OV-DQUO shows strong cross-dataset generalization capabilities, as demonstrated by its performance on the Objects365 dataset. The integration of open-world detection and open-vocabulary detection within a unified end-to-end framework remains underexplored and presents an avenue for future work. Further investigation is needed to address the issue of false positive detections arising from similarities between category text embeddings. open-vocabulary detection, open-world detection, confidence bias, wildcard matching, denoising text query training
2405.17891 Report A Refined 3D Gaussian Representation for High-Quality Dynamic Scene Reconstruction Bin Zhang, Bi Zeng, Zexin Peng In recent years, Neural Radiance Fields (NeRF) has revolutionized three-dimensional (3D) reconstruction with its implicit representation. Building upon NeRF, 3D Gaussian Splatting (3D-GS) has departed from the implicit representation of neural networks and instead directly represents scenes as point clouds with Gaussian-shaped distributions. While this shift has notably elevated the rendering quality and speed of radiance fields, it has inevitably led to a significant increase in memory usage. Additionally, effectively rendering dynamic scenes in 3D-GS has emerged as a pressing challenge. To address these concerns, this paper proposes a refined 3D Gaussian representation for high-quality dynamic scene reconstruction. Firstly, we use a deformable multi-layer perceptron (MLP) network to capture the dynamic offset of Gaussian points and express the color features of points through hash encoding and a tiny MLP to reduce storage requirements. Subsequently, we introduce a learnable denoising mask coupled with denoising loss to eliminate noise points from the scene, thereby further compressing the 3D Gaussian model. Finally, motion noise of points is mitigated through static constraints and motion consistency constraints. Experimental results demonstrate that our method surpasses existing approaches in rendering quality and speed, while significantly reducing the memory usage associated with 3D-GS, making it highly suitable for various tasks such as novel view synthesis, and dynamic mapping. This paper introduces a novel dynamic scene rendering framework that leverages a hybrid representation of hash encoding, deformation fields, and 3D Gaussians, along with denoising masks and motion consistency constraints to mitigate noise and improve rendering quality. Accurate and efficient rendering of dynamic scenes is crucial for various applications like AR, VR, and 3D content creation. Existing methods struggle to balance rendering quality, speed, and memory usage, particularly for dynamic scenes. The framework employs deformation fields to model dynamic offsets of Gaussian points, utilizes hash encoding with a tiny MLP for compact color representation, and introduces a learnable denoising mask to filter out noise points. Static and motion consistency constraints are incorporated to ensure accurate learning of dynamic offsets and consistent motion. The method achieves state-of-the-art performance on the NeRF-DS dataset for dynamic scene rendering. It significantly reduces memory usage compared to existing 3D Gaussian Splatting-based methods while maintaining high rendering quality. The framework demonstrates superior performance compared to NeRF-based approaches on synthetic datasets, particularly in preserving structural details and achieving higher PSNR and SSIM values. The combination of hash encoding and a tiny MLP might not fully capture high-frequency color details, potentially leading to less-detailed rendering in certain cases. Inaccuracies in pose estimation within real-world datasets could result in blurring artifacts in rendered images. dynamic scene rendering, 3d gaussian splatting, deformation fields, hash encoding, denoising mask
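A small sketch of what a learnable denoising mask could look like: a per-Gaussian sigmoid mask that scales opacity, a sparsity term that pushes unneeded points toward zero, and a pruning helper. The initialization, loss weighting, and pruning threshold are assumptions, not the paper's exact recipe.

import torch
import torch.nn as nn

class DenoisingMask(nn.Module):
    """Learnable per-Gaussian mask that scales opacity, plus a sparsity loss
    so that low-mask points can be pruned (illustrative sketch)."""

    def __init__(self, num_gaussians):
        super().__init__()
        self.logits = nn.Parameter(torch.full((num_gaussians,), 3.0))  # start near mask = 1

    def forward(self, opacity):
        return opacity * torch.sigmoid(self.logits)

    def sparsity_loss(self):
        # Encourages masks toward zero; add with a small weight to the main loss.
        return torch.sigmoid(self.logits).mean()

    def prune_indices(self, threshold=0.05):
        return torch.nonzero(torch.sigmoid(self.logits) < threshold).squeeze(1)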
2405.17873 Report MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, Yu Wang Diffusion models have achieved significant visual generation quality. However, their significant computational and memory costs pose a challenge for their application on resource-constrained mobile devices or even desktop GPUs. Recent few-step diffusion models reduce the inference time by reducing the denoising steps. However, their memory consumption is still excessive. Post-Training Quantization (PTQ) replaces high bit-width FP representations with low-bit integer values (INT4/8), which is an effective and efficient technique to reduce the memory cost. However, when applied to few-step diffusion models, existing quantization methods face challenges in preserving both the image quality and text alignment. To address this issue, we propose a mixed-precision quantization framework, MixDQ. Firstly, we design a specialized BOS-aware quantization method for highly sensitive text embedding quantization. Then, we conduct metric-decoupled sensitivity analysis to measure the sensitivity of each layer. Finally, we develop an integer-programming-based method to conduct bit-width allocation. While existing quantization methods fall short at W8A8, MixDQ could achieve W8A8 without performance loss, and W4A8 with negligible visual degradation. Compared with FP16, we achieve 3-4x reduction in model size and memory cost, and 1.45x latency speedup. This paper introduces MixDQ, a mixed-precision quantization method for memory-efficient few-step text-to-image diffusion models, addressing limitations of existing methods in preserving visual quality and text alignment. Few-step diffusion models, while fast, have large memory footprints, hindering deployment on memory-constrained devices. Existing quantization methods struggle to maintain quality and alignment in these models, especially in the challenging one-step setting. MixDQ employs three key components: (1) BOS-aware quantization to handle outlier values in text embeddings, (2) Metric-decoupled sensitivity analysis to separately assess impact on content and quality, (3) Integer-programming-based bit-width allocation for optimal mixed-precision configuration. MixDQ achieves W3.66A16 and W4A8 quantization for one-step SDXL-turbo with negligible performance degradation, while baselines struggle at W8A8. It achieves 3-4x reduction in model size and memory, and 1.5x latency speedup compared to FP16 on Nvidia GPUs. Ablation studies demonstrate the effectiveness of each component, with MixDQ outperforming baselines across fidelity (FID), alignment (CLIP Score), and human preference (ImageReward). MixDQ can be further improved by exploring specialized quantization techniques for other sensitive layers. Future work can explore combining MixDQ with advanced quantization techniques like Adaround and quantization-aware training. diffusion models, quantization, text-to-image generation, mixed precision, model compression
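Since the paper formulates bit-width allocation as an integer program, here is a deliberately simplified greedy stand-in under a memory budget, just to show the shape of the problem; the sensitivity metric, bit choices, and budget are placeholders rather than MixDQ's actual solver.

def allocate_bit_widths(sizes, sensitivities, budget_bits, choices=(4, 8)):
    """Greedy stand-in for an integer-programming bit-width allocation:
    start every layer at the lowest bit-width, then spend the remaining
    memory budget upgrading the most sensitive layers first.
    sizes: {layer_name: num_params}; sensitivities: {layer_name: degradation proxy}."""
    low, high = min(choices), max(choices)
    bits = {name: low for name in sizes}
    used = sum(sizes[n] * low for n in sizes)
    # Upgrade layers in order of sensitivity per parameter.
    order = sorted(sizes, key=lambda n: sensitivities[n] / sizes[n], reverse=True)
    for name in order:
        extra = sizes[name] * (high - low)
        if used + extra <= budget_bits:
            bits[name] = high
            used += extra
    return bits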
2405.17871 Report Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing the text tokens that are less correlated with or even contradictory to the input images. In this paper, we advocate for assigning distinct contributions for each text token based on its visual correlation. Specifically, we show that, by contrasting image inputs, the difference in prediction logits on each text token provides strong guidance on visual correlation. We therefore introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens. Our experimental results demonstrate that CAL consistently improves different types of VLMs across different resolutions and model sizes on various benchmark datasets. Importantly, our method incurs minimal additional computational overhead, rendering it highly efficient compared to alternative data scaling strategies. Codes are available at https://github.com/foundation-multimodal-models/CAL. This paper introduces Contrastive Alignment (CAL), a simple yet effective token re-weighting strategy for Vision Language Models (VLMs) that prioritizes training on visually correlated text tokens, leading to enhanced image-text modality alignment. Existing VLMs treat all text tokens equally during alignment, leading to sub-optimal performance due to the presence of visually irrelevant or contradictory tokens in training data. CAL leverages contrastive learning by analyzing the difference in prediction logits of text tokens with and without image inputs. This difference guides the re-weighting process, prioritizing visually correlated tokens during training. CAL consistently improves the performance of various VLMs (LLaVA, MiniGemini) across different model sizes and resolutions. Significant performance gains are observed on various benchmarks, including visual question answering, image captioning, and grounding. CAL effectively mitigates the negative impact of noisy labels in training data, leading to more robust VLM performance. The paper lacks a clear quantitative discrepancy measure between the three kinds of label tokens (visually correlated, irrelevant, contradictory). The selection of lower and upper bounds for clamping in CAL is currently empirical and could be explored further for adaptability. vision language models, image-text alignment, contrastive learning, token re-weighting, multimodal understanding
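A hedged sketch of the re-weighting idea: measure how much the image shifts each label token's log-probability, clamp that difference, and use it to weight a token-level cross-entropy. The clamp bounds and the decision to detach the weights are assumptions, not the released CAL implementation.

import torch
import torch.nn.functional as F

def cal_weighted_lm_loss(logits_with_img, logits_without_img, labels,
                         w_min=0.0, w_max=2.0):
    """Re-weighted language-modeling loss driven by the logit difference
    between image-conditioned and image-free forward passes.
    logits_*: (B, T, V); labels: (B, T) with -100 marking ignored positions."""
    logp_with = F.log_softmax(logits_with_img, dim=-1)
    logp_without = F.log_softmax(logits_without_img, dim=-1)
    valid = labels != -100
    safe_labels = labels.clamp(min=0)
    lp_w = logp_with.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    lp_wo = logp_without.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    weights = (lp_w - lp_wo).clamp(w_min, w_max)          # visual-correlation proxy per token
    token_loss = F.cross_entropy(
        logits_with_img.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels)
    return (weights.detach() * token_loss * valid).sum() / valid.sum().clamp(min=1)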
2405.17825 Report Diffusion Model Patching via Mixture-of-Prompts Seokil Ham, Sangmin Woo, Jin-Young Kim, Hyojun Go, Byeongjun Park, Changick Kim We present Diffusion Model Patching (DMP), a simple method to boost the performance of pre-trained diffusion models that have already reached convergence, with a negligible increase in parameters. DMP inserts a small, learnable set of prompts into the model's input space while keeping the original model frozen. The effectiveness of DMP is not merely due to the addition of parameters but stems from its dynamic gating mechanism, which selects and combines a subset of learnable prompts at every step of the generative process (e.g., reverse denoising steps). This strategy, which we term "mixture-of-prompts", enables the model to draw on the distinct expertise of each prompt, essentially "patching" the model's functionality at every step with minimal yet specialized parameters. Uniquely, DMP enhances the model by further training on the same dataset on which it was originally trained, even in a scenario where significant improvements are typically not expected due to model convergence. Experiments show that DMP significantly enhances the converged FID of DiT-L/2 on FFHQ 256x256 by 10.38%, achieved with only a 1.43% parameter increase and 50K additional training iterations. Presents Diffusion Model Patching (DMP), a method to enhance pre-trained and converged diffusion models by inserting learnable prompts into the input space and dynamically combining them based on noise levels. Addresses the limitations of traditional fine-tuning for converged models and improves performance by introducing stage-specific capabilities. Utilizes learnable prompts added to the input space and a dynamic gating mechanism to select and combine prompts based on noise levels during denoising. DMP significantly improves FID scores on FFHQ, ImageNet, and MS-COCO datasets compared to baselines. Further training a converged DiT-L/2 model with DMP achieves a 10.38% FID gain on FFHQ with minimal parameter increase. Analysis reveals that DMP's success stems from its dynamic gating mechanism, which enables stage-specific prompt utilization. The fixed number of input patches limits the flexibility in the number of prompts. Exploring alternative prompt integration methods while maintaining stable training is a potential future direction. diffusion models, prompt tuning, parameter-efficient fine-tuning, image generation, stage-specificity
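A minimal sketch of a mixture-of-prompts module: a pool of learnable prompt tokens and a small gate conditioned on the diffusion timestep embedding that mixes them before they are prepended to the frozen model's input tokens. Pool size, prompt length, and the gate architecture are illustrative assumptions.

import torch
import torch.nn as nn

class MixtureOfPrompts(nn.Module):
    """Pool of learnable prompts mixed by a timestep-conditioned gate."""

    def __init__(self, pool_size=8, prompt_len=4, dim=768, time_dim=256):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim) * 0.02)
        self.gate = nn.Sequential(nn.Linear(time_dim, 128), nn.SiLU(),
                                  nn.Linear(128, pool_size))

    def forward(self, x_tokens, t_emb):
        # x_tokens: (B, N, dim); t_emb: (B, time_dim) timestep embedding.
        weights = torch.softmax(self.gate(t_emb), dim=-1)            # (B, pool_size)
        mixed = torch.einsum("bp,pld->bld", weights, self.prompts)   # (B, prompt_len, dim)
        return torch.cat([mixed, x_tokens], dim=1)                   # prepend mixed prompts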
2405.17815 Report Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model Haogeng Liu, Quanzeng You, Xiaotian Han, Yongfei Liu, Huaibo Huang, Ran He, Hongxia Yang In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintaining low computational cost. We first reveal the existence of the visual anchors in Vision Transformer and propose a cost-effective search algorithm to extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining, guiding the aggregation of information. Through extensive experimentation, we demonstrate that the proposed method significantly reduces computational costs by nearly two-thirds compared with the baseline, while simultaneously outperforming baseline methods. This highlights the effectiveness and efficiency of AcFormer. This paper introduces Anchor Former (AcFormer), a novel vision-language connector for Multimodal Large Language Models (MLLMs) that leverages visual anchors for efficient and accurate information aggregation. Existing vision-language connectors in MLLMs either suffer from high computational costs due to redundant visual tokens or exhibit decreased accuracy when using learnable queries as aggregators. AcFormer aims to address these limitations by identifying and utilizing more effective information aggregators. The authors analyze visual feature maps and attention maps from pre-trained Vision Transformers to reveal the existence of "visual anchors" crucial for information aggregation. They propose a cost-effective progressive search algorithm to extract these anchors. AcFormer then employs these anchors as Information Aggregators within a cross-attention module to generate a dense visual representation for LLM input. AcFormer achieves comparable or superior performance to baseline models with significantly fewer visual tokens (e.g., 145 or 257 compared to 577 in LLaVA-1.5), resulting in reduced computational cost and increased speed. Ablation studies validate the efficacy of using visual anchors as Information Aggregators compared to pooling, learnable queries, or randomly selected tokens. Experiments on various benchmarks, including those requiring fine-grained visual perception, demonstrate AcFormer's effectiveness across different tasks. The study is limited by computational resources, preventing exploration of larger training datasets and model sizes. Further theoretical analysis is needed to better understand the emergence and properties of visual anchors. multimodal large language models, vision-language connectors, information aggregation, visual anchors, computational efficiency
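A simplified sketch of anchor-based aggregation: pick the patch tokens most attended by the [CLS] token as "visual anchors" and let them query all patch tokens through cross-attention. The paper extracts anchors with a progressive search rather than this single top-k step, so treat the code as an approximation; dimensions are illustrative.

import torch
import torch.nn as nn

class AnchorAggregator(nn.Module):
    """Select top-k attended patch tokens as anchors, then cross-attend."""

    def __init__(self, dim=1024, num_heads=8, num_anchors=64):
        super().__init__()
        self.num_anchors = num_anchors
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens, cls_attention):
        # patch_tokens: (B, N, dim); cls_attention: (B, N) attention of [CLS] over patches.
        idx = cls_attention.topk(self.num_anchors, dim=1).indices          # (B, K)
        anchors = torch.gather(
            patch_tokens, 1,
            idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1]))
        out, _ = self.attn(anchors, patch_tokens, patch_tokens)            # (B, K, dim)
        return out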
2405.17811 Report Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh Xiangjun Gao, Xiaoyu Li, Yiyu Zhuang, Qi Zhang, Wenbo Hu, Chaopeng Zhang, Yao Yao, Ying Shan, Long Quan Neural 3D representations such as Neural Radiance Fields (NeRF), excel at producing photo-realistic rendering results but lack the flexibility for manipulation and editing which is crucial for content creation. Previous works have attempted to address this issue by deforming a NeRF in canonical space or manipulating the radiance field based on an explicit mesh. However, manipulating NeRF is not highly controllable and requires a long training and inference time. With the emergence of 3D Gaussian Splatting (3DGS), extremely high-fidelity novel view synthesis can be achieved using an explicit point-based 3D representation with much faster training and rendering speed. However, there is still a lack of effective means to manipulate 3DGS freely while maintaining rendering quality. In this work, we aim to tackle the challenge of achieving manipulable photo-realistic rendering. We propose to utilize a triangular mesh to manipulate 3DGS directly with self-adaptation. This approach reduces the need to design various algorithms for different types of Gaussian manipulation. By utilizing a triangle shape-aware Gaussian binding and adapting method, we can achieve 3DGS manipulation and preserve high-fidelity rendering after manipulation. Our approach is capable of handling large deformations, local manipulations, and soft body simulations while keeping high-quality rendering. Furthermore, we demonstrate that our method is also effective with inaccurate meshes extracted from 3DGS. Experiments conducted demonstrate the effectiveness of our method and its superiority over baseline approaches. This paper proposes Mani-GS, a novel method for manipulating 3D Gaussian Splatting (3DGS) representations using a triangular mesh as a proxy, enabling photo-realistic rendering of manipulated objects. Manipulating 3D content while preserving rendering quality is crucial for various applications, including content creation, gaming, and VR/AR. Existing NeRF-based editing methods are either inflexible or computationally expensive. This work addresses these limitations by using 3DGS, which offers high fidelity and fast rendering but lacks efficient manipulation methods. Mani-GS first extracts a triangular mesh from 3DGS or a neural surface field. Then, it introduces a triangle shape-aware Gaussian binding strategy, where Gaussians are bound to triangles in a local coordinate system and their attributes are optimized. Finally, mesh manipulation is directly transferred to 3DGS, leading to self-adaptation of Gaussian attributes and achieving manipulable rendering. Mani-GS outperforms previous editing methods (NeRF-Editing, SuGaR) in terms of rendering quality, achieving higher PSNR, SSIM, and lower LPIPS on the NeRF Synthetic dataset. The method supports various manipulations, including large deformations, local manipulations (blending, reposing, elastic deformation), and soft body simulations, all while maintaining high-quality rendering. Mani-GS exhibits robustness to mesh accuracy and can generate plausible results even with inaccurate meshes extracted from 3DGS. Highly non-rigid deformations on the mesh may lead to rendering distortions. Simulating physics on high-resolution meshes is computationally expensive, and the rendering may suffer from boundary inaccuracies if the extracted mesh has significant discrepancies from the ground truth. 
gaussian splatting, 3dgs manipulation, photo-realistic rendering, mesh-based editing, triangle shape-aware binding
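A sketch of triangle-local binding: each Gaussian center is stored as barycentric weights over a triangle plus an offset along the triangle normal, so re-evaluating the mapping after the mesh deforms moves the bound Gaussians with it. Scale and rotation adaptation from the paper are omitted; shapes and names are assumptions.

import torch
import torch.nn.functional as F

def gaussians_to_world(tri_verts, bary, normal_offset):
    """Map triangle-local Gaussian centers to world space.
    tri_verts: (M, 3, 3) vertex positions of the triangle each Gaussian is bound to;
    bary: (M, 3) barycentric weights summing to 1; normal_offset: (M,)."""
    centers = torch.einsum("mk,mkd->md", bary, tri_verts)      # barycentric interpolation
    e1 = tri_verts[:, 1] - tri_verts[:, 0]
    e2 = tri_verts[:, 2] - tri_verts[:, 0]
    normals = F.normalize(torch.cross(e1, e2, dim=1), dim=1)   # per-triangle unit normal
    return centers + normal_offset.unsqueeze(1) * normals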
2405.17790 Report Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification Weizhen He, Yiheng Deng, Yunfeng Yan, Feng Zhu, Yizhou Wang, Lei Bai, Qingsong Xie, Donglian Qi, Wanli Ouyang, Shixiang Tang Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits their applications in the real world. This paper strives to resolve this problem by proposing a novel instruct-ReID task that requires the model to retrieve images according to the given image or language instructions. Instruct-ReID is the first exploration of a general ReID setting, where the existing 6 ReID tasks can be viewed as special cases by assigning different instructions. To facilitate research in this new instruct-ReID task, we propose a large-scale OmniReID++ benchmark equipped with diverse data and comprehensive evaluation methods, e.g., task-specific and task-free evaluation settings. In the task-specific evaluation setting, gallery sets are categorized according to specific ReID tasks. We propose a novel baseline model, IRM, with an adaptive triplet loss to handle various retrieval tasks within a unified framework. For the task-free evaluation setting, where target person images are retrieved from task-agnostic gallery sets, we further propose a new method called IRM++ with novel memory bank-assisted learning. Extensive evaluations of IRM and IRM++ on the OmniReID++ benchmark demonstrate the superiority of our proposed methods, achieving state-of-the-art performance on 10 test sets. The datasets, the model, and the code will be available at https://github.com/hwz-zju/Instruct-ReID This paper proposes a novel Instruct-ReID task, a unified framework for person re-identification that incorporates instructions, encompassing six existing ReID tasks. Current ReID methods focus on specific scenarios, leading to high deployment costs and limited performance. Instruct-ReID allows one model to handle multiple tasks, improving efficiency and leveraging diverse data for better performance. The paper introduces the OmniReID++ benchmark, extending OmniReID with diverse data and evaluation methods. It proposes two models: IRM with an adaptive triplet loss for task-specific evaluation and IRM++ with memory bank contrastive learning for task-free evaluation. IRM achieves state-of-the-art results on 10 datasets across 6 ReID tasks under the task-specific evaluation setting. IRM++ achieves significant improvement over IRM and existing state-of-the-art methods in the task-free evaluation setting. The paper proposes a novel evaluation metric, mAPτ, considering both identity correctness and instruction consistency, providing a more accurate performance evaluation. Domain gaps between synthetic and real datasets in CC-ReID require further investigation. Selecting appropriate thresholds for the mAPτ metric warrants future research. person re-identification, multitask learning, benchmark, instruction-guided retrieval, adaptive triplet loss
2405.17705 Report DC-Gaussian: Improving 3D Gaussian Splatting for Reflective Dash Cam Videos Linhan Wang, Kai Cheng, Shuo Lei, Shengkun Wang, Wei Yin, Chenyang Lei, Xiaoxiao Long, Chang-Tien Lu We present DC-Gaussian, a new method for generating novel views from in-vehicle dash cam videos. While neural rendering techniques have made significant strides in driving scenarios, existing methods are primarily designed for videos collected by autonomous vehicles. However, these videos are limited in both quantity and diversity compared to dash cam videos, which are more widely used across various types of vehicles and capture a broader range of scenarios. Dash cam videos often suffer from severe obstructions such as reflections and occlusions on the windshields, which significantly impede the application of neural rendering techniques. To address this challenge, we develop DC-Gaussian based on the recent real-time neural rendering technique 3D Gaussian Splatting (3DGS). Our approach includes an adaptive image decomposition module to model reflections and occlusions in a unified manner. Additionally, we introduce illumination-aware obstruction modeling to manage reflections and occlusions under varying lighting conditions. Lastly, we employ a geometry-guided Gaussian enhancement strategy to improve rendering details by incorporating additional geometry priors. Experiments on self-captured and public dash cam videos show that our method not only achieves state-of-the-art performance in novel view synthesis, but also accurately reconstructs the captured scenes free of obstructions. This paper introduces DC-Gaussian, a novel method for generating novel views from dash cam videos while removing obstructions like reflections and occlusions. Dash cam videos are abundant and diverse, offering valuable data for autonomous driving applications. However, existing neural rendering techniques struggle with obstructions common in these videos, hindering their use. DC-Gaussian builds upon 3D Gaussian Splatting (3DGS) and incorporates: 1) Adaptive image decomposition to separate background and obstructions (a simplified compositing sketch follows this entry). 2) Illumination-aware Obstruction Modeling (IOM) with a Latent Intensity Modulation (LIM) module to handle varying lighting. 3) Geometry-guided Gaussian Enhancement (G3E) to refine geometry using multi-view stereo. DC-Gaussian outperforms state-of-the-art methods in novel view synthesis on BDD100K and DCVR datasets. The method effectively removes obstructions, producing high-fidelity renderings of both background and obstruction layers. Ablation studies demonstrate the contribution of each proposed module (AD, IOM, LIM, G3E) to the overall performance. Currently limited to single-sequence videos. Future work could explore extending DC-Gaussian to multi-sequence videos for leveraging denser views. novel view synthesis, 3d gaussian splatting, dash cam videos, obstruction removal, illumination-aware modeling
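The decomposition idea can be illustrated with a minimal PyTorch sketch: a per-camera learnable obstruction layer is alpha-composited over the splatted background and optimized against the observed frame. The alpha-over compositing, per-camera parameterization, and sparsity prior are assumptions for illustration and are much simpler than the paper's adaptive decomposition and illumination-aware modules.

```python
import torch

class ObstructionLayer(torch.nn.Module):
    """Per-camera learnable obstruction (e.g. a windshield reflection) composited
    over the 3DGS background rendering; a simplified stand-in for the paper's
    adaptive image decomposition module."""
    def __init__(self, num_cams, height, width):
        super().__init__()
        self.rgb = torch.nn.Parameter(torch.zeros(num_cams, 3, height, width))
        self.alpha_logit = torch.nn.Parameter(torch.full((num_cams, 1, height, width), -4.0))

    def forward(self, background, cam_idx):
        alpha = torch.sigmoid(self.alpha_logit[cam_idx])   # obstruction opacity
        rgb = torch.sigmoid(self.rgb[cam_idx])             # obstruction color
        return (1 - alpha) * background + alpha * rgb, alpha

# Training-style usage with a dummy "splatted" background.
layer = ObstructionLayer(num_cams=2, height=64, width=96)
background = torch.rand(3, 64, 96)   # would come from the Gaussian rasterizer
target = torch.rand(3, 64, 96)       # observed dash cam frame
composite, alpha = layer(background, cam_idx=0)
loss = torch.nn.functional.l1_loss(composite, target) + 0.01 * alpha.mean()  # sparsity prior (assumed)
loss.backward()
```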
2405.17673 Report Fast Samplers for Inverse Problems in Iterative Refinement Models Kushagra Pandey, Ruihan Yang, Stephan Mandt Constructing fast samplers for unconditional diffusion and flow-matching models has received much attention recently; however, existing methods for solving inverse problems, such as super-resolution, inpainting, or deblurring, still require hundreds to thousands of iterative steps to obtain high-quality results. We propose a plug-and-play framework for constructing efficient samplers for inverse problems, requiring only pre-trained diffusion or flow-matching models. We present Conditional Conjugate Integrators, which leverage the specific form of the inverse problem to project the respective conditional diffusion/flow dynamics into a more amenable space for sampling. Our method complements popular posterior approximation methods for solving inverse problems using diffusion/flow models. We evaluate the proposed method's performance on various linear image restoration tasks across multiple datasets, employing diffusion and flow-matching models. Notably, on challenging inverse problems like 4$\times$ super-resolution on the ImageNet dataset, our method can generate high-quality samples in as few as 5 conditional sampling steps and outperforms competing baselines requiring 20-1000 steps. Our code and models will be publicly available at https://github.com/mandt-lab/CI2RM. A plug-and-play framework called Conditional Conjugate Integrators (CCI) for constructing efficient samplers for inverse problems using pre-trained diffusion or flow-matching models. Existing methods for solving inverse problems with diffusion/flow models are slow, requiring hundreds to thousands of iterative steps for high-quality results. CCI accelerates these samplers by an order of magnitude. CCI leverages the structure of linear inverse problems to project conditional diffusion/flow dynamics into a more amenable space for sampling. It separates linear and non-linear components and parameterizes the transformation by analytically solving the linear coefficients. CCI significantly improves sampling efficiency on challenging benchmarks, like super-resolution, inpainting, and Gaussian deblurring. On 4x super-resolution on ImageNet, CCI achieves better sample quality in 5 steps than baselines in 20-1000 steps. The method demonstrates a tradeoff between guidance weight and sample quality, allowing control over artifact generation. The current implementation relies on an Euler solver; performance could be further improved with advanced solvers. A more principled framework for non-linear inverse problems needs to be developed. diffusion models, flow matching, inverse problems, fast sampling, image restoration
2405.17661 Report RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance Jiaojiao Fan, Haotian Xue, Qinsheng Zhang, Yongxin Chen There is a rapidly growing interest in controlling consistency across multiple generated images using diffusion models. Among various methods, recent works have found that simply manipulating attention modules by concatenating features from multiple reference images provides an efficient approach to enhancing consistency without fine-tuning. Despite its popularity and success, few studies have elucidated the underlying mechanisms that contribute to its effectiveness. In this work, we reveal that the popular approach is a linear interpolation of image self-attention and cross-attention between synthesized content and reference features, with a constant rank-1 coefficient. Motivated by this observation, we find that the rank-1 coefficient is not necessary, and relaxing it simplifies the controllable generation mechanism. The resulting algorithm, which we coin RefDrop, allows users to control the influence of reference context in a direct and precise manner. Besides further enhancing consistency in single-subject image generation, our method also enables more interesting applications, such as the consistent generation of multiple subjects, suppressing specific features to encourage more diverse content, and high-quality personalized video generation by boosting temporal consistency. Even compared with state-of-the-art image-prompt-based generators, such as IP-Adapter, RefDrop is competitive in terms of controllability and quality while avoiding the need to train a separate image encoder for feature injection from reference images, making it a versatile plug-and-play solution for any image or video diffusion model. This paper introduces RefDrop, a training-free, plug-and-play method designed to provide flexible control over consistency in image and video generation by modifying the self-attention mechanism in diffusion models. Controllable consistency in image and video generation is crucial for various applications but remains a challenge for foundational generative models. Existing methods are often limited by computational cost, data requirements, or lack of flexibility. The authors reformulate concatenated attention, a popular method for consistent generation, as a linear interpolation scheme. Building upon this, they propose a flexible generalization of that scheme which allows for explicit control over the influence of reference images in attention modules (see the sketch below). RefDrop achieves state-of-the-art results in controlling consistency for single and multi-subject image generation, outperforming baselines like IP-Adapter and BLIPD. The method enables novel applications such as blending features from multiple images and encouraging diversity in generated images by using negative coefficients. In video generation, RefDrop significantly improves temporal consistency and stabilizes personalized video generation, effectively reducing flickering and preserving motion. The model sometimes struggles to accurately reproduce specific objects in consistent image generation. Future work could explore using attention masks for more precise control and extending the method to accept clean reference images as input. diffusion models, image generation, video generation, consistency control, attention mechanisms
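A minimal PyTorch sketch of the interpolation view described above: the output of an attention layer is a blend of ordinary self-attention and attention into reference-image features, weighted by a user-chosen coefficient c. This is an illustrative re-implementation of the stated idea, not the released RefDrop code; the shapes and the single scalar coefficient are assumptions.

```python
import torch
import torch.nn.functional as F

def refdrop_attention(q, k_self, v_self, k_ref, v_ref, c=0.3):
    """Blend plain self-attention with attention to reference-image features.

    c is the reference strength: c=0 recovers ordinary self-attention, and a
    negative c pushes the output away from the reference (more diversity).
    """
    attn_self = F.scaled_dot_product_attention(q, k_self, v_self)
    attn_ref = F.scaled_dot_product_attention(q, k_ref, v_ref)
    return (1.0 - c) * attn_self + c * attn_ref

# Shapes follow (batch, heads, tokens, head_dim), as in SD-style attention layers.
q = torch.randn(1, 8, 4096, 64)
k_s, v_s = torch.randn(1, 8, 4096, 64), torch.randn(1, 8, 4096, 64)
k_r, v_r = torch.randn(1, 8, 4096, 64), torch.randn(1, 8, 4096, 64)
out = refdrop_attention(q, k_s, v_s, k_r, v_r, c=0.4)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```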
2405.17532 Report ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Yunchao Wei Recent text-to-image customization works have been proven successful in generating images of given concepts by fine-tuning the diffusion models on a few examples. However, these methods tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (e.g., the headphone is missing when generating 'a dog wearing a headphone'). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (e.g., a dog wearing a headphone), implying that the compositional ability only disappears after personalization tuning. Inspired by this observation, we present ClassDiffusion, a simple technique that leverages a semantic preservation loss to explicitly regulate the concept space when learning the new concept. Despite its simplicity, this helps avoid semantic drift when fine-tuning on the target concepts. Extensive qualitative and quantitative experiments demonstrate that the use of the semantic preservation loss effectively improves the compositional abilities of the fine-tuned models. In response to the ineffective evaluation of CLIP-T metrics, we introduce the BLIP2-T metric, a more equitable and effective evaluation metric for this particular domain. We also provide an in-depth empirical study and theoretical analysis to better understand the role of the proposed loss. Lastly, we also extend our ClassDiffusion to personalized video generation, demonstrating its flexibility. This paper introduces ClassDiffusion, a technique to improve the compositional ability of personalized text-to-image generation models by using a semantic preservation loss during fine-tuning. Existing personalized text-to-image models often struggle to combine customized concepts with other elements in a prompt due to overfitting during fine-tuning. The paper analyzes the semantic drift in text space and cross-attention strength after fine-tuning. It proposes a semantic preservation loss to minimize the semantic drift of personalized concepts from their superclasses, thus retaining the ability to combine them with other elements (a loss sketch follows this entry). ClassDiffusion effectively recovers the compositional ability of personalized text-to-image models, as demonstrated by qualitative and quantitative experiments. The paper introduces the BLIP2-T score as a more equitable and effective evaluation metric for image-text alignment compared to CLIP-T. ClassDiffusion also demonstrates potential in personalized video generation, showcasing its flexibility. The applicability of ClassDiffusion to human-driven personalized generation, particularly for reconstructing human faces, needs further exploration. Selecting an appropriate center word for objects with combined categories requires experimentation. text-to-image generation, personalized image synthesis, compositional generation, semantic preservation, diffusion models
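A minimal PyTorch sketch of a semantic preservation loss in the spirit of ClassDiffusion: the text-encoder embedding of the personalized prompt is kept close to that of its superclass prompt via cosine similarity, and the term is added to the usual denoising loss. The pooled-embedding inputs, the cosine form, and the weighting are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def semantic_preservation_loss(personalized_emb, superclass_emb):
    """Penalize semantic drift between the prompt containing the learned token
    (e.g. "a photo of <V*> dog") and its superclass prompt ("a photo of dog").

    Both inputs are pooled text-encoder embeddings of shape (batch, dim).
    """
    return (1.0 - F.cosine_similarity(personalized_emb, superclass_emb, dim=-1)).mean()

# During fine-tuning this term would be added to the denoising objective, e.g.
#   total = mse(noise_pred, noise) + lambda_spl * semantic_preservation_loss(...)
personalized = torch.randn(4, 768, requires_grad=True)   # stand-in for encoder output
superclass = torch.randn(4, 768)
print(semantic_preservation_loss(personalized, superclass))
```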
2405.17531 Report Evolutive Rendering Models Fangneng Zhan, Hanxue Liang, Yifan Wang, Michael Niemeyer, Michael Oechsle, Adam Kortylewski, Cengiz Oztireli, Gordon Wetzstein, Christian Theobalt The landscape of computer graphics has undergone significant transformations with the recent advances of differentiable rendering models. These rendering models often rely on heuristic designs that may not fully align with the final rendering objectives. We address this gap by pioneering evolutive rendering models, a methodology where rendering models possess the ability to evolve and adapt dynamically throughout the rendering process. In particular, we present a comprehensive learning framework that enables the optimization of three principal rendering elements, including the gauge transformations, the ray sampling mechanisms, and the primitive organization. Central to this framework is the development of differentiable versions of these rendering elements, allowing for effective gradient backpropagation from the final rendering objectives. A detailed analysis of gradient characteristics is performed to facilitate stable and goal-oriented evolution of these elements. Our extensive experiments demonstrate the large potential of evolutive rendering models for enhancing rendering performance across various domains, including static and dynamic scene representations, generative modeling, and texture mapping. Introduces Evolutive Rendering Models (ERMs) that replace heuristic design choices in rendering models with learnable components optimized for specific rendering objectives. Traditional rendering models rely on fixed, potentially sub-optimal heuristics. ERMs address this by enabling autonomous adaptation throughout the rendering process, leading to improved performance. Introduces differentiable versions of three key rendering elements: gauge transformations, ray sampling, and primitive organization. This allows gradient-based optimization directly from the final rendering objective using a novel relay learning mechanism. Evolutive gauge transformations enhance rendering quality in static, dynamic, and generative modeling. Evolutive ray sampling improves both the efficiency and quality of volumetric rendering. Evolutive primitive organization, particularly in Gaussian Splatting, leads to faster training, reduced memory footprint, and improved visual details. Current work focuses on evolving individual elements; integrating all three remains unexplored. The added learnable components typically result in increased training time. neural rendering, differentiable rendering, gauge transformation, ray sampling, primitive organization
2405.17472 Report FreezeAsGuard: Mitigating Illegal Adaptation of Diffusion Models via Selective Tensor Freezing Kai Huang, Wei Gao Text-to-image diffusion models can be fine-tuned in custom domains to adapt to specific user preferences, but such unconstrained adaptability has also been utilized for illegal purposes, such as forging public figures' portraits and duplicating copyrighted artworks. Most existing work focuses on detecting the illegally generated content, but cannot prevent or mitigate illegal adaptations of diffusion models. Other schemes of model unlearning and reinitialization, similarly, cannot prevent users from relearning the knowledge of illegal model adaptation with custom data. In this paper, we present FreezeAsGuard, a new technique that addresses these limitations and enables irreversible mitigation of illegal adaptations of diffusion models. The basic approach is that the model publisher selectively freezes tensors in pre-trained diffusion models that are critical to illegal model adaptations, to mitigate the fine-tuned model's representation power in illegal domains but minimize the impact on legal model adaptations in other domains. Such tensor freezing can be enforced via APIs provided by the model publisher for fine-tuning, and can motivate user adoption due to its computational savings. Experimental results with datasets in multiple domains show that FreezeAsGuard provides stronger power in mitigating illegal model adaptations for generating fake portraits of public figures, while having minimal impact on model adaptation in other legal domains. The source code is available at: https://github.com/pittisl/FreezeAsGuard/ This paper introduces FreezeAsGuard, a novel technique to irreversibly mitigate illegal adaptations of text-to-image diffusion models (e.g., generating fake portraits) by selectively freezing critical tensors during fine-tuning. Existing methods for mitigating misuse of open-sourced diffusion models, like watermarking and unlearning, are reversible and cannot prevent re-learning illegal knowledge through fine-tuning. FreezeAsGuard uses bilevel optimization to train a binary mask indicating which tensors to freeze (see the sketch below). This mask is optimized to maximize degradation in illegal domains (e.g., specific public figures) while minimizing impact on performance in innocent domains (e.g., logos, clothes). FreezeAsGuard effectively mitigates generating fake portraits, reducing image quality by 14% compared to fully fine-tuned models, making subjects unrecognizable. It minimally impacts legal adaptations, achieving comparable or better image quality in innocent domains than unlearning methods. It offers computational benefits, saving up to 48% GPU memory and 21% time during fine-tuning. The optimal freezing ratio may vary across different diffusion models and illegal domain scales. Future work includes exploring other applications of FreezeAsGuard for various generative models. diffusion models, generative ai, model misuse, illegal content mitigation, tensor freezing
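Applying a publisher-supplied freeze mask is straightforward; the PyTorch sketch below (not the released code) disables gradients for the masked tensors and hands only the remaining parameters to the optimizer. The mask contents and the toy model are placeholders; in FreezeAsGuard the mask itself comes from a bilevel optimization that is not shown here.

```python
import torch

def apply_freeze_mask(model, freeze_mask):
    """Freeze the tensors selected by the publisher's mask before fine-tuning.

    freeze_mask maps parameter names to True (freeze) / False (train); names
    not in the mask default to trainable.
    """
    for name, param in model.named_parameters():
        param.requires_grad = not freeze_mask.get(name, False)
    return [p for p in model.parameters() if p.requires_grad]

# Toy example on a small module; a real use would target a diffusion UNet.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
mask = {"0.weight": True, "0.bias": True}        # freeze the first layer entirely
optimizer = torch.optim.AdamW(apply_freeze_mask(model, mask), lr=1e-4)
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()                                  # only the unfrozen layer updates
```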
2405.17461 Report EMR-Merging: Tuning-Free High-Performance Model Merging Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, Wanli Ouyang The success of pretrain-finetune paradigm brings about the release of numerous model weights. In this case, merging models finetuned on different tasks to enable a single model with multi-task capabilities is gaining increasing attention for its practicability. Existing model merging methods usually suffer from (1) significant performance degradation or (2) requiring tuning by additional data or training. In this paper, we rethink and analyze the existing model merging paradigm. We discover that using a single model's weights can hardly simulate all the models' performance. To tackle this issue, we propose Elect, Mask & Rescale-Merging (EMR-Merging). We first (a) elect a unified model from all the model weights and then (b) generate extremely lightweight task-specific modulators, including masks and rescalers, to align the direction and magnitude between the unified model and each specific model, respectively. EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance. We find that EMR-Merging shows outstanding performance compared to existing merging methods under different classical and newly-established settings, including merging different numbers of vision models (up to 30), NLP models, PEFT models, and multi-modal models. This paper proposes EMR-Merging, a novel, tuning-free model merging method that combines a unified task vector with lightweight, task-specific modulators (masks and rescalers) to improve the performance of merged models. Model merging is important for reducing storage and deployment costs associated with using multiple single-task models. Existing methods suffer from performance degradation or require tuning with additional data or training. EMR-Merging first elects a unified task vector from multiple task-specific vectors, maximizing shared sign and magnitude information. Then, it generates task-specific masks to align direction and rescalers to align magnitude with individual task vectors. EMR-Merging significantly outperforms existing merging methods on various vision, NLP, PEFT, and multi-modal benchmarks. The method achieves performance comparable to traditional multi-task learning (MTL) but without requiring additional data or training. EMR-Merging maintains strong performance even when merging a large number of models (up to 30) on challenging tasks. Requires slightly more memory compared to some existing methods due to storing task-specific modulators. Not directly applicable to models trained from scratch as it relies on the pretrain-finetune paradigm. model merging, multi-task learning, parameter efficiency, tuning-free, vision and language models
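The Elect, Mask & Rescale procedure of EMR-Merging can be sketched over flattened task vectors (finetuned minus pretrained weights), as below. The per-element sign vote, the choice of the strongest sign-agreeing magnitude, and the mean-ratio rescaler follow my reading of the abstract and should be treated as an approximation of the paper's exact rules.

```python
import torch

def emr_merge(task_vectors):
    """Elect a unified task vector, then derive per-task masks and rescalers."""
    stacked = torch.stack(task_vectors)                     # (num_tasks, dim)
    elected_sign = torch.sign(stacked.sum(dim=0))           # per-element sign vote
    agree = torch.sign(stacked) == elected_sign             # who agrees with the vote
    magnitude = (stacked.abs() * agree).max(dim=0).values   # strongest agreeing value
    unified = elected_sign * magnitude

    masks, rescalers = [], []
    for tv in task_vectors:
        mask = (tv * unified > 0).float()                   # direction alignment
        scale = tv.abs().mean() / ((mask * unified).abs().mean() + 1e-8)
        masks.append(mask)
        rescalers.append(scale)                             # magnitude alignment
    return unified, masks, rescalers

def reconstruct(pretrained, unified, mask, rescaler):
    """Approximate one task-specific model from the shared unified vector."""
    return pretrained + rescaler * mask * unified

pre = torch.zeros(10)
tvs = [torch.randn(10) for _ in range(3)]
unified, masks, scales = emr_merge(tvs)
print(reconstruct(pre, unified, masks[0], scales[0]))
```

Only the single unified vector plus the lightweight masks and scalars need to be stored, which is where the memory savings over keeping every finetuned model come from.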
2405.17430 Report Matryoshka Multimodal Models Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single-length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations. This paper presents M3 (Matryoshka Multimodal Models), a novel approach that enhances the efficiency and adaptability of Large Multimodal Models (LMMs) by representing visual content as nested sets of tokens with varying granularities. Current LMMs often struggle with the computational demands of high-resolution images and videos due to their reliance on a fixed and large number of visual tokens. M3 leverages a Matryoshka doll-like structure to encode visual information at multiple levels of detail, enabling flexible control over the number of visual tokens used during inference based on factors like content complexity and efficiency constraints. This is achieved by training the LMM to predict the next token in the text sequence based on a hierarchy of visual token sets derived from CLIP visual features, where coarser token sets are subsets of finer ones (see the sketch below). M3 maintains or improves upon the performance of baseline LMMs while using significantly fewer tokens, especially in scenarios involving dense visual information like document understanding. Analysis of M3's performance across different visual token scales reveals biases in existing vision-language datasets, suggesting that many benchmarks can achieve comparable results with far fewer tokens than currently used. A significant gap exists between the oracle upper bound (i.e., the best possible performance achievable with the fewest tokens) and the model's actual performance at specific scales, highlighting the potential for further optimization. The paper lacks an effective visual token predictor that could dynamically select the optimal token scale for each input, bridging the gap between oracle performance and current results. The study primarily focuses on image and video understanding tasks, leaving exploration of its applicability to other domains like 3D understanding or audio-visual tasks for future work. large multimodal models, token reduction, adaptive representation learning, vision-language reasoning, efficiency optimization
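One simple way to realize nested coarse-to-fine token sets, sketched below in PyTorch, is to average-pool the 24x24 grid of CLIP patch tokens down to 12x12, 6x6, 3x3, and 1x1, giving the 576/144/36/9/1 token counts mentioned above. Average pooling and these particular scales are assumptions for illustration; the paper's exact construction of the nested sets may differ.

```python
import torch
import torch.nn.functional as F

def matryoshka_token_sets(vis_tokens, grid=24, scales=(24, 12, 6, 3, 1)):
    """Build nested coarse-to-fine visual token sets by average-pooling the
    CLIP patch grid (576 = 24x24 tokens for LLaVA-style models).

    Returns a dict mapping token count -> tensor of shape (batch, tokens, dim).
    """
    b, n, d = vis_tokens.shape
    assert n == grid * grid
    feat = vis_tokens.transpose(1, 2).reshape(b, d, grid, grid)   # tokens -> 2D grid
    out = {}
    for s in scales:
        pooled = F.adaptive_avg_pool2d(feat, output_size=s)       # (b, d, s, s)
        out[s * s] = pooled.flatten(2).transpose(1, 2)            # grid -> tokens
    return out

tokens = torch.randn(2, 576, 1024)        # stand-in for CLIP ViT-L/14 features
sets = matryoshka_token_sets(tokens)
print({k: v.shape for k, v in sets.items()})   # 576, 144, 36, 9, 1 tokens
```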
2405.17429 Report GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu 3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales and thus lead to unbalanced allocation of resources. To address this, we propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian represents a flexible region of interest and its semantic features. We aggregate information from images through the attention mechanism and iteratively refine the properties of 3D Gaussians including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which only aggregates the neighboring Gaussians for a certain position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8% - 24.8% of their memory consumption. Code is available at: https://github.com/huang-yh/GaussianFormer. This paper proposes GaussianFormer, a novel approach for 3D semantic occupancy prediction that leverages an object-centric representation based on 3D semantic Gaussians. Existing voxel and BEV-based methods for 3D occupancy prediction suffer from redundancy due to their grid-based nature, leading to inefficient resource allocation. GaussianFormer addresses this by using sparse 3D Gaussians to flexibly represent regions of interest, improving efficiency and capturing fine-grained details. GaussianFormer employs a transformer architecture with self-encoding, image cross-attention, and refinement modules to iteratively learn meaningful 3D Gaussians from multi-view images. An efficient Gaussian-to-voxel splatting module then generates dense 3D occupancy predictions. GaussianFormer achieves comparable performance to state-of-the-art methods on nuScenes and KITTI-360 datasets for multi-view and monocular 3D semantic occupancy prediction. GaussianFormer demonstrates superior efficiency compared to existing methods, reducing memory consumption by 75.2% - 82.2% while maintaining competitive latency. The ablation study validates the effectiveness of individual components in GaussianFormer, including the refinement strategy, sparse convolution, and deep supervision. The performance of GaussianFormer, although comparable, is slightly lower than some state-of-the-art methods, suggesting room for improvement in representation accuracy or hyperparameter tuning. GaussianFormer requires a large number of Gaussians for satisfactory performance, which could be further optimized by exploring alternative strategies to represent empty space. 3d occupancy prediction, 3d gaussian splatting, autonomous driving, object-centric representation, vision-based perception
2405.17421 Report MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, Kostas Daniilidis We introduce 4D Motion Scaffolds (MoSca), a neural information processing system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. To address such a challenging and ill-posed inverse problem, we leverage prior knowledge from foundational vision models, lift the video data to a novel Motion Scaffold (MoSca) representation, which compactly and smoothly encodes the underlying motions / deformations. The scene geometry and appearance are then disentangled from the deformation field, and are encoded by globally fusing the Gaussians anchored onto the MoSca and optimized via Gaussian Splatting. Additionally, camera poses can be seamlessly initialized and refined during the dynamic rendering process, without the need for other pose estimation tools. Experiments demonstrate state-of-the-art performance on dynamic rendering benchmarks. Introduces 4D Motion Scaffolds (MoSca), a system for reconstructing and synthesizing novel views of dynamic scenes from casual monocular videos. Addresses the challenging and ill-posed inverse problem of reconstructing dynamic scenes from limited information in casual videos. Leverages pretrained vision models for initial priors, lifts video data to a compact Motion Scaffold representation encoding deformations, disentangles geometry and appearance, and uses Gaussian Splatting for rendering and optimization. Achieves state-of-the-art performance on dynamic rendering benchmarks like DyCheck. Enables global fusion of observations across the entire video, leading to more complete reconstructions. Offers a COLMAP-free solution for camera pose estimation in dynamic scenes. Relies on the accuracy of 2D foundational models like trackers and depth estimators. Limited to reconstructing visible areas, with future work exploring the use of diffusion models for hallucinating unseen regions. novel view synthesis, dynamic scene reconstruction, motion scaffolds, gaussian splatting, foundation models
2405.17414 Report Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, Gordon Wetzstein Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation process is an important goal moving forward and recent approaches that condition video generation models on camera trajectories make strides towards it. Yet, it remains challenging to generate a video of the same scene from multiple different camera trajectories. Solutions to this multi-video generation problem could enable large-scale 3D scene generation with editable camera trajectories, among other applications. We introduce collaborative video diffusion (CVD) as an important step towards this vision. The CVD framework includes a novel cross-video synchronization module that promotes consistency between corresponding frames of the same video rendered from different camera poses using an epipolar attention mechanism. Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments. Project page: https://collaborativevideodiffusion.github.io/. This paper introduces Collaborative Video Diffusion (CVD), a novel method for generating multiple videos of the same scene from different camera trajectories while ensuring consistency in content and motion. Existing video generation models struggle to maintain consistency when generating multiple videos of the same scene from different viewpoints. CVD addresses this limitation, paving the way for applications like large-scale 3D scene generation with editable camera trajectories. CVD leverages a cross-video synchronization module with epipolar attention to align features across videos. It employs a hybrid training scheme using RealEstate10K (for static scenes and camera poses) and WebVid10M (for dynamic scenes) to overcome the lack of large-scale multi-view dynamic datasets. A collaborative inference algorithm extends the model to generate an arbitrary number of consistent videos. CVD outperforms baselines in generating videos with consistent geometry, as demonstrated by quantitative evaluations using SuperGlue for camera pose estimation. It exhibits superior semantic consistency across videos, as evidenced by CLIP-based metrics for comparing frame content. CVD maintains high fidelity in generated content, achieving competitive FID and KID scores compared to baselines. The performance of CVD is inherently dependent on the capabilities of its base video diffusion models (AnimateDiff and CameraCtrl). Real-time video synthesis is currently not feasible due to the computational demands of diffusion models. video generation, diffusion models, camera control, multi-view consistency, epipolar geometry
2405.17405 Report Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer Ruizhi Shao, Youxin Pang, Zerong Zheng, Jingxiang Sun, Yebin Liu We present a novel approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints. Our framework combines the strengths of U-Nets for accurate condition injection and diffusion transformers for capturing global correlations across viewpoints and time. The core is a cascaded 4D transformer architecture that factorizes attention across views, time, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we curate a multi-dimensional dataset spanning images, videos, multi-view data and 3D/4D scans, along with a multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on GAN or UNet-based diffusion models, which struggle with complex motions and viewpoint changes. Through extensive experiments, we demonstrate our method's ability to synthesize realistic, coherent and free-view human videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation. Our project website is https://human4dit.github.io. This paper introduces Human4DiT, a novel approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints using a 4D diffusion transformer. Generating realistic human videos is crucial for various multimedia applications, including virtual reality, animation, gaming, and movie production. Existing methods struggle with complex motions, viewpoint changes, and spatio-temporal consistency. The framework combines U-Nets for accurate condition injection and a cascaded 4D diffusion transformer for capturing global correlations across viewpoints and time. It utilizes a multi-dimensional dataset and training strategy, along with a spatio-temporally consistent diffusion sampling method during inference. Human4DiT outperforms state-of-the-art methods in generating monocular, multi-view, 3D static, and free-view human videos. The 4D diffusion transformer effectively captures spatio-temporal correlations, resulting in more natural dynamic effects and fewer artifacts. The method demonstrates the ability to generate coherent free-viewpoint videos with varying camera trajectories. Lack of an explicit 4D representation leads to subtle artifacts in free-view videos. The current implementation struggles with generating small structures like fingers and accessories. human video generation, diffusion models, diffusion transformers, view synthesis, 4d content generation
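The factorized 4D attention described for Human4DiT can be illustrated with a small PyTorch module that applies self-attention separately along the view, time, and spatial axes of a (batch, views, frames, patches, channels) token grid; each pass folds the other axes into the batch dimension. The layer sizes, residual connections, and plain nn.MultiheadAttention blocks are placeholders, not the paper's architecture.

```python
import torch
from torch import nn

class Factorized4DAttention(nn.Module):
    """Cascade of view, temporal, and spatial self-attention over a 4D token grid."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def _attend(attn, x, axis):
        # Fold every axis except `axis` and channels into the batch dimension.
        c = x.shape[-1]
        perm = {1: (0, 2, 3, 1, 4), 2: (0, 1, 3, 2, 4), 3: (0, 1, 2, 3, 4)}[axis]
        xp = x.permute(*perm)
        lead = xp.shape[:3]
        seq = xp.reshape(-1, xp.shape[3], c)
        out, _ = attn(seq, seq, seq)
        inv = torch.argsort(torch.tensor(perm)).tolist()
        out = out.reshape(*lead, xp.shape[3], c).permute(*inv)
        return x + out                                   # residual connection

    def forward(self, x):
        x = self._attend(self.view_attn, x, axis=1)      # across camera views
        x = self._attend(self.time_attn, x, axis=2)      # across frames
        x = self._attend(self.space_attn, x, axis=3)     # across spatial tokens
        return x

block = Factorized4DAttention(dim=64)
tokens = torch.randn(1, 4, 8, 16, 64)                    # (B, views, frames, patches, C)
print(block(tokens).shape)
```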
2405.17401 Report RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu We propose Reference-Based Modulation (RB-Modulation), a new plug-and-play solution for training-free personalization of diffusion models. Existing training-free approaches exhibit difficulties in (a) style extraction from reference images in the absence of additional style or content text descriptions, (b) unwanted content leakage from reference style images, and (c) effective composition of style and content. RB-Modulation is built on a novel stochastic optimal controller where a style descriptor encodes the desired attributes through a terminal cost. The resulting drift not only overcomes the difficulties above, but also ensures high fidelity to the reference style and adheres to the given text prompt. We also introduce a cross-attention-based feature aggregation scheme that allows RB-Modulation to decouple content and style from the reference image. With theoretical justification and empirical evidence, our framework demonstrates precise extraction and control of content and style in a training-free manner. Further, our method allows a seamless composition of content and style, which marks a departure from the dependency on external adapters or ControlNets. This paper proposes Reference-Based Modulation (RB-Modulation), a plug-and-play method for training-free personalization of diffusion models, enabling stylization and content-style composition using a single reference image. Current training-free methods struggle with style extraction, content leakage from reference images, and effective composition. RB-Modulation addresses these limitations by modulating the drift field in diffusion models using a novel stochastic optimal control framework. The method leverages a stochastic optimal controller that incorporates a style descriptor in its terminal cost to guide the reverse diffusion process. It also introduces an Attention Feature Aggregation (AFA) module to disentangle content and style within cross-attention layers, ensuring prompt alignment and high fidelity to the reference image. RB-Modulation successfully performs stylization and content-style composition using only a single reference image, outperforming state-of-the-art training-free methods. Human evaluation confirms superior performance in style alignment, prompt alignment, and overall quality compared to alternatives. Theoretical analysis connects optimal control and reverse diffusion dynamics, providing insights into the method's effectiveness. The method's performance might be limited by the quality of the chosen style descriptor and the pre-trained diffusion model. Future work can explore alternative style descriptors and apply the framework to various diffusion models with diverse datasets. diffusion models, image stylization, content-style composition, stochastic optimal control, training-free personalization
2405.17398 Report Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, Hongyang Li World models can foresee the outcomes of different actions, which is of paramount importance for autonomous driving. Nevertheless, existing driving world models still have limitations in generalization to unseen environments, prediction fidelity of critical details, and action controllability for flexible application. In this paper, we present Vista, a generalizable driving world model with high fidelity and versatile controllability. Based on a systematic diagnosis of existing methods, we introduce several key ingredients to address these limitations. To accurately predict real-world dynamics at high resolution, we propose two novel losses to promote the learning of moving instances and structural information. We also devise an effective latent replacement approach to inject historical frames as priors for coherent long-horizon rollouts. For action controllability, we incorporate a versatile set of controls from high-level intentions (command, goal point) to low-level maneuvers (trajectory, angle, and speed) through an efficient learning strategy. After large-scale training, the capabilities of Vista can seamlessly generalize to different scenarios. Extensive experiments on multiple datasets show that Vista outperforms the most advanced general-purpose video generator in over 70% of comparisons and surpasses the best-performing driving world model by 55% in FID and 27% in FVD. Moreover, for the first time, we utilize the capacity of Vista itself to establish a generalizable reward for real-world action evaluation without accessing the ground truth actions. This paper presents Vista, a generalizable driving world model that predicts realistic and continuous futures at high spatiotemporal resolution, featuring versatile action controllability across unseen scenarios and serving as a reward function for action evaluation. Existing driving world models lack sufficient generalization to unseen environments, struggle to predict critical details at high fidelity, and often support limited action control modalities, hindering their practical application in autonomous driving. The model leverages a latent replacement approach to inject dynamic priors, promoting coherent future prediction. Two novel losses, a dynamics enhancement loss and a structure preservation loss, enhance prediction fidelity. Versatile action controllability is achieved through a unified conditioning interface and an efficient learning strategy using both labeled and unlabeled driving datasets. Vista outperforms state-of-the-art driving world models on nuScenes by a significant margin in FID and FVD scores. Human evaluation across diverse datasets confirms its superior visual quality and motion rationality compared to general-purpose video generators. The model demonstrates potential as a generalizable reward function, effectively evaluating actions based on prediction uncertainty. The model's computational efficiency needs improvement for real-world deployment. Further work is needed to maintain prediction quality in long-horizon rollouts and during drastic view shifts. world models, autonomous driving, video generation, action controllability, reward function
2405.17393 Report EASI-Tex: Edge-Aware Mesh Texturing from Single Image Sai Raj Kishore Perla, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang We present a novel approach for single-image mesh texturing, which employs a diffusion model with judicious conditioning to seamlessly transfer an object's texture from a single RGB image to a given 3D mesh object. We do not assume that the two objects belong to the same category, and even if they do, there can be significant discrepancies in their geometry and part proportions. Our method aims to rectify the discrepancies by conditioning a pre-trained Stable Diffusion generator with edges describing the mesh through ControlNet, and features extracted from the input image using IP-Adapter to generate textures that respect the underlying geometry of the mesh and the input texture without any optimization or training. We also introduce Image Inversion, a novel technique to quickly personalize the diffusion model for a single concept using a single image, for cases where the pre-trained IP-Adapter falls short in capturing all the details from the input image faithfully. Experimental results demonstrate the efficiency and effectiveness of our edge-aware single-image mesh texturing approach, coined EASI-Tex, in preserving the details of the input texture on diverse 3D objects, while respecting their geometry. EASI-Tex is a novel, efficient, optimization-free approach for transferring textures from a single RGB image to a 3D mesh, respecting both the input texture and the mesh's geometry. Existing methods struggle to accurately transfer textures from a single image while preserving the 3D model's geometric details and semantic identity. The method leverages a pre-trained Stable Diffusion model with ControlNet for edge conditioning from the mesh and IP-Adapter for conditioning on features extracted from the input texture image. It also introduces "Image Inversion" to personalize the diffusion model for complex textures using a single image. EASI-Tex demonstrates superior preservation of input texture details and better respects the 3D mesh's geometry compared to baselines. It offers control over the degree of texture transfer using a tunable parameter. The method is significantly faster than optimization-based alternatives and doesn't require per-texture fine-tuning like existing personalization-based methods. The input resolution of the CLIP image encoder in IP-Adapter limits the capture of fine texture details. Texture seams may appear due to the iterative texture pasting strategy in the employed mesh texturing technique. 3d mesh texturing, texture transfer, diffusion models, single image, edge-aware
2405.17351 Report DOF-GS: Adjustable Depth-of-Field 3D Gaussian Splatting for Refocusing, Defocus Rendering and Blur Removal Yujie Wang, Praneeth Chakravarthula, Baoquan Chen 3D Gaussian Splatting-based techniques have recently advanced 3D scene reconstruction and novel view synthesis, achieving high-quality real-time rendering. However, these approaches are inherently limited by the underlying pinhole camera assumption in modeling the images and hence only work for All-in-Focus (AiF) sharp image inputs. This severely affects their applicability in real-world scenarios where images often exhibit defocus blur due to the limited depth-of-field (DOF) of imaging devices. Additionally, existing 3D Gaussian Splatting (3DGS) methods also do not support rendering of DOF effects. To address these challenges, we introduce DOF-GS, which allows for rendering adjustable DOF effects, removing defocus blur, as well as refocusing of 3D scenes, all from multi-view images degraded by defocus blur. To this end, we re-imagine the traditional Gaussian Splatting pipeline by employing a finite aperture camera model coupled with explicit, differentiable defocus rendering guided by the Circle-of-Confusion (CoC). The proposed framework provides for dynamic adjustment of DOF effects by changing the aperture and focal distance of the underlying camera model on demand. It also enables rendering varying DOF effects of 3D scenes post-optimization, and generating AiF images from defocused training images. Furthermore, we devise a joint optimization strategy to further enhance details in the reconstructed scenes by jointly optimizing rendered defocused and AiF images. Our experimental results indicate that DOF-GS produces high-quality sharp all-in-focus renderings conditioned on inputs compromised by defocus blur, with the training process incurring only a modest increase in GPU memory consumption. We further demonstrate the applications of the proposed method for adjustable defocus rendering and refocusing of the 3D scene from input images degraded by defocus blur. DOF-GS, a novel 3D Gaussian Splatting framework that handles defocus blur in input images and enables adjustable depth-of-field (DOF) effects in rendered images. Existing 3DGS methods are limited by the pinhole camera model and require all-in-focus inputs, hindering their applicability to real-world blurry images and DOF rendering. DOF-GS employs a finite aperture camera model, CoC-guided DOF rendering (see the sketch below), learnable camera parameters (aperture, focal distance) per view, and a joint optimization strategy leveraging an In-Focus Localization Network (ILN). DOF-GS successfully reconstructs scenes from blurry multi-view images, outperforming existing methods in synthesizing high-quality novel views. The method allows for adjustable DOF effects by manipulating aperture and focal distance parameters during rendering. DOF-GS demonstrates superior GPU memory efficiency compared to methods relying on neural modules for blur simulation. Current implementation relies on pre-estimated camera poses, which can be inaccurate due to blur in inputs. Future work will explore joint optimization of camera poses to further enhance reconstruction quality. 3d gaussian splatting, depth-of-field, defocus blur, novel view synthesis, refocusing
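The thin-lens circle of confusion that drives the defocus rendering can be computed per pixel as in the NumPy sketch below; a DOF-GS-style renderer would use this diameter to scale how strongly each Gaussian's footprint is blurred, and refocusing amounts to changing the focus distance and aperture at render time. The specific parameter values are arbitrary examples.

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, aperture):
    """Thin-lens circle-of-confusion diameter (same length units as the inputs).

    depth: per-pixel scene depth; focus_dist: focal distance of the virtual
    camera; focal_len: lens focal length; aperture: aperture diameter.
    Points at depth == focus_dist receive zero blur.
    """
    depth = np.asarray(depth, dtype=np.float64)
    return aperture * focal_len * np.abs(depth - focus_dist) / (
        depth * (focus_dist - focal_len))

# Example: refocusing just means changing focus_dist (and aperture) at render time.
depths = np.array([0.8, 2.0, 5.0, 20.0])          # meters
print(circle_of_confusion(depths, focus_dist=2.0, focal_len=0.05, aperture=0.025))
```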
2405.17306 Report Controllable Longer Image Animation with Diffusion Models Qiang Wang, Minghua Liu, Junjun Hu, Fan Jiang, Mu Xu Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories, failing to exhibit highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method using motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos and learning moving trajectories and strengths. Current pretrained video generation models are typically limited to producing very short videos, typically less than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise reschedule specifically tailored for image animation tasks, facilitating the creation of videos over 100 frames in length while maintaining consistency in content scenery and motion coordination. Specifically, we decompose the denoise process into two distinct phases: the shaping of scene contours and the refining of motion details. Then we reschedule the noise to control the generated frame sequences maintaining long-distance noise correlation. We conducted extensive experiments with 10 baselines, encompassing both commercial tools and academic methodologies, which demonstrate the superiority of our method. Our project page: https://wangqiang9.github.io/Controllable.github.io/ This paper proposes a novel method for generating controllable and longer image animations using diffusion models, leveraging motion priors derived from optical flow fields to guide the animation process. Existing image animation methods often struggle with precise motion control, especially in open-domain settings, and generating longer videos with consistent content and motion. The proposed method extracts motion fields from training videos and utilizes them as conditional constraints for diffusion models. It employs a refinement model to enhance user-provided sparse trajectories and incorporates global motion strength guidance. Additionally, it introduces a phased inference strategy and shared noise rescheduling for generating longer videos with better consistency. The method achieves superior quantitative results compared to several open-source methods and commercial tools, demonstrating its effectiveness in generating high-quality animations. It allows precise control over the direction, speed, and strength of object motion, enabling realistic and user-intended animations. The proposed longer video generation method effectively maintains temporal consistency and visual coherence, outperforming existing techniques. The current reliance on optical flow for motion description limits the capacity for content constraints. Future work will explore more flexible multi-condition controls, such as incorporating sketch or depth information. image-to-video, diffusion models, controllable generation, image animation, motion priors
2405.17258 Report Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning Runqian Wang, Soumya Ghosh, David Cox, Diego Antognini, Aude Oliva, Rogerio Feris, Leonid Karlinsky Low-rank adapters (LoRA) and their variants are popular parameter-efficient fine-tuning (PEFT) techniques that closely match full model fine-tune performance while requiring only a small number of additional parameters. These additional LoRA parameters are specific to the base model being adapted. When the base model needs to be deprecated and replaced with a new one, all the associated LoRA modules need to be re-trained. Such re-training requires access to the data used to train the LoRA for the original base model. This is especially problematic for commercial cloud applications where the LoRA modules and the base models are hosted by service providers who may not be allowed to host proprietary client task data. To address this challenge, we propose Trans-LoRA, a novel method for lossless, nearly data-free transfer of LoRAs across base models. Our approach relies on synthetic data to transfer LoRA modules. Using large language models, we design a synthetic data generator to approximate the data-generating process of the observed task data subset. Training on the resulting synthetic dataset transfers LoRA modules to new models. We show the effectiveness of our approach using both the Llama and Gemma model families. Our approach achieves lossless (mostly improved) LoRA transfer between models within and across different base model families, and even between different PEFT methods, on a wide variety of tasks. This paper proposes Trans-LoRA, a novel approach for lossless and data-efficient transfer of LoRA modules across different base language models, addressing the challenge of model deprecation in cloud applications. LoRA modules are tied to specific base models, requiring retraining when base models are updated. This is problematic in cloud settings where client data used for LoRA training is often confidential and inaccessible. Trans-LoRA uses a synthetic data generator (guided by a few seed examples) and a discriminator trained on real and synthetic data to create a distillation curriculum for transferring LoRA parameters to new base models. Lossless LoRA transfer is achieved, with transferred LoRAs matching or exceeding source LoRA performance on various tasks and across different LLM families (Llama, Gemma). The method demonstrates positive transfer, often outperforming both the source LoRA and the target base model by combining knowledge from both. Transfer is effective across different PEFT methods (LoRA, DoRA, Prompt Tuning) and remains robust in continuous transfer scenarios (simulating multiple model updates). The approach requires an initial synthetic data generation step, introducing additional computation. In rare cases, insufficient task understanding by the synthesizer may lead to suboptimal transfer, requiring adjustments in seed sample size. lora, peft, transfer learning, knowledge distillation, synthetic data
2405.17251 Report GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models. However, they struggle with noisy depth maps and loss of semantic details when warping an input view to novel viewpoints. In this paper, we propose a novel approach for single-shot novel view synthesis, a semantic-preserving generative warping framework that enables T2I generative models to learn where to warp and where to generate, through augmenting cross-view attention with self-attention. Our approach addresses the limitations of existing methods by conditioning the generative model on source view images and incorporating geometric warping signals. Qualitative and quantitative evaluations demonstrate that our model outperforms existing methods in both in-domain and out-of-domain scenarios. Project page is available at https://GenWarp-NVS.github.io/. This paper introduces GenWarp, a novel view synthesis framework that learns where to warp and where to generate in images, enabling the creation of high-quality novel views from single images. Existing methods for single-shot novel view synthesis struggle with noisy depth maps and loss of semantic details, particularly at large viewpoint changes. GenWarp addresses these limitations by leveraging the generative prior of text-to-image diffusion models and incorporating geometric warping signals. GenWarp uses a two-stream architecture consisting of a semantic preserver network and a diffusion model. It integrates monocular depth estimation (MDE) with warped coordinate embeddings and augments self-attention with cross-view attention to guide the generation process. GenWarp effectively handles noisy depth maps and preserves semantic details from the input view, outperforming existing methods in terms of FID and PSNR. It demonstrates strong generalization capability, effectively synthesizing novel views for in-the-wild images including AI-generated images. The model exhibits robustness to varying camera viewpoints and scene types. GenWarp may struggle with generating novel views from extremely distant viewpoints where depth-based correspondence is not effective. The performance of the model is influenced by the quality of multi-view datasets used for fine-tuning. novel view synthesis, generative models, diffusion models, single-shot, semantic preservation
2405.17187 Report Memorize What Matters: Emergent Scene Decomposition from Multitraverse Yiming Li, Zehong Wang, Yue Wang, Zhiding Yu, Zan Gojcic, Marco Pavone, Chen Feng, Jose M. Alvarez Humans naturally retain memories of permanent elements, while ephemeral moments often slip through the cracks of memory. This selective retention is crucial for robotic perception, localization, and mapping. To endow robots with this capability, we introduce 3D Gaussian Mapping (3DGM), a self-supervised, camera-only offline mapping framework grounded in 3D Gaussian Splatting. 3DGM converts multitraverse RGB videos from the same region into a Gaussian-based environmental map while concurrently performing 2D ephemeral object segmentation. Our key observation is that the environment remains consistent across traversals, while objects frequently change. This allows us to exploit self-supervision from repeated traversals to achieve environment-object decomposition. More specifically, 3DGM formulates multitraverse environmental mapping as a robust differentiable rendering problem, treating pixels of the environment and objects as inliers and outliers, respectively. Using robust feature distillation, feature residuals mining, and robust optimization, 3DGM jointly performs 2D segmentation and 3D mapping without human intervention. We build the Mapverse benchmark, sourced from the Ithaca365 and nuPlan datasets, to evaluate our method in unsupervised 2D segmentation, 3D reconstruction, and neural rendering. Extensive results verify the effectiveness and potential of our method for self-driving and robotics. Presents 3D Gaussian Mapping (3DGM), a self-supervised and camera-only framework for simultaneous 3D environment mapping and 2D unsupervised object segmentation from multi-traversal driving data. Addresses limitations of existing 3D mapping methods that rely on pre-trained segmentation models or LiDAR by exploiting the consistency of environments and transience of objects across multiple traversals. Utilizes Structure from Motion for initialization and leverages a robust differentiable rendering pipeline with feature distillation and residuals mining to jointly optimize 3D environmental Gaussians and 2D ephemerality masks. Achieves comparable unsupervised 2D segmentation performance to supervised methods, outperforming state-of-the-art unsupervised techniques by a significant margin. Demonstrates accurate 3D environment reconstruction from camera-only input, achieving a lower Chamfer Distance compared to a LiDAR-based baseline. Shows promising results in novel view synthesis, effectively rendering environments while excluding transient objects and their shadows. Faces challenges in handling large environmental variations like nighttime and seasonal changes. Segmentation can be affected by motion blur, appearance shifts, and difficulties in segmenting shadows and reflective surfaces. 3d mapping, self-supervised learning, unsupervised segmentation, gaussian splatting, autonomous driving
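A minimal sketch of the inlier/outlier intuition described above: pixels that the environment-only rendering consistently fails to explain are treated as ephemeral objects and downweighted in the photometric loss, and the resulting mask doubles as a 2D ephemerality segmentation. The residual threshold and loss form below are illustrative assumptions, not the paper's robust optimization or feature-residual mining.

```python
# Hedged sketch: environment pixels as inliers, transient-object pixels as outliers.
import torch

def robust_photometric_loss(rendered, observed, tau=0.1):
    """rendered/observed: (B, 3, H, W) images from repeated traversals of a scene."""
    residual = (rendered - observed).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    ephemeral_mask = (residual > tau).float()        # 1 = likely transient object
    inlier_weight = 1.0 - ephemeral_mask
    loss = (inlier_weight * residual).sum() / inlier_weight.sum().clamp(min=1)
    return loss, ephemeral_mask                      # mask serves as 2D segmentation
```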
2405.17176 Report DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models Yuqing Zhang, Yuan Liu, Zhiyu Xie, Lei Yang, Zhongyuan Liu, Mengzhou Yang, Runze Zhang, Qilong Kou, Cheng Lin, Wenping Wang, Xiaogang Jin Existing text-guided appearance generation methods typically distill textures from a 2D diffusion model, which often contains unwanted baked-in shading effects and results in unrealistic rendering effects in downstream applications. Generating Physically Based Rendering (PBR) materials instead of just RGB textures would be a promising solution. However, directly distilling the PBR material parameters from 2D diffusion models still suffers from incorrect material decomposition, such as baked-in shading effects in albedo. We introduce DreamMat, an innovative approach to resolve the aforementioned problem, to generate high-quality PBR materials from text descriptions. We find out that the main reason for the incorrect material distillation is that large-scale 2D diffusion models are only trained to generate final shading colors, resulting in insufficient constraints on material decomposition during distillation. To tackle this problem, we first finetune a new light-aware 2D diffusion model to condition on a given lighting environment and generate the shading results on this specific lighting condition. Then, by applying the same environment lights in the material distillation, DreamMat can generate high-quality PBR materials that are not only consistent with the given geometry but also free from any baked-in shading effects in albedo. Extensive experiments demonstrate that the materials produced through our methods exhibit greater visual appeal to users and achieve significantly superior rendering quality compared to baseline methods, which are preferable for downstream tasks such as game and film production. DreamMat: A novel method for generating high-quality, text-guided PBR materials on untextured 3D meshes. Existing text-to-3D appearance generation methods often produce unrealistic results due to baked-in shading effects in generated textures, limiting their use in rendering pipelines. DreamMat distills a geometry- and light-aware diffusion model, leveraging a hash-grid-based material representation and a classifier score distillation (CSD) loss. This approach ensures consistency with input geometry, text prompts, and lighting conditions. Generates high-quality albedo, roughness, and metallic maps disentangled from lighting. Exhibits superior visual fidelity and text alignment compared to baseline methods. Produces materials compatible with modern graphics engines, enabling realistic renderings under diverse lighting. Limited support for complex materials like transparent or highly reflective surfaces due to the simplified BRDF model. Relatively long distillation time (around 20 minutes) hindering interactive applications. text-guided synthesis, 3d material generation, inverse rendering, diffusion models, pbr materials
2405.17158 Report PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution Yong Liu, Hang Dong, Jinshan Pan, Qingji Dong, Kai Chen, Rongxiang Zhang, Xing Mei, Lean Fu, Fei Wang Diffusion models significantly improve the quality of super-resolved images with their impressive content generation capabilities. However, the huge computational costs limit the applications of these methods. Recent efforts have explored reasonable inference acceleration to reduce the number of sampling steps, but the computational cost remains high as each step is performed on the entire image. This paper introduces PatchScaler, a patch-independent diffusion-based single image super-resolution (SR) method, designed to enhance the efficiency of the inference process. The proposed method is motivated by the observation that not all the image patches within an image need the same sampling steps for reconstructing high-resolution images. Based on this observation, we thus develop a Patch-adaptive Group Sampling (PGS) to divide feature patches into different groups according to the patch-level reconstruction difficulty and dynamically assign an appropriate sampling configuration for each group so that the inference speed can be better accelerated. In addition, to improve the denoising ability at each step of the sampling, we develop a texture prompt to guide the estimations of the diffusion model by retrieving high-quality texture priors from a patch-independent reference texture memory. Experiments show that our PatchScaler achieves favorable performance in both quantitative and qualitative evaluations with fast inference speed. Our code and model are available at https://github.com/yongliuy/PatchScaler. This paper introduces PatchScaler, a patch-independent diffusion-based single image super-resolution method designed for efficient inference. It employs patch-adaptive group sampling to tailor sampling configurations to individual patches based on their reconstruction difficulty. Diffusion models excel at super-resolution but suffer from high computational costs due to numerous sampling steps applied uniformly to the entire image, even if some patches require fewer steps. The method uses a global restoration module to generate a coarse HR image and a confidence map. Patches are grouped by difficulty, and a patch-adaptive group sampling strategy determines an optimal starting point for reverse denoising, reducing steps. A texture prompt enhances detail reconstruction by retrieving similar texture priors. PatchScaler achieves faster inference speeds compared to other diffusion-based SR methods, particularly for high-resolution images. It outperforms state-of-the-art SR methods on perceptual quality metrics like ManIQA, CLIPIQA, and MUSIQ. The proposed texture prompt proves more effective than traditional text prompts for SISR due to better alignment with image content. The model's performance might be limited by training from scratch and the inherent degradation of diffusion models at lower resolutions. Future work includes exploring the application of PatchScaler to other low-level vision tasks like video super-resolution, image deblurring, and HDR. super-resolution, diffusion models, patch-based processing, efficient inference, texture synthesis
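A hedged sketch of the patch-adaptive grouping described above: patches are binned by a predicted confidence score, and easier groups start the reverse diffusion closer to t = 0, so they need fewer denoising steps. The patch size, thresholds, step counts, and the `denoiser` callable are illustrative placeholders rather than the paper's actual configuration.

```python
# Sketch of patch-adaptive group sampling under assumed thresholds/schedules.
import numpy as np

def group_patches_by_difficulty(confidence_map: np.ndarray, patch: int = 64):
    """Average a per-pixel confidence map over patches and bin into 3 groups."""
    H, W = confidence_map.shape
    groups = {"easy": [], "medium": [], "hard": []}
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            c = confidence_map[i:i + patch, j:j + patch].mean()
            if c > 0.8:
                groups["easy"].append((i, j))
            elif c > 0.5:
                groups["medium"].append((i, j))
            else:
                groups["hard"].append((i, j))
    return groups

# Assumed per-group starting timesteps on a 1000-step noise schedule:
START_T = {"easy": 100, "medium": 300, "hard": 600}

def denoise_patch_group(patches, start_t, denoiser, step=100):
    """Run the reverse process only from `start_t` down to 0 for this group."""
    for t in range(start_t, 0, -step):
        patches = denoiser(patches, t)   # placeholder for the diffusion U-Net call
    return patches
```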
2405.17083 Report F-3DGS: Factorized Coordinates and Representations for 3D Gaussian Splatting Xiangyu Sun, Joo Chan Lee, Daniel Rho, Jong Hwan Ko, Usman Ali, Eunbyung Park The neural radiance field (NeRF) has made significant strides in representing 3D scenes and synthesizing novel views. Despite its advancements, the high computational costs of NeRF have posed challenges for its deployment in resource-constrained environments and real-time applications. As an alternative to NeRF-like neural rendering methods, 3D Gaussian Splatting (3DGS) offers rapid rendering speeds while maintaining excellent image quality. However, as it represents objects and scenes using a myriad of Gaussians, it requires substantial storage to achieve high-quality representation. To mitigate the storage overhead, we propose Factorized 3D Gaussian Splatting (F-3DGS), a novel approach that drastically reduces storage requirements while preserving image quality. Inspired by classical matrix and tensor factorization techniques, our method represents and approximates dense clusters of Gaussians with significantly fewer Gaussians through efficient factorization. We aim to efficiently represent dense 3D Gaussians by approximating them with a limited amount of information for each axis and their combinations. This method allows us to encode a substantially large number of Gaussians along with their essential attributes -- such as color, scale, and rotation -- necessary for rendering using a relatively small number of elements. Extensive experimental results demonstrate that F-3DGS achieves a significant reduction in storage costs while maintaining comparable quality in rendered images. This paper proposes Factorized 3D Gaussian Splatting (F-3DGS), a novel approach that significantly reduces the storage requirements of 3D Gaussian Splatting (3DGS) while preserving comparable image quality. 3DGS, while offering fast rendering speeds and excellent image quality for 3D scene representation, often necessitates a large number of Gaussians and their attributes, leading to high storage costs and hindering its practicality in resource-constrained environments. F-3DGS leverages matrix and tensor factorization techniques, inspired by classical and neural rendering factorization methods. It employs a factorized coordinate scheme and decomposes Gaussian attributes (color, scale, rotation, opacity) to efficiently compress the model size. F-3DGS achieves comparable image quality to 3DGS while drastically reducing storage costs, exceeding 90% reduction in some cases. The method maintains fast rendering speeds, making it suitable for real-time applications. Evaluations on synthetic-NeRF, Tanks & Temples, and Mip-NeRF 360 datasets demonstrate the effectiveness of F-3DGS. The current implementation primarily focuses on optimizing F-3DGS for smaller scenes; further research is needed to enhance its applicability to large, unbounded scenes. The initialization scheme, while effective, relies on pre-trained 3DGS models; exploring alternative initialization strategies could be beneficial. 3d gaussian splatting, 3d reconstruction, real-time rendering, tensor factorization, compression
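The factorized-coordinate idea can be illustrated with a small sketch: a few learnable 1D coordinates per axis, combined by Cartesian product, generate a dense set of Gaussian centers, while per-axis attribute factors are multiplied CP-style into per-Gaussian attributes. This follows the general factorization described in the summary; the paper's exact parametrization of color, scale, rotation, and opacity may differ.

```python
# Sketch: ~192 per-axis coordinates expand into ~262k Gaussian centers/attributes.
import torch

Nx, Ny, Nz, C = 64, 64, 64, 3          # per-axis resolutions, attribute channels
x = torch.nn.Parameter(torch.linspace(-1, 1, Nx))
y = torch.nn.Parameter(torch.linspace(-1, 1, Ny))
z = torch.nn.Parameter(torch.linspace(-1, 1, Nz))
fx = torch.nn.Parameter(torch.randn(Nx, C) * 0.01)   # per-axis attribute factors
fy = torch.nn.Parameter(torch.randn(Ny, C) * 0.01)
fz = torch.nn.Parameter(torch.randn(Nz, C) * 0.01)

def expand_factorized():
    """Expand factors into dense Gaussian centers and per-Gaussian attributes."""
    gx, gy, gz = torch.meshgrid(x, y, z, indexing="ij")       # (Nx, Ny, Nz) each
    centers = torch.stack([gx, gy, gz], dim=-1).reshape(-1, 3)
    # attribute of Gaussian (i, j, k) = elementwise product of its axis factors
    attrs = fx[:, None, None, :] * fy[None, :, None, :] * fz[None, None, :, :]
    return centers, attrs.reshape(-1, C)

centers, attrs = expand_factorized()
print(centers.shape, attrs.shape)       # many Gaussians from few stored parameters
```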
2405.17082 Report Ensembling Diffusion Models via Adaptive Feature Aggregation Cong Wang, Kuan Tian, Yonghang Guan, Jun Zhang, Zhiwei Jiang, Fei Shen, Xiao Han, Qing Gu, Wei Yang The success of the text-guided diffusion model has inspired the development and release of numerous powerful diffusion models within the open-source community. These models are typically fine-tuned on various expert datasets, showcasing diverse denoising capabilities. Leveraging multiple high-quality models to produce stronger generation ability is valuable, but has not been extensively studied. Existing methods primarily adopt parameter merging strategies to produce a new static model. However, they overlook the fact that the divergent denoising capabilities of the models may dynamically change across different states, such as when experiencing different prompts, initial noises, denoising steps, and spatial locations. In this paper, we propose a novel ensembling method, Adaptive Feature Aggregation (AFA), which dynamically adjusts the contributions of multiple models at the feature level according to various states (i.e., prompts, initial noises, denoising steps, and spatial locations), thereby keeping the advantages of multiple diffusion models, while suppressing their disadvantages. Specifically, we design a lightweight Spatial-Aware Block-Wise (SABW) feature aggregator that adaptive aggregates the block-wise intermediate features from multiple U-Net denoisers into a unified one. The core idea lies in dynamically producing an individual attention map for each model's features by comprehensively considering various states. It is worth noting that only SABW is trainable with about 50 million parameters, while other models are frozen. Both the quantitative and qualitative experiments demonstrate the effectiveness of our proposed Adaptive Feature Aggregation method. The code is available at https://github.com/tenvence/afa/. This paper presents Adaptive Feature Aggregation (AFA), a novel ensembling method for text-guided diffusion models that dynamically adjusts contributions from multiple models based on various factors like prompts, noises, and denoising steps. Leveraging the diverse strengths of numerous open-source diffusion models, fine-tuned on various datasets, is crucial for achieving better image generation quality and contextual alignment. AFA utilizes a lightweight Spatial-Aware Block-Wise (SABW) feature aggregator to dynamically combine intermediate features from multiple U-Net denoisers based on learned spatial attention maps, considering various states like prompts, noises, and denoising steps. AFA consistently outperforms individual base models and baseline methods in terms of image quality and context alignment. AFA exhibits robust performance even with fewer inference steps, leading to comparable computational efficiency to single model inference. Visualization of attention maps showcases AFA's capability to adaptively leverage different models based on context and timestep. AFA's single inference step can be computationally demanding due to running all base models. Future work includes exploring more efficient aggregator designs and training strategies to further enhance efficiency. image generation, diffusion models, model ensembling, text-to-image synthesis, adaptive feature aggregation
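A minimal sketch of spatially adaptive feature aggregation: a small trainable head predicts a per-pixel attention map over the M frozen denoisers and blends their block-wise features. The layer sizes and the absence of extra conditioning (prompt, noise, timestep) are simplifications relative to the SABW aggregator described above.

```python
# Sketch: blend block features from several frozen U-Nets with learned spatial attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregator(nn.Module):
    def __init__(self, channels: int, num_models: int):
        super().__init__()
        self.score = nn.Conv2d(channels * num_models, num_models, kernel_size=1)

    def forward(self, feats):                # feats: list of M tensors (B, C, H, W)
        stacked = torch.stack(feats, dim=1)  # (B, M, C, H, W)
        logits = self.score(torch.cat(feats, dim=1))          # (B, M, H, W)
        attn = F.softmax(logits, dim=1).unsqueeze(2)          # (B, M, 1, H, W)
        return (attn * stacked).sum(dim=1)                    # (B, C, H, W)

# Usage: fuse block outputs of three frozen denoisers at one resolution level.
agg = FeatureAggregator(channels=320, num_models=3)
feats = [torch.randn(1, 320, 32, 32) for _ in range(3)]
fused = agg(feats)                           # passed on to the next shared block
```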
2405.17069 Report Training-free Editioning of Text-to-Image Models Jinqi Wang, Yunfei Fu, Zhangcan Ding, Bailin Deng, Yu-Kun Lai, Yipeng Qin Inspired by the software industry's practice of offering different editions or versions of a product tailored to specific user groups or use cases, we propose a novel task, namely, training-free editioning, for text-to-image models. Specifically, we aim to create variations of a base text-to-image model without retraining, enabling the model to cater to the diverse needs of different user groups or to offer distinct features and functionalities. To achieve this, we propose that different editions of a given text-to-image model can be formulated as concept subspaces in the latent space of its text encoder (e.g., CLIP). In such a concept subspace, all points satisfy a specific user need (e.g., generating images of a cat lying on the grass/ground/falling leaves). Technically, we apply Principal Component Analysis (PCA) to obtain the desired concept subspaces from representative text embedding that correspond to a specific user need or requirement. Projecting the text embedding of a given prompt into these low-dimensional subspaces enables efficient model editioning without retraining. Intuitively, our proposed editioning paradigm enables a service provider to customize the base model into its "cat edition" (or other editions) that restricts image generation to cats, regardless of the user's prompt (e.g., dogs, people, etc.). This introduces a new dimension for product differentiation, targeted functionality, and pricing strategies, unlocking novel business models for text-to-image generators. Extensive experimental results demonstrate the validity of our approach and its potential to enable a wide range of customized text-to-image model editions across various domains and applications. This paper introduces "training-free editioning" for text-to-image models, enabling customization without retraining by projecting text embeddings into concept subspaces. This approach addresses the challenge of tailoring text-to-image models to specific needs and unlocks new business models for service providers. The method leverages PCA on representative text embeddings to create concept subspaces, each corresponding to a specific domain or attribute, and then projects input prompt embeddings into these subspaces. Concept subspace projection successfully restricts image generation to the desired concept (e.g., a "cat edition" only generates cat images). The method maintains high image quality and diversity, comparable to the base model (Stable Diffusion). Projected embeddings exhibit close proximity to their "replaced" counterparts, indicating successful projection. The current work focuses on a basic linguistic template and a limited word list. Further exploration is needed for complex prompt structures and a wider range of concepts. text-to-image synthesis, model editioning, concept subspaces, clip embeddings, pca
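Since the mechanism is essentially PCA plus projection in the text-embedding space, a compact sketch is easy to give. The random arrays below only stand in for real CLIP text embeddings of concept prompts, and the subspace dimension k is an assumption.

```python
# Sketch of concept-subspace construction (PCA via SVD) and prompt projection.
import numpy as np

def build_concept_subspace(embeddings: np.ndarray, k: int):
    """Return the mean and top-k principal directions of the concept embeddings."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # rows = directions
    return mean, vt[:k]                       # (d,), (k, d)

def project_to_subspace(embedding: np.ndarray, mean: np.ndarray, basis: np.ndarray):
    """Project an arbitrary prompt embedding into the concept subspace."""
    coords = (embedding - mean) @ basis.T     # (k,)
    return mean + coords @ basis              # back in the original d-dim space

# Usage (shapes only; replace random arrays with real CLIP embeddings):
d = 768
concept_embeddings = np.random.randn(500, d)      # e.g. "a cat on the grass", ...
mean, basis = build_concept_subspace(concept_embeddings, k=64)
user_prompt_emb = np.random.randn(d)              # e.g. "a dog on the beach"
edition_emb = project_to_subspace(user_prompt_emb, mean, basis)
# `edition_emb` conditions the diffusion model in place of the raw embedding.
```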
2405.17013 Report MotionLLM: Multimodal Motion-Language Learning with Large Language Models Qi Wu, Yubo Zhao, Yifan Wang, Yu-Wing Tai, Chi-Keung Tang Recent advancements in Multimodal Large Language Models (MM-LLMs) have demonstrated promising potential in terms of generalization and robustness when applied to different modalities. While previous works have already achieved 3D human motion generation using various approaches including language modeling, they mostly use specialized architectures and are restricted to single-human motion generation. Inspired by the success of MM-LLMs, we propose MotionLLM, a simple and general framework that can achieve single-human, multi-human motion generation, and motion captioning by fine-tuning pre-trained LLMs. Specifically, we encode and quantize motions into discrete LLM-understandable tokens, which results in a unified vocabulary consisting of both motion and text tokens. With only 1-3% of the LLMs' parameters trained using adapters, our single-human motion generation achieves comparable results to those of diffusion models and other trained-from-scratch transformer-based models. Additionally, we show that our approach is scalable and flexible, allowing easy extension to multi-human motion generation through autoregressive generation of single-human motions. Project page: https://knoxzhao.github.io/MotionLLM Introduces MotionLLM, a simple and general framework for single/multi-human motion generation and motion captioning by fine-tuning pre-trained LLMs with motion-text unified vocabulary. Addresses limitations of previous methods in handling semantically complex text and adapting to different motion-language tasks. Encodes motions into discrete tokens using VQ-VAE or RVQ-VAE, combines motion tokens with text tokens to form a unified vocabulary for LLM fine-tuning using adapters. Achieves competitive single-human motion generation results compared to diffusion models and other trained-from-scratch models. Outperforms state-of-the-art methods in motion captioning, generating semantically accurate and contextually appropriate descriptions. Demonstrates flexibility by extending to multi-human motion generation through autoregressive generation of single-human motions. Long inference time due to the autoregressive nature of LLMs. Limited performance in multi-human motion generation due to data scarcity and complexity of motion language descriptions. motion generation, motion captioning, multimodal learning, large language models, motion tokenization
2405.16947 Report Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models Qian Wang, Abdelrahman Eldesokey, Mohit Mendiratta, Fangneng Zhan, Adam Kortylewski, Christian Theobalt, Peter Wonka We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models. A growing research direction attempts to employ diffusion models to perform downstream vision tasks by exploiting their deep understanding of image semantics. Yet, the majority of these approaches have focused on image-related tasks like semantic correspondence and segmentation, with less emphasis on video tasks such as VSS. Ideally, diffusion-based image semantic segmentation approaches can be applied to videos in a frame-by-frame manner. However, we find their performance on videos to be subpar due to the absence of any modeling of temporal information inherent in the video data. To this end, we tackle this problem and introduce a framework tailored for VSS based on pre-trained image and video diffusion models. We propose building a scene context model based on the diffusion features, where the model is autoregressively updated to adapt to scene changes. This context model predicts per-frame coarse segmentation maps that are temporally consistent. To refine these maps further, we propose a correspondence-based refinement strategy that aggregates predictions temporally, resulting in more confident predictions. Finally, we introduce a masked modulation approach to upsample the coarse maps to the full resolution at a high quality. Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches significantly on various VSS benchmarks without any training or fine-tuning. Moreover, it rivals supervised VSS approaches on the VSPW dataset despite not being explicitly trained for VSS. This paper introduces the first zero-shot approach for Video Semantic Segmentation (VSS) using pre-trained diffusion models, enhancing temporal consistency in video segmentation. Existing diffusion-based image segmentation methods, when applied frame-by-frame to videos, lack temporal consistency due to the absence of temporal information modeling. This work addresses this gap by introducing a framework specifically designed for VSS. The approach constructs a scene context model using diffusion features, which autoregressively updates to accommodate scene changes. It then employs a correspondence-based refinement strategy for temporal and spatial consistency. Finally, a masked modulation process generates full-resolution segmentation maps. The method significantly outperforms existing zero-shot image semantic segmentation approaches on VSS benchmarks like VSPW, CityScapes, and Camvid. It achieves comparable performance to supervised VSS approaches on the VSPW dataset despite not being explicitly trained for VSS. The study finds that features from Stable Diffusion (SD) currently produce better results than Stable Video Diffusion (SVD), potentially due to the smaller training dataset size for SVD. The approach's performance is dependent on the quality of image inversion and VAE encoding, which can discard fine details. The method is instance-agnostic, grouping objects of the same class into a single cluster. Future work could explore Video Instance or Panoptic Segmentation. video semantic segmentation, diffusion models, zero-shot learning, temporal consistency, scene context modeling
2405.16923 Report SA-GS: Semantic-Aware Gaussian Splatting for Large Scene Reconstruction with Geometry Constrain Butian Xiong, Xiaoyu Ye, Tze Ho Elden Tse, Kai Han, Shuguang Cui, Zhen Li With the emergence of Gaussian Splats, recent efforts have focused on large-scale scene geometric reconstruction. However, most of these efforts either concentrate on memory reduction or spatial space division, neglecting information in the semantic space. In this paper, we propose a novel method, named SA-GS, for fine-grained 3D geometry reconstruction using semantic-aware 3D Gaussian Splats. Specifically, we leverage prior information stored in large vision models such as SAM and DINO to generate semantic masks. We then introduce a geometric complexity measurement function to serve as soft regularization, guiding the shape of each Gaussian Splat within specific semantic areas. Additionally, we present a method that estimates the expected number of Gaussian Splats in different semantic areas, effectively providing a lower bound for Gaussian Splats in these areas. Subsequently, we extract the point cloud using a novel probability density-based extraction method, transforming Gaussian Splats into a point cloud crucial for downstream tasks. Our method also offers the potential for detailed semantic inquiries while maintaining high image-based reconstruction results. We provide extensive experiments on publicly available large-scale scene reconstruction datasets with highly accurate point clouds as ground truth and our novel dataset. Our results demonstrate the superiority of our method over current state-of-the-art Gaussian Splats reconstruction methods by a significant margin in terms of geometric-based measurement metrics. Code and additional results will soon be available on our project page. Introduces SA-GS, a novel method for fine-grained 3D geometry reconstruction using semantic-aware 3D Gaussian Splats. Addresses limitations of existing 3D Gaussian Splatting (3DGS) methods that struggle with unrealistic geometric reconstruction, particularly in scenes with complex lighting. Leverages semantic information from large vision models (e.g., SAM, DINO) to guide the shape and opacity of Gaussian Splats, effectively controlling geometric complexity and mitigating unrealistic surface generation. Significantly improves geometric reconstruction accuracy compared to state-of-the-art methods like SuGaR and 2D Gaussian Splats. Effectively reduces memory consumption during training by dynamically adjusting the number of Gaussian Splats based on semantic and geometric complexity. Provides a hierarchical probability density sampling strategy for extracting detailed point clouds while mitigating the 'fantasy surface' problem. Current implementation doesn't explicitly handle occlusion between Gaussian Splats during training. Reliance on user-provided semantic information can be a limitation. 3d reconstruction, gaussian splatting, semantic segmentation, point cloud extraction, large-scale scene reconstruction
2405.16915 Report Multilingual Diversity Improves Vision-Language Representations Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna Massive web-crawled image-text datasets lay the foundation for recent progress in multimodal learning. These datasets are designed with the goal of training a model to do well on standard computer vision benchmarks, many of which, however, have been shown to be English-centric (e.g., ImageNet). Consequently, existing data curation techniques gravitate towards using predominantly English image-text pairs and discard many potentially useful non-English samples. Our work questions this practice. Multilingual data is inherently enriching not only because it provides a gateway to learn about culturally salient concepts, but also because it depicts common concepts differently from monolingual data. We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks. By translating all multilingual image-text pairs from a raw web crawl to English and re-filtering them, we increase the prevalence of (translated) multilingual data in the resulting training set. Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet, ImageNet distribution shifts, image-English-text retrieval and on average across 38 tasks from the DataComp benchmark. On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa. In addition, we quantitatively show that English and non-English data are significantly different in both image and (translated) text space. We hope that our findings motivate future work to be more intentional about including multicultural and multilingual data, not just when non-English or geographically diverse tasks are involved, but to enhance model capabilities at large. This paper investigates whether incorporating multilingual data during pre-training can improve the performance of vision-language models on English vision tasks. Existing vision-language datasets and models often exhibit a monolingual bias, limiting their ability to learn culturally diverse concepts and generalize to non-English tasks. This work explores the potential benefits of leveraging the diversity present in multilingual data to improve model capabilities on a broader range of tasks. The authors translate a large web-crawled image-text dataset (DataComp) to English, re-filter it based on image-text alignment, and train a CLIP model on this translated multilingual data. They compare the performance of this model to models trained on English-only or English-dominated datasets on a range of English vision tasks. Training on translated multilingual data outperforms training on English-only or English-dominated datasets on various English vision tasks, including ImageNet, ImageNet distribution shifts, and image-English-text retrieval. On the geographically diverse GeoDE task, training on translated multilingual data significantly improves accuracy across all regions, particularly in Africa. Analysis of the image and text distributions reveals significant differences between English and translated non-English data, indicating that they capture distinct and complementary information. The study primarily focuses on data filtering based on image-text cosine similarity, and it remains unclear whether the observed benefits hold for other filtering methods. Translation may introduce artifacts and potentially reduce the richness of the original language. Future work can explore alternative approaches to effectively leverage multilingual data without relying solely on translation. multilingual vision-language models, data diversity, cross-lingual transfer learning, vision-language pre-training, data curation
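The translate-then-refilter recipe reduces to a simple loop, sketched below under assumptions: `translate_to_english`, `embed_image`, and `embed_text` are placeholders for a translation model and a CLIP encoder, and the similarity threshold is illustrative rather than the actual DataComp filtering value.

```python
# Sketch: translate captions to English, then keep pairs with high CLIP similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def refilter(pairs, translate_to_english, embed_image, embed_text, thresh=0.28):
    """Return the subset of (image, caption, lang) pairs that pass the filter."""
    kept = []
    for image, caption, lang in pairs:
        text_en = caption if lang == "en" else translate_to_english(caption)
        if cosine(embed_image(image), embed_text(text_en)) >= thresh:
            kept.append((image, text_en))
    return kept
```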
2405.16895 Report Anonymization Prompt Learning for Facial Privacy-Preserving Text-to-Image Generation Liang Shi, Jie Zhang, Shiguang Shan Text-to-image diffusion models, such as Stable Diffusion, generate highly realistic images from text descriptions. However, the generation of certain content at such high quality raises concerns. A prominent issue is the accurate depiction of identifiable facial images, which could lead to malicious deepfake generation and privacy violations. In this paper, we propose Anonymization Prompt Learning (APL) to address this problem. Specifically, we train a learnable prompt prefix for text-to-image diffusion models, which forces the model to generate anonymized facial identities, even when prompted to produce images of specific individuals. Extensive quantitative and qualitative experiments demonstrate the successful anonymization performance of APL, which anonymizes any specific individuals without compromising the quality of non-identity-specific image generation. Furthermore, we reveal the plug-and-play property of the learned prompt prefix, enabling its effective application across different pretrained text-to-image models for transferrable privacy and security protection against the risks of deepfakes. This paper introduces Anonymization Prompt Learning (APL), a method to prevent text-to-image diffusion models from generating identifiable facial images of specific individuals, thereby mitigating deepfake risks and privacy concerns. The ability of text-to-image models to create realistic images of identifiable faces raises serious ethical concerns about malicious deepfake generation and privacy violations. APL trains a learnable prompt prefix (Anonymization Prompt) that, when prepended to any input prompt, forces the model to generate anonymized facial images if the prompt specifies an identity, while maintaining image quality and text fidelity for other prompts. APL significantly reduces the accuracy of generated identities, effectively anonymizing faces even for individuals not seen during training. The learned Anonymization Prompt exhibits transferability, demonstrating effectiveness across different pretrained text-to-image models. APL preserves the overall quality of generated images and their alignment with text prompts, ensuring minimal impact on the model's general image generation capabilities. The reliance on ChatGPT for generating attribute descriptions may introduce inaccuracies in training data. Further research can explore expanding APL to anonymize other sensitive attributes beyond facial features. text-to-image generation, diffusion models, deepfakes, privacy protection, prompt learning
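A minimal sketch of a learnable prompt prefix, assuming it is prepended at the text-embedding level that conditions a frozen diffusion model; the prefix length, embedding width, and training objective are illustrative, and only the prefix parameters would receive gradients.

```python
# Sketch: trainable prefix tokens prepended to frozen text-encoder embeddings.
import torch
import torch.nn as nn

class AnonymizationPrefix(nn.Module):
    def __init__(self, num_tokens: int = 16, dim: int = 768):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (B, T, dim) from the frozen text encoder
        B = prompt_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([prefix, prompt_embeds], dim=1)      # (B, P + T, dim)

# Only `prefix` is trained; the text encoder and U-Net stay frozen, and the loss
# would push identity-specific prompts toward anonymized face generations.
prefix_module = AnonymizationPrefix()
conditioned = prefix_module(torch.randn(2, 77, 768))
```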
2405.16888 Report Part123: Part-aware 3D Reconstruction from a Single-view Image Anran Liu, Cheng Lin, Yuan Liu, Xiaoxiao Long, Zhiyang Dou, Hao-Xiang Guo, Ping Luo, Wenping Wang Recently, the emergence of diffusion models has opened up new opportunities for single-view reconstruction. However, all the existing methods represent the target object as a closed mesh devoid of any structural information, thus neglecting the part-based structure, which is crucial for many downstream applications, of the reconstructed shape. Moreover, the generated meshes usually suffer from large noises, unsmooth surfaces, and blurry textures, making it challenging to obtain satisfactory part segments using 3D segmentation techniques. In this paper, we present Part123, a novel framework for part-aware 3D reconstruction from a single-view image. We first use diffusion models to generate multiview-consistent images from a given image, and then leverage Segment Anything Model (SAM), which demonstrates powerful generalization ability on arbitrary objects, to generate multiview segmentation masks. To effectively incorporate 2D part-based information into 3D reconstruction and handle inconsistency, we introduce contrastive learning into a neural rendering framework to learn a part-aware feature space based on the multiview segmentation masks. A clustering-based algorithm is also developed to automatically derive 3D part segmentation results from the reconstructed models. Experiments show that our method can generate 3D models with high-quality segmented parts on various objects. Compared to existing unstructured reconstruction methods, the part-aware 3D models from our method benefit some important applications, including feature-preserving reconstruction, primitive fitting, and 3D shape editing. This paper presents Part123, a novel framework for reconstructing a part-aware 3D model from a single-view image. Part-based 3D models are crucial for many real-world applications, but existing single-view reconstruction methods neglect the part-based structure. Part123 first generates multiview images using diffusion models and predicts their 2D segmentation masks with SAM. Then it uses contrastive learning in a neural rendering framework to learn part-aware features based on multiview masks. Finally, an automatic clustering-based algorithm is used to extract 3D part segmentation results. Part123 can generate high-quality 3D models with meaningful part segments on various objects. The part-aware models from Part123 benefit applications such as feature-preserving reconstruction, primitive fitting, and shape editing. The method shows robustness to different numbers of multiview images and different generative models. The accuracy of part segmentation relies on the quality of multiview images and 2D segmentation. The method currently only focuses on single objects without considering complex scenes. 3d reconstruction, part segmentation, diffusion models, contrastive learning, neural rendering
2405.16852 Report EM Distillation for One-step Diffusion Models Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, Ruiqi Gao While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilizes the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models. This paper presents EM Distillation (EMD), a new method for distilling diffusion models into efficient one-step generators while maintaining high perceptual quality. Diffusion models excel at learning complex distributions but suffer from slow sampling speeds. EMD addresses this by enabling fast, one-step generation with minimal quality loss. EMD leverages an Expectation-Maximization (EM)-like framework. It introduces a novel reparametrized sampling scheme and a noise cancellation technique to stabilize and accelerate the distillation process. EMD achieves state-of-the-art FID scores on one-step image generation for ImageNet 64x64 and 128x128. The method demonstrates the effectiveness of multi-step Langevin updates on both data and latent variables during distillation. EMD shows promising results on computationally expensive text-to-image generation by effectively distilling Stable Diffusion models. EMD currently relies on initializing the student model from the teacher model for optimal performance. The method's reliance on multi-step sampling introduces additional computational cost during training. diffusion models, generative models, knowledge distillation, image generation, text-to-image generation
2405.16849 Report Sync4D: Video Guided Controllable Dynamics for Physics-Based 4D Generation Zhoujie Fu, Jiacheng Wei, Wenhao Shen, Chaoyue Song, Xiaofeng Yang, Fayao Liu, Xulei Yang, Guosheng Lin In this work, we introduce a novel approach for creating controllable dynamics in 3D-generated Gaussians using casually captured reference videos. Our method transfers the motion of objects from reference videos to a variety of generated 3D Gaussians across different categories, ensuring precise and customizable motion transfer. We achieve this by employing blend skinning-based non-parametric shape reconstruction to extract the shape and motion of reference objects. This process involves segmenting the reference objects into motion-related parts based on skinning weights and establishing shape correspondences with generated target shapes. To address shape and temporal inconsistencies prevalent in existing methods, we integrate physical simulation, driving the target shapes with matched motion. This integration is optimized through a displacement loss to ensure reliable and genuine dynamics. Our approach supports diverse reference inputs, including humans, quadrupeds, and articulated objects, and can generate dynamics of arbitrary length, providing enhanced fidelity and applicability. Unlike methods heavily reliant on diffusion video generation models, our technique offers specific and high-quality motion transfer, maintaining both shape integrity and temporal consistency. This paper introduces Sync4D, a novel method for generating controllable dynamics in 3D-generated Gaussians by transferring motion from casually captured videos. Existing methods for dynamic 3D content generation often struggle with inaccurate motion representations, shape inconsistency, and lack of precise motion control. Sync4D addresses these limitations by leveraging real-world video guidance and physical simulation. The method involves shape reconstruction from the reference video, establishing shape correspondences between reference and target objects, and integrating physical simulation to drive the target shape with matched motion, optimized by a displacement loss. Sync4D successfully transfers motion from various sources (humans, animals, objects) to diverse 3D Gaussian objects, ensuring high fidelity and customization across categories. The method maintains shape integrity and temporal consistency in generated dynamics, outperforming existing approaches relying on video diffusion models. By integrating physical simulation and optimizing with a displacement loss, Sync4D ensures realistic and plausible motions while minimizing cumulative errors. Sync4D faces challenges transferring motion between objects with significantly different topologies. The initial pose of the reference video and generated 3D object cannot be substantially different due to the method's focus on relative motion learning. 4d generation, motion transfer, physical simulation, 3d gaussian, shape reconstruction
2405.16847 Report TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction Yinda Chen, Haoyuan Shi, Xiaoyu Liu, Te Shi, Ruobing Zhang, Dong Liu, Zhiwei Xiong, Feng Wu Autoregressive next-token prediction is a standard pretraining method for large-scale language models, but its application to vision tasks is hindered by the non-sequential nature of image data, leading to cumulative errors. Most vision models employ masked autoencoder (MAE) based pretraining, which faces scalability issues. To address these challenges, we introduce \textbf{TokenUnify}, a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. We provide theoretical evidence demonstrating that TokenUnify mitigates cumulative errors in visual autoregression. Cooperated with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution, ideal for creating spatially correlated long sequences. This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date and providing a unified benchmark for experimental validation. Leveraging the Mamba network inherently suited for long-sequence modeling on this dataset, TokenUnify not only reduces the computational complexity but also leads to a significant 45\% improvement in segmentation performance on downstream EM neuron segmentation tasks compared to existing methods. Furthermore, TokenUnify demonstrates superior scalability over MAE and traditional autoregressive methods, effectively bridging the gap between pretraining strategies for language and vision models. Code is available at \url{https://github.com/ydchen0806/TokenUnify}. Introduces TokenUnify, a novel pretraining method for visual autoregression that integrates random token prediction, next-token prediction, and next-all token prediction. Addresses the limitations of existing vision pretraining methods like masked autoencoders (scalability) and traditional autoregression (cumulative errors). 1. Proposes TokenUnify to mitigate cumulative errors in autoregression. 2. Introduces Mamba architecture for efficient long-sequence modeling. 3. Compiles a large-scale, ultra-high-resolution 3D electron microscopy (EM) dataset of mouse brain slices. TokenUnify led to a 45% improvement in performance on EM neuron segmentation tasks. TokenUnify outperformed MAE by 21% in pretraining performance with fewer parameters. TokenUnify demonstrated superior scaling properties compared to MAE and traditional autoregressive methods. Effectiveness on natural images and diverse downstream tasks needs further validation. Future work includes exploring model lightweighting and efficient fine-tuning strategies. pretraining, vision models, autoregression, electron microscopy, segmentation
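A schematic sketch of mixing the three objectives is given below. It makes several simplifying assumptions: `model` is a placeholder returning per-position logits over the unified vocabulary (and is assumed to support a bidirectional pass on masked inputs), and the "next-all" term is written as a surrogate in which each prefix position matches the empirical distribution of all of its future tokens. The paper's exact formulation may differ.

```python
# Simplified mixture of random-token, next-token, and next-all token objectives.
import torch
import torch.nn.functional as F

def tokenunify_loss(model, tokens, mask_id=0, mask_ratio=0.3, weights=(1.0, 1.0, 1.0)):
    """tokens: (B, L) discrete visual tokens; `model` maps ids to per-position logits."""
    B, L = tokens.shape
    logits = model(tokens[:, :-1])                       # (B, L-1, V), causal pass

    # 1) next-token prediction: position t predicts token t+1
    next_tok = F.cross_entropy(logits.flatten(0, 1), tokens[:, 1:].flatten())

    # 2) random-token prediction: mask a random subset and predict the masked ids
    #    (assumes the backbone can also be run on masked inputs, MAE/BERT-style)
    mask = torch.rand(B, L, device=tokens.device) < mask_ratio
    masked = tokens.masked_fill(mask, mask_id)
    rand_tok = F.cross_entropy(model(masked)[mask], tokens[mask])

    # 3) next-all prediction (surrogate): each prefix position matches the
    #    empirical distribution of all of its remaining future tokens
    log_probs = F.log_softmax(logits, dim=-1)            # (B, L-1, V)
    future = F.one_hot(tokens[:, 1:], log_probs.size(-1)).float()
    future = future.flip(1).cumsum(1).flip(1)            # counts of remaining tokens
    future = future / future.sum(-1, keepdim=True).clamp(min=1.0)
    next_all = -(future * log_probs).sum(-1).mean()

    w1, w2, w3 = weights
    return w1 * rand_tok + w2 * next_tok + w3 * next_all
```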
2405.16829 Report PyGS: Large-scale Scene Representation with Pyramidal 3D Gaussian Splatting Zipeng Wang, Dan Xu Neural Radiance Fields (NeRFs) have demonstrated remarkable proficiency in synthesizing photorealistic images of large-scale scenes. However, they are often plagued by a loss of fine details and long rendering durations. 3D Gaussian Splatting has recently been introduced as a potent alternative, achieving both high-fidelity visual results and accelerated rendering performance. Nonetheless, scaling 3D Gaussian Splatting is fraught with challenges. Specifically, large-scale scenes grapples with the integration of objects across multiple scales and disparate viewpoints, which often leads to compromised efficacy as the Gaussians need to balance between detail levels. Furthermore, the generation of initialization points via COLMAP from large-scale dataset is both computationally demanding and prone to incomplete reconstructions. To address these challenges, we present Pyramidal 3D Gaussian Splatting (PyGS) with NeRF Initialization. Our approach represent the scene with a hierarchical assembly of Gaussians arranged in a pyramidal fashion. The top level of the pyramid is composed of a few large Gaussians, while each subsequent layer accommodates a denser collection of smaller Gaussians. We effectively initialize these pyramidal Gaussians through sampling a rapidly trained grid-based NeRF at various frequencies. We group these pyramidal Gaussians into clusters and use a compact weighting network to dynamically determine the influence of each pyramid level of each cluster considering camera viewpoint during rendering. Our method achieves a significant performance leap across multiple large-scale datasets and attains a rendering time that is over 400 times faster than current state-of-the-art approaches. This paper introduces PyGS, a novel multi-scale 3D Gaussian Splatting framework designed for efficient and detailed large-scale scene representation. Existing NeRF-based methods struggle with fine detail rendering and speed in large scenes, while 3D Gaussian Splatting faces challenges with multi-scale objects and slow initialization in such settings. PyGS utilizes a hierarchical structure of 3D Gaussians, organized into pyramid levels for multi-scale detail capture. It initializes these Gaussians efficiently using a coarsely trained grid-based NeRF and dynamically adjusts level weights during rendering via a compact weighting network informed by camera viewpoint and cluster embeddings. PyGS outperforms state-of-the-art NeRF-based methods and original 3DGS across various metrics on four large-scale datasets, achieving high-fidelity results with a significant speed boost. NeRF-based initialization proves superior to random or COLMAP-based methods, yielding denser point clouds with better geometric details. The adaptive weighting strategy significantly enhances rendering quality compared to simpler alternatives. Modeling even larger environments necessitates further exploration of parallel optimization techniques due to substantial memory and computational demands. Future research can investigate the application of PyGS in related domains, such as 3D reconstruction, scene editing, and virtual reality. neural radiance fields, 3d gaussian splatting, large-scale scene representation, multi-scale modeling, novel view synthesis
2405.16823 Report Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection Gihyun Kwon, Jangho Park, Jong Chul Ye While text-to-image models have achieved impressive capabilities in image generation and editing, their application across various modalities often necessitates training separate models. Inspired by existing method of single image editing with self attention injection and video editing with shared attention, we propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D image text-to-image (T2I) diffusion model. Specifically, we design a sampling method that facilitates editing consecutive images while maintaining semantic consistency utilizing shared self-attention features during both reference and consecutive image sampling processes. Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images. This paper proposes a novel unified editing method that enables seamless editing across panorama images, videos, and 3D scenes using only a single 2D image text-to-image diffusion model. Existing text-to-image models often require separate models for different modalities (3D, video, panorama), leading to difficulty in attribute editing and higher resource consumption. This method aims to overcome these challenges by using a single 2D model for all. The method leverages the sequential nature of images in different modalities. It combines the strengths of single image editing (using self-attention injection) and sequential image editing (using shared attention) by employing two parallel paths: disentangled editing on a reference image and context transfer using shared self-attention features. Outperforms baseline methods in 3D scene editing, achieving superior semantic object editing and overall style transfer while preserving scene structure. Successfully edits panorama images, demonstrating better text alignment and structural consistency compared to existing techniques. Achieves impressive results in video editing, showing superior text-guided semantic changes and cross-frame consistency. Maintaining consistency can be challenging when the semantic distance between sequential frames is significantly large. The ability to edit using inappropriate text prompts raises ethical concerns. text-to-image, diffusion models, image editing, 3d scene editing, video editing, panorama editing
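The shared self-attention mechanism can be sketched generically: queries come from the frame currently being edited, while keys and values are concatenated with those of the reference frame, so consecutive frames inherit the reference edit. This is plain scaled dot-product attention under assumed shapes, not the exact attention hooks used in the paper.

```python
# Sketch: current-frame queries attend over current + reference frame keys/values.
import torch
import torch.nn.functional as F

def shared_self_attention(q_frame, kv_reference, w_q, w_k, w_v):
    """q_frame, kv_reference: (B, N, C) token features of current / reference frame."""
    q = q_frame @ w_q
    k = torch.cat([q_frame, kv_reference], dim=1) @ w_k   # share reference keys
    v = torch.cat([q_frame, kv_reference], dim=1) @ w_v   # and reference values
    attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
    return attn @ v                                       # (B, N, C)

# Usage with random stand-in features and projection weights:
B, N, C = 1, 1024, 320
w = [torch.randn(C, C) / C ** 0.5 for _ in range(3)]
out = shared_self_attention(torch.randn(B, N, C), torch.randn(B, N, C), *w)
```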
2405.16822 Report Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels Yikai Wang, Xinzhou Wang, Zilong Chen, Zhengyi Wang, Fuchun Sun, Jun Zhu Video generative models are receiving particular attention given their ability to generate realistic and imaginative frames. Besides, these models are also observed to exhibit strong 3D consistency, significantly enhancing their potential to act as world simulators. In this work, we present Vidu4D, a novel reconstruction model that excels in accurately reconstructing 4D (i.e., sequential 3D) representations from single generated videos, addressing challenges associated with non-rigidity and frame distortion. This capability is pivotal for creating high-fidelity virtual contents that maintain both spatial and temporal coherence. At the core of Vidu4D is our proposed Dynamic Gaussian Surfels (DGS) technique. DGS optimizes time-varying warping functions to transform Gaussian surfels (surface elements) from a static state to a dynamically warped state. This transformation enables a precise depiction of motion and deformation over time. To preserve the structural integrity of surface-aligned Gaussian surfels, we design the warped-state geometric regularization based on continuous warping fields for estimating normals. Additionally, we learn refinements on rotation and scaling parameters of Gaussian surfels, which greatly alleviates texture flickering during the warping process and enhances the capture of fine-grained appearance details. Vidu4D also contains a novel initialization state that provides a proper start for the warping fields in DGS. Equipping Vidu4D with an existing video generative model, the overall framework demonstrates high-fidelity text-to-4D generation in both appearance and geometry. Introduces Vidu4D, a novel reconstruction model that generates accurate 4D representations from single generated videos, addressing challenges like non-rigidity and frame distortion. Enables creation of high-fidelity virtual content with strong spatial and temporal coherence, crucial for VR, visualization, and AI. Utilizes Dynamic Gaussian Surfels (DGS), optimizing time-varying warping functions for transforming Gaussian surfels to depict motion and deformation. Incorporates warped-state normal regularization and refinement of Gaussian surfel parameters for accurate geometry and appearance. Achieves superior novel-view reconstruction compared to state-of-the-art methods in terms of detail preservation, texture quality, and geometric accuracy. Quantitative evaluation shows significant improvements in PSNR, SSIM, and LPIPS metrics. Ablation studies confirm the effectiveness of warped-state regularization and refinement strategies in DGS. Current limitations include dependence on video quality, scalability for large scenes, and computational demands for real-time applications. Future work will address these limitations and explore applications in content creation and editing. 4d reconstruction, video generation, dynamic gaussian surfels, non-rigid deformation, text-to-4d generation
2405.16803 Report TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing Xinyu Zhang, Mengxue Kang, Fei Wei, Shuang Xu, Yuhe Liu, Lin Ma As the field of image generation rapidly advances, traditional diffusion models and those integrated with multimodal large language models (LLMs) still encounter limitations in interpreting complex prompts and preserving image consistency pre and post-editing. To tackle these challenges, we present an innovative image editing framework that employs the robust Chain-of-Thought (CoT) reasoning and localizing capabilities of multimodal LLMs to aid diffusion models in generating more refined images. We first meticulously design a CoT process comprising instruction decomposition, region localization, and detailed description. Subsequently, we fine-tune the LISA model, a lightweight multimodal LLM, using the CoT process of Multimodal LLMs and the mask of the edited image. By providing the diffusion models with knowledge of the generated prompt and image mask, our models generate images with a superior understanding of instructions. Through extensive experiments, our model has demonstrated superior performance in image generation, surpassing existing state-of-the-art models. Notably, our model exhibits an enhanced ability to understand complex prompts and generate corresponding images, while maintaining high fidelity and consistency in images before and after generation. This paper proposes a novel image editing framework leveraging the reasoning and localizing capabilities of multimodal LLMs to enhance diffusion models for generating high-fidelity images from complex textual prompts. Current diffusion models and those integrated with LLMs face challenges in interpreting complex prompts and preserving image consistency pre- and post-editing. This work aims to address these limitations for more sophisticated and accurate image generation. The framework utilizes a Chain-of-Thought (CoT) process comprising instruction decomposition, region localization, and detailed description. It fine-tunes a lightweight multimodal LLM (LISA) with CoT data from GPT-4V and employs it to generate precise masks and inpainting prompts for a diffusion-based inpainting model. The model demonstrates superior performance in following complex instructions for image editing compared to existing state-of-the-art models. It generates images with high fidelity, preserving the content of the original image while accurately modifying the specified regions. The framework proves to be both effective and efficient, benefiting from the reasoning abilities of LLMs and the fine-tuned LISA model's performance and speed. The work is limited by the quantity and quality of the training dataset, which restricts the model's ability to generate precise, object-level masks. The inpainting quality heavily relies on the prompt descriptions and the inherent randomness of diffusion models, affecting consistency. image editing, diffusion models, multimodal llms, chain-of-thought, high-fidelity generation
2405.16788 Report 3D Reconstruction with Fast Dipole Sums Hanyu Chen, Bailey Miller, Ioannis Gkioulekas We introduce a technique for the reconstruction of high-fidelity surfaces from multi-view images. Our technique uses a new point-based representation, the dipole sum, which generalizes the winding number to allow for interpolation of arbitrary per-point attributes in point clouds with noisy or outlier points. Using dipole sums allows us to represent implicit geometry and radiance fields as per-point attributes of a point cloud, which we initialize directly from structure from motion. We additionally derive Barnes-Hut fast summation schemes for accelerated forward and reverse-mode dipole sum queries. These queries facilitate the use of ray tracing to efficiently and differentiably render images with our point-based representations, and thus update their point attributes to optimize scene geometry and appearance. We evaluate this inverse rendering framework against state-of-the-art alternatives, based on ray tracing of neural representations or rasterization of Gaussian point-based representations. Our technique significantly improves reconstruction quality at equal runtimes, while also supporting more general rendering techniques such as shadow rays for direct illumination. In the supplement, we provide interactive visualizations of our results. This paper introduces "dipole sum," a novel point-based representation for reconstructing high-fidelity surfaces from multi-view images using an inverse rendering framework. Existing neural rendering techniques often struggle with high computational costs and difficulties leveraging 3D information from structure from motion. This paper addresses these limitations by enabling efficient and direct utilization of point clouds for high-quality surface reconstruction. The methodology involves generalizing the winding number concept to allow interpolation of attributes in noisy point clouds, using this to represent geometry and radiance fields, and leveraging Barnes-Hut fast summation for efficient computation and backpropagation during inverse rendering. The proposed technique significantly improves reconstruction quality at equal runtimes compared to state-of-the-art alternatives like neural and Gaussian representations. It supports more general rendering techniques such as shadow rays, enhancing the accuracy of direct illumination. The method directly leverages and refines point clouds from structure from motion, improving efficiency and detail in surface reconstruction. The paper acknowledges difficulties in accurately reconstructing surfaces with strong specular reflections, highlighting a need for improved handling of such appearances. While the method demonstrates the potential for use with advanced rendering algorithms like path tracing, further investigation is needed to fully explore these capabilities. winding number, point-based modeling, inverse rendering, 3d reconstruction, ray tracing
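The dipole sum generalizes the winding number of an oriented point cloud; the sketch below evaluates the plain winding number by brute force (the paper replaces the unit weight with arbitrary per-point attributes and accelerates the sum with Barnes-Hut). The unit-sphere test is only a sanity check of the base quantity.

```python
import numpy as np

def winding_number(queries, points, normals, areas):
    """Brute-force generalized winding number of an oriented point cloud,
    w(q) = sum_i a_i <n_i, p_i - q> / (4*pi*|p_i - q|^3).
    The paper's dipole sum generalizes the per-point weight "1" to arbitrary
    attributes and speeds the sum up with Barnes-Hut; this O(N*M) version is
    only meant to illustrate the base quantity."""
    diff = points[None, :, :] - queries[:, None, :]        # (M, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                   # (M, N)
    dot = np.einsum('mnk,nk->mn', diff, normals)           # <n_i, p_i - q>
    kernel = dot / (4.0 * np.pi * np.maximum(dist, 1e-8) ** 3)
    return kernel @ areas                                  # (M,)

# Sanity check: points sampled on a unit sphere; a query inside should give
# roughly 1, a query far outside roughly 0.
rng = np.random.default_rng(0)
pts = rng.normal(size=(20000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
nrm = pts.copy()                           # outward normals of the sphere
area = np.full(len(pts), 4 * np.pi / len(pts))
q = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
print(winding_number(q, pts, nrm, area))   # approx [1.0, 0.0]
```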
2405.16785 Report PromptFix: You Prompt and We Fix the Photo Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, Jiebo Luo Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks, allowing image processing to adhere to human instructions. However, the lack of diverse instruction-following data hampers the development of models that effectively recognize and execute user-customized instructions, particularly in low-level tasks. Moreover, the stochastic nature of the diffusion process leads to deficiencies in image generation or editing tasks that require the detailed preservation of the generated images. To address these limitations, we propose PromptFix, a comprehensive framework that enables diffusion models to follow human instructions to perform a wide variety of image-processing tasks. First, we construct a large-scale instruction-following dataset that covers comprehensive image-processing tasks, including low-level tasks, image editing, and object creation. Next, we propose a high-frequency guidance sampling method to explicitly control the denoising process and preserve high-frequency details in unprocessed areas. Finally, we design an auxiliary prompting adapter, utilizing Vision-Language Models (VLMs) to enhance text prompts and improve the model's task generalization. Experimental results show that PromptFix outperforms previous methods in various image-processing tasks. Our proposed model also achieves comparable inference efficiency with these baseline models and exhibits superior zero-shot capabilities in blind restoration and combination tasks. The dataset and code will be available at https://github.com/yeates/PromptFix. This paper proposes PromptFix, a novel diffusion-based model with an accompanying large-scale visual-instruction training dataset, aimed at improving instruction-guided low-level image processing. Existing instruction-following datasets lack diversity and struggle with low-level tasks, hindering the development of effective models for detailed image processing. PromptFix leverages High-frequency Guidance Sampling to preserve spatial details and a VLM-based Auxiliary Prompt Module to enhance semantic understanding and adapt to severe image degradation. PromptFix demonstrates superior performance in instruction-based image processing tasks, surpassing existing methods in colorization, watermark removal, and object removal. The model exhibits strong zero-shot capabilities, effectively handling blind restoration for low-light enhancement, desnowing, and dehazing. PromptFix excels in multi-task processing, demonstrating the ability to address multiple low-level tasks within a single image. Blind restoration using PromptFix occasionally leads to out-of-conditioned image control, highlighting the need for user-specified instructions when possible. While High-frequency Guidance Sampling enhances detail preservation, it can slightly reduce overall image quality. image processing, diffusion models, vision-language models, image restoration, instruction following
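High-frequency guidance aims to keep fine detail in unprocessed regions intact. Below is a rough stand-in, assuming a simple FFT low/high split and an edit mask: the output keeps its own low frequencies but inherits the source's high frequencies outside the mask. The paper applies its guidance inside the denoising loop rather than as a post-hoc blend, so treat this as a conceptual sketch.

```python
import torch

def highfreq_transfer(output, source, mask, cutoff=0.1):
    """Replace high-frequency content of `output` with that of `source`
    outside the edited region. `output`, `source`: (B, C, H, W) in [0, 1];
    `mask`: (B, 1, H, W), 1 = edited area."""
    B, C, H, W = source.shape
    fy = torch.fft.fftfreq(H, device=source.device).view(1, 1, H, 1)
    fx = torch.fft.fftfreq(W, device=source.device).view(1, 1, 1, W)
    lowpass = ((fy ** 2 + fx ** 2).sqrt() < cutoff).float()

    def split(x):
        spec = torch.fft.fft2(x)
        low = torch.fft.ifft2(spec * lowpass).real
        return low, x - low

    out_low, _ = split(output)
    _, src_high = split(source)
    blended = out_low + src_high
    return mask * output + (1 - mask) * blended

out = torch.rand(1, 3, 64, 64)
src = torch.rand(1, 3, 64, 64)
m = torch.zeros(1, 1, 64, 64); m[..., 16:48, 16:48] = 1.0
print(highfreq_transfer(out, src, m).shape)   # torch.Size([1, 3, 64, 64])
```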
2405.16645 Report Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models Hanwen Liang, Yuyang Yin, Dejia Xu, Hanxue Liang, Zhangyang Wang, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei The availability of large-scale multimodal datasets and advancements in diffusion models have significantly accelerated progress in 4D content generation. Most prior approaches rely on multiple image or video diffusion models, utilizing score distillation sampling for optimization or generating pseudo novel views for direct supervision. However, these methods are hindered by slow optimization speeds and multi-view inconsistency issues. Spatial and temporal consistency in 4D geometry has been extensively explored respectively in 3D-aware diffusion models and traditional monocular video diffusion models. Building on this foundation, we propose a strategy to migrate the temporal consistency in video diffusion models to the spatial-temporal consistency required for 4D generation. Specifically, we present a novel framework, \textbf{Diffusion4D}, for efficient and scalable 4D content generation. Leveraging a meticulously curated dynamic 3D dataset, we develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets. To control the dynamic strength of these assets, we introduce a 3D-to-4D motion magnitude metric as guidance. Additionally, we propose a novel motion magnitude reconstruction loss and 3D-aware classifier-free guidance to refine the learning and generation of motion dynamics. After obtaining orbital views of the 4D asset, we perform explicit 4D construction with Gaussian splatting in a coarse-to-fine manner. The synthesized multi-view consistent 4D image set enables us to swiftly generate high-fidelity and diverse 4D assets within just several minutes. Extensive experiments demonstrate that our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency across various prompt modalities. Presents Diffusion4D, a novel framework for efficient and consistent 4D content generation leveraging 4D-aware video diffusion models and explicit 4D construction. Addresses the limitations of existing 4D generation methods, such as slow optimization speed and multi-view inconsistency, aiming for efficient and consistent generation of dynamic 3D content. 1. Curates a large-scale, high-quality 4D dataset from existing 3D datasets. 2. Develops a 4D-aware video diffusion model to synthesize orbital views of dynamic 3D assets, incorporating a 3D-to-4D motion magnitude metric and guidance. 3. Performs explicit 4D construction using Gaussian splatting with a coarse-to-fine strategy. Achieves state-of-the-art performance in text-to-4D and image-to-4D generation, outperforming baselines in terms of generation efficiency and 4D geometry consistency. Successfully generates dynamic 3D assets from static 3D content, demonstrating the versatility of the framework. Shows significant improvement in quantitative metrics (CLIP, LPIPS, PSNR, SSIM, FVD) and qualitative evaluations (user study) compared to existing methods. Current implementation uses a limited video resolution and temporal sequence length. Dataset diversity and quality can be further improved. 4d content generation, video diffusion models, 3d-to-4d motion magnitude, gaussian splatting, spatial-temporal consistency
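The exact definition of the 3D-to-4D motion magnitude metric is not spelled out in this summary; a plausible toy version, used here purely for illustration, averages per-frame displacements of tracked surface points and could serve as the scalar conditioning that controls dynamic strength.

```python
import torch

def motion_magnitude(point_tracks):
    """Toy 3D-to-4D motion magnitude: mean displacement of tracked surface
    points between consecutive frames of a dynamic 3D asset.
    point_tracks: (T, N, 3) positions over T frames. Illustrative guess at
    the kind of scalar used to condition dynamic strength."""
    disp = point_tracks[1:] - point_tracks[:-1]        # (T-1, N, 3)
    return disp.norm(dim=-1).mean()

tracks = torch.cumsum(0.01 * torch.randn(16, 2048, 3), dim=0)
print(float(motion_magnitude(tracks)))   # small positive scalar
```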
2405.16605 Report Demystify Mamba in Vision: A Linear Attention Perspective Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with linear attention Transformer, which typically underperform conventional Transformer in practice. By exploring the similarities and disparities between the effective Mamba and subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba's success. Specifically, we reformulate the selective state space model and linear attention within a unified formulation, rephrasing Mamba as a variant of linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks. Interestingly, the results highlight the forget gate and block design as the core contributors to Mamba's success, while the other four designs are less crucial. Based on these findings, we propose a Mamba-Like Linear Attention (MLLA) model by incorporating the merits of these two key designs into linear attention. The resulting model outperforms various vision Mamba models in both image classification and high-resolution dense prediction tasks, while enjoying parallelizable computation and fast inference speed. Code is available at https://github.com/LeapLabTHU/MLLA. This paper reveals the close relationship between the efficient Mamba model and the linear attention Transformer, analyzing their similarities and disparities to understand the key factors behind Mamba's effectiveness. Mamba has shown impressive performance in various vision tasks with linear computation complexity, but it surprisingly shares similarities with the less effective linear attention Transformer, demanding an investigation into the reasons behind this difference. The paper reformulates selective state space model (Mamba) and linear attention within a unified framework, identifying six distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. The impact of each distinction on model performance is then empirically evaluated through ablations on vision tasks. The forget gate and block design are identified as the core contributors to Mamba's superior performance. The forget gate, while effective, necessitates recurrent computation that might not be ideal for vision models and can be replaced by suitable positional encoding. A novel Mamba-Like Linear Attention (MLLA) model, incorporating the merits of Mamba's design into linear attention, outperforms various vision Mamba models in image classification and dense prediction tasks, while enabling parallelizable computation. The analysis might not cover all subtle implementation differences between Mamba and linear attention Transformer. Future work can investigate alternative parallelizable mechanisms to replace the forget gate for improved performance. mamba, linear attention, transformer, vision transformer, state space model
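The ingredient the paper credits most for Mamba's edge is a forget gate on the recurrent state. Here is a minimal single-head gated linear attention written in its recurrent form, with no attention normalization (one of the six distinctions discussed); the scalar sigmoid gate and layer sizes are simplifications rather than the MLLA architecture.

```python
import torch
import torch.nn as nn

class GatedLinearAttention(nn.Module):
    """Single-head linear attention with a data-dependent forget gate on the
    recurrent state. Minimal sketch; the paper's MLLA additionally adopts
    Mamba's block design and replaces the gate's recurrence with positional
    encodings for parallelizable vision models."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)   # per-token scalar forget gate

    def forward(self, x):               # x: (B, T, D)
        B, T, D = x.shape
        q = torch.nn.functional.elu(self.q(x)) + 1   # positive feature map
        k = torch.nn.functional.elu(self.k(x)) + 1
        v = self.v(x)
        a = torch.sigmoid(self.gate(x))               # (B, T, 1) in (0, 1)
        S = x.new_zeros(B, D, D)                      # running k v^T state
        ys = []
        for t in range(T):
            S = a[:, t].unsqueeze(-1) * S \
                + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
            ys.append(torch.einsum('bd,bde->be', q[:, t], S))
        return torch.stack(ys, dim=1)                 # (B, T, D)

attn = GatedLinearAttention(64)
print(attn(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```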
2405.16596 Report Protect-Your-IP: Scalable Source-Tracing and Attribution against Personalized Generation Runyi Li, Xuanyu Zhang, Zhipei Xu, Yongbing Zhang, Jian Zhang With the advent of personalized generation models, users can more readily create images resembling existing content, heightening the risk of violating portrait rights and intellectual property (IP). Traditional post-hoc detection and source-tracing methods for AI-generated content (AIGC) employ proactive watermark approaches; however, these are less effective against personalized generation models. Moreover, attribution techniques for AIGC rely on passive detection but often struggle to differentiate AIGC from authentic images, presenting a substantial challenge. Integrating these two processes into a cohesive framework not only meets the practical demands for protection and forensics but also improves the effectiveness of attribution tasks. Inspired by this insight, we propose a unified approach for image copyright source-tracing and attribution, introducing an innovative watermarking-attribution method that blends proactive and passive strategies. We embed copyright watermarks into protected images and train a watermark decoder to retrieve copyright information from the outputs of personalized models, using this watermark as an initial step for confirming if an image is AIGC-generated. To pinpoint specific generation techniques, we utilize powerful visual backbone networks for classification. Additionally, we implement an incremental learning strategy to adeptly attribute new personalized models without losing prior knowledge, thereby enhancing the model's adaptability to novel generation methods. We have conducted experiments using various celebrity portrait series sourced online, and the results affirm the efficacy of our method in source-tracing and attribution tasks, as well as its robustness against knowledge forgetting. This paper proposes a novel framework for source-tracing and attribution of personalized generated images, employing a combination of proactive watermarking and passive detection mechanisms. The rise of personalized AI image generation models poses significant threats to portrait rights and intellectual property (IP) by enabling easy creation of images resembling existing content. This work embeds copyright watermarks into protected images using a box-free watermarking technique. These watermarks are detectable even after images are processed by personalized generation models, allowing for source-tracing. For attribution, a hierarchical approach is proposed, first detecting the presence of the watermark and then classifying the specific generation method using a visual backbone network. An incremental learning strategy is also incorporated for adaptable attribution of newly emerging generation methods. The proposed watermarking method effectively embeds copyright information while preserving image quality, outperforming the compared method. The combined proactive and passive attribution approach achieves high accuracy in both detecting AI-generated content and identifying the specific generation method. The implemented incremental learning strategy effectively updates the attribution model for new generation methods while mitigating catastrophic forgetting. The dataset used for training and validation could be more extensive. The attribution approach currently requires an extra training process for new generation methods, and developing a more flexible and self-adaptive approach would be beneficial. ai-generated content, copyright protection, source-tracing, attribution, watermarking
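A toy version of the hierarchical source-tracing-then-attribution pipeline described above, with untrained stand-in networks: a watermark decoder first checks whether a registered copyright code is present, and only then a visual backbone attributes the generation method. The bit length, match threshold, and five-class attributor are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-in networks (untrained): (1) a watermark decoder that recovers
# copyright bits (proactive source-tracing); (2) a visual backbone that
# attributes the generation method (passive attribution).
wm_decoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 48))
attributor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 5))  # 5 known generators

def trace_and_attribute(image, registered_bits, threshold=0.8):
    bits = (torch.sigmoid(wm_decoder(image)) > 0.5).float()
    match = (bits == registered_bits).float().mean().item()
    if match < threshold:
        return {"aigc": False, "method": None, "bit_match": match}
    method = int(attributor(image).argmax(dim=-1))
    return {"aigc": True, "method": method, "bit_match": match}

img = torch.rand(1, 3, 64, 64)
print(trace_and_attribute(img, registered_bits=torch.zeros(1, 48)))
```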
2405.16570 Report ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling Francesca Babiloni, Alexandros Lattas, Jiankang Deng, Stefanos Zafeiriou We propose ID-to-3D, a method to generate identity- and text-guided 3D human heads with disentangled expressions, starting from even a single casually captured in-the-wild image of a subject. The foundation of our approach is anchored in compositionality, alongside the use of task-specific 2D diffusion models as priors for optimization. First, we extend a foundational model with a lightweight expression-aware and ID-aware architecture, and create 2D priors for geometry and texture generation, via fine-tuning only 0.2% of its available training parameters. Then, we jointly leverage a neural parametric representation for the expressions of each subject and a multi-stage generation of highly detailed geometry and albedo texture. This combination of strong face identity embeddings and our neural representation enables accurate reconstruction of not only facial features but also accessories and hair and can be meshed to provide render-ready assets for gaming and telepresence. Our results achieve an unprecedented level of identity-consistent and high-quality texture and geometry generation, generalizing to a ``world'' of unseen 3D identities, without relying on large 3D captured datasets of human assets. ID-to-3D: a method for generating identity- and text-guided 3D human heads with disentangled expressions from a single in-the-wild image. Existing methods struggle to generate high-quality 3D head avatars with personalized identity and expressions due to limitations in 3D data and disentangling geometry, texture, and lighting. The method leverages compositionality and task-specific 2D diffusion models as priors. It uses ArcFace embeddings for identity, a neural parametric representation for expressions, and a two-stage Score Distillation Sampling pipeline for generating geometry and albedo texture. Outperforms text-based and image-based SDS baselines in generating 3D heads with superior geometric details and texture quality. Generates a wide variety of ID-consistent expressions, captured by latent codes. Allows for ID-consistent editing of geometry and appearance using text prompts. Generalization capacity is limited by the used face embedding network and diffusion model, potentially introducing biases. Lack of specific optimization for physically bounded textures and geometries might occasionally produce unnatural facial characteristics. 3d head generation, score distillation sampling, identity-consistent, expressive avatars, diffusion models
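ID-to-3D optimizes the head representation with Score Distillation Sampling against fine-tuned 2D priors. Below is a generic single-step SDS sketch with a dummy noise-prediction prior and a directly optimized image standing in for the rendered asset; the paper's ID/expression conditioning and separate geometry/albedo stages are not modeled here.

```python
import torch

def sds_step(render, eps_model, cond, alphas_cumprod, params, lr=0.01, w=1.0):
    """One generic Score Distillation Sampling update. `render` is the
    differentiable rendering of the current asset, `eps_model` a frozen
    noise-prediction diffusion prior conditioned on `cond`. Placeholder
    interface; not the paper's two-stage geometry/texture pipeline."""
    t = torch.randint(20, 980, (1,))
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(render)
    x_t = a_t.sqrt() * render + (1 - a_t).sqrt() * noise
    with torch.no_grad():
        eps_hat = eps_model(x_t, t, cond)
    # SDS gradient: w(t) * (eps_hat - noise), back-propagated through render.
    grad = w * (eps_hat - noise)
    loss = (grad.detach() * render).sum()
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad
            p.grad = None

# Toy usage with a dummy prior and a directly optimized "image".
theta = torch.rand(1, 3, 64, 64, requires_grad=True)
dummy_prior = lambda x_t, t, c: torch.zeros_like(x_t)
acp = torch.linspace(0.9999, 0.01, 1000)
sds_step(theta, dummy_prior, cond=None, alphas_cumprod=acp, params=[theta])
print(theta.requires_grad, theta.grad)   # True None
```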
2405.16567 Report Automatic Jailbreaking of the Text-to-Image Generative AI Systems Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang Recent AI systems have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation based on large language models (LLMs). At the same time, there are diverse safety risks that can cause the generation of malicious contents by circumventing the alignment in LLMs, which are often referred to as jailbreaking. However, most of the previous works only focused on the text-based jailbreaking in LLMs, and the jailbreaking of the text-to-image (T2I) generation system has been relatively overlooked. In this paper, we first evaluate the safety of the commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. From this empirical study, we find that Copilot and Gemini block only 12% and 17% of the attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. Then, we further propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our automated jailbreaking framework leverages an LLM optimizer to generate prompts to maximize degree of violation from the generated images without any weight updates or gradient computation. Surprisingly, our simple yet effective approach successfully jailbreaks the ChatGPT with 11.0% block rate, making it generate copyrighted contents in 76% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but found that they were inadequate, which suggests the necessity of stronger defense mechanisms. This paper proposes an Automated Prompt Generation Pipeline (APGP) to evaluate and expose the risk of copyright infringement in commercial text-to-image (T2I) generation systems. Despite the advancement of AI systems and their integration into commercial T2I platforms, the risk of copyright infringement remains a significant concern, and current systems lack robust evaluation mechanisms. The APGP leverages large language models (LLMs) to generate high-risk prompts from target images by optimizing a self-generated QA score and incorporating keyword penalties to bypass safety guards. The study reveals that most commercial T2I systems, including Midjourney, Gemini, and Copilot, exhibit a high likelihood of copyright violation even with simple prompts. ChatGPT, while initially appearing more secure, is also vulnerable to copyright infringement when tested with APGP-generated prompts, achieving a 76% violation rate. Simple defense mechanisms, such as copyright detection filtering and concept unlearning models, prove inadequate in mitigating the risks highlighted by the APGP. The violation rate can fluctuate due to the inherent randomness of commercial T2I systems. The paper's focus on copyright infringement analysis is primarily technical, lacking a comprehensive legal perspective on the observed violations. copyright infringement, text-to-image generation, jailbreaking, ai safety, large language models
2405.16555 Report vHeat: Building Vision Models upon Heat Conduction Zhaozhi Wang, Yue Liu, Yunfan Liu, Hongtian Yu, Yaowei Wang, Qixiang Ye, Yunjie Tian A fundamental problem in learning robust and expressive visual representations lies in efficiently estimating the spatial relationships of visual semantics throughout the entire image. In this study, we propose vHeat, a novel vision backbone model that simultaneously achieves both high computational efficiency and global receptive field. The essential idea, inspired by the physical principle of heat conduction, is to conceptualize image patches as heat sources and model the calculation of their correlations as the diffusion of thermal energy. This mechanism is incorporated into deep models through the newly proposed module, the Heat Conduction Operator (HCO), which is physically plausible and can be efficiently implemented using DCT and IDCT operations with a complexity of $\mathcal{O}(N^{1.5})$. Extensive experiments demonstrate that vHeat surpasses Vision Transformers (ViTs) across various vision tasks, while also providing higher inference speeds, reduced FLOPs, and lower GPU memory usage for high-resolution images. The code will be released at https://github.com/MzeroMiko/vHeat. This paper introduces vHeat, a novel vision backbone model inspired by the physical principle of heat conduction, achieving both high computational efficiency and global receptive field. Existing vision models, including CNNs, ViTs, and SSMs, struggle to balance computational complexity with the ability to capture long-range dependencies in images. vHeat addresses this challenge by modeling the propagation of visual semantics as heat diffusion. vHeat leverages the Heat Conduction Operator (HCO), which simulates visual heat conduction using 2D DCT and IDCT operations. This approach offers an interpretable mechanism for global information propagation with a complexity of O(N^1.5). vHeat outperforms benchmark models like ConvNeXt and Swin Transformers in image classification, object detection, and semantic segmentation tasks. vHeat demonstrates superior computational efficiency, exhibiting higher inference speeds, reduced FLOPs, and lower GPU memory usage, particularly for high-resolution images. Visualization analysis confirms vHeat's ability to establish global receptive fields and adapt its visual heat conduction based on image content. The training process of vHeat can be challenging when long-range information conduction is required, demanding extensive training for effective long-range dependency learning. A dedicated self-supervised learning method tailored for vHeat, similar to masked image modeling for ViTs, is yet to be developed. vision backbone, heat conduction, global receptive field, computational efficiency, image classification
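The Heat Conduction Operator has a closed form in the cosine basis: take a 2D DCT of the feature map, damp each frequency by exp(-k * (wx^2 + wy^2) * t), and invert. The self-contained sketch below writes the transforms as explicit orthonormal DCT matrix products for clarity; vHeat predicts the diffusivity/time rather than fixing them, and implements the transform with fast DCT/IDCT at the stated O(N^1.5) cost.

```python
import math
import torch

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); the inverse transform is C.T."""
    k = torch.arange(n).view(-1, 1).float()
    i = torch.arange(n).view(1, -1).float()
    C = torch.cos(math.pi * (2 * i + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
    C[0] /= math.sqrt(2.0)
    return C

def heat_conduction_operator(x, t=1.0, k_diff=1.0):
    """Closed-form heat diffusion of a feature map x: (B, C, H, W):
    U(t) = IDCT2( exp(-k * (wx^2 + wy^2) * t) * DCT2(x) ).
    Conceptual sketch of the HCO with fixed diffusivity and time."""
    B, C, H, W = x.shape
    Ch, Cw = dct_matrix(H).to(x), dct_matrix(W).to(x)
    freq_y = math.pi * torch.arange(H, device=x.device).float() / H
    freq_x = math.pi * torch.arange(W, device=x.device).float() / W
    decay = torch.exp(-k_diff * t * (freq_y[:, None] ** 2 + freq_x[None, :] ** 2))
    spec = Ch @ x @ Cw.T                 # 2D DCT via separable matrix products
    return Ch.T @ (decay * spec) @ Cw    # damp high frequencies, then IDCT

feat = torch.randn(2, 16, 32, 32)
print(heat_conduction_operator(feat, t=2.0).shape)   # torch.Size([2, 16, 32, 32])
```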
2405.16537 Report I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models Wenqi Ouyang, Yi Dong, Lei Yang, Jianlou Si, Xingang Pan The remarkable generative capabilities of diffusion models have motivated extensive research in both image and video editing. Compared to video editing which faces additional challenges in the time dimension, image editing has witnessed the development of more diverse, high-quality approaches and more capable software like Photoshop. In light of this gap, we introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits, effectively handling global edits, local edits, and moderate shape changes, which existing methods cannot fully achieve. At the core of our method are two main processes: Coarse Motion Extraction to align basic motion patterns with the original video, and Appearance Refinement for precise adjustments using fine-grained attention matching. We also incorporate a skip-interval strategy to mitigate quality degradation from auto-regressive generation across multiple video clips. Experimental results demonstrate our framework's superior performance in fine-grained video editing, proving its capability to produce high-quality, temporally consistent outputs. Presents I2VEdit, a framework for fine-grained video editing that propagates user-made edits from the first frame to the whole video using a pre-trained image-to-video model. Bridges the gap between advanced image editing tools and the limited capabilities of current video editing methods by leveraging the strength of image editing tools for video editing. Employs a two-stage pipeline: 1) Coarse Motion Extraction: learns motion patterns from the source video using LoRA and skip-interval cross-attention. 2) Appearance Refinement: fine-tunes appearance and motion using attention matching, enhanced by smooth area random perturbation (SARP) during latent inversion. Outperforms text-guided video editing and traditional image-guided methods in terms of editing quality, motion preservation, and appearance consistency. Demonstrates strong performance on various tasks, including local editing, global style transfer, and identity manipulation. Smooth area random perturbation (SARP) effectively addresses issues related to smooth regions during latent inversion, resulting in significant quality improvement. May produce minor color and texture inconsistencies in unedited areas. Editing quality may degrade for videos with significant content change across clips. video editing, diffusion models, image-to-video generation, attention mechanism, low-rank adaptation
2405.16534 Report Pruning for Robust Concept Erasing in Diffusion Models Tianyun Yang, Juan Cao, Chang Xu Despite the impressive capabilities of generating images, text-to-image diffusion models are susceptible to producing undesirable outputs such as NSFW content and copyrighted artworks. To address this issue, recent studies have focused on fine-tuning model parameters to erase problematic concepts. However, existing methods exhibit a major flaw in robustness, as fine-tuned models often reproduce the undesirable outputs when faced with cleverly crafted prompts. This reveals a fundamental limitation in the current approaches and may raise risks for the deployment of diffusion models in the open world. To address this gap, we locate the concept-correlated neurons and find that these neurons show high sensitivity to adversarial prompts, thus could be deactivated when erasing and reactivated again under attacks. To improve the robustness, we introduce a new pruning-based strategy for concept erasing. Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. Our method can be easily integrated with existing concept-erasing techniques, offering a robust improvement against adversarial inputs. Experimental results show a significant enhancement in our model's ability to resist adversarial inputs, achieving nearly a 40% improvement in erasing the NSFW content and a 30% improvement in erasing artwork style. This paper introduces a novel pruning-based strategy for concept erasing in text-to-image diffusion models, which enhances robustness against adversarial prompts. Existing concept erasing methods are vulnerable to adversarial prompts that can regenerate supposedly erased content, posing risks for real-world deployment of diffusion models. The method identifies concept-correlated neurons sensitive to adversarial prompts and uses a differentiable pruning strategy guided by the concept erasing objective to selectively prune parameters, reducing neuron sensitivity. The approach significantly improves robustness against adversarial attacks in erasing nudity, art styles, and objects. Pruning with erasing is found to be more effective than pruning before or after erasing. The method maintains good image generation quality for non-erased concepts. The concept neuron identification relies on a numerical criterion that may be sensitive to the erased model selection. Future work includes exploring more accurate concept neuron identification and investigating the potential for developing more sophisticated attack strategies. diffusion models, concept erasing, pruning, robustness, adversarial prompts
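A simplified sketch of locating concept-correlated neurons and pruning them: score each output neuron by how much more it activates on concept-related inputs than on neutral ones, then zero the most sensitive rows. The paper learns this selection differentiably together with the erasing objective; the hard top-k mask and the 2% ratio below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def concept_neuron_mask(layer, concept_acts, neutral_acts, prune_ratio=0.02):
    """Score the output neurons of a linear layer by how much more they fire
    on concept-related inputs than on neutral ones, and build a mask that
    zeroes the most concept-correlated rows.
    concept_acts / neutral_acts: (N, out_features) post-activation samples."""
    sensitivity = concept_acts.abs().mean(0) - neutral_acts.abs().mean(0)
    k = max(1, int(prune_ratio * layer.out_features))
    pruned = sensitivity.topk(k).indices
    mask = torch.ones(layer.out_features)
    mask[pruned] = 0.0
    return mask

layer = nn.Linear(128, 256)
concept = torch.relu(torch.randn(512, 256) + 0.5)   # placeholder activations
neutral = torch.relu(torch.randn(512, 256))
mask = concept_neuron_mask(layer, concept, neutral)
# Apply the mask permanently to the layer's parameters (structured pruning).
with torch.no_grad():
    layer.weight.mul_(mask[:, None])
    layer.bias.mul_(mask)
print(int((mask == 0).sum()), "neurons pruned")
```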
2405.16517 Report Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors Soumava Paul, Christopher Wewer, Bernt Schiele, Jan Eric Lenssen We aim to tackle sparse-view reconstruction of a 360 3D scene using priors from latent diffusion models (LDM). The sparse-view setting is ill-posed and underconstrained, especially for scenes where the camera rotates 360 degrees around a point, as no visual information is available beyond some frontal views focused on the central object(s) of interest. In this work, we show that pretrained 2D diffusion models can strongly improve the reconstruction of a scene with low-cost fine-tuning. Specifically, we present SparseSplat360 (Sp2360), a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views. Due to superior training and rendering speeds, we use an explicit scene representation in the form of 3D Gaussians over NeRF-based implicit representations. We propose an iterative update strategy to fuse generated pseudo novel views with existing 3D Gaussians fitted to the initial sparse inputs. As a result, we obtain a multi-view consistent scene representation with details coherent with the observed inputs. Our evaluation on the challenging Mip-NeRF360 dataset shows that our proposed 2D to 3D distillation algorithm considerably improves the performance of a regularized version of 3DGS adapted to a sparse-view setting and outperforms existing sparse-view reconstruction methods in 360 scene reconstruction. Qualitatively, our method generates entire 360 scenes from as few as 9 input views, with a high degree of foreground and background detail. Introduces SparseSplat360, a method for reconstructing 360° 3D scenes from sparse views using latent diffusion models to generate pseudo novel views. Sparse-view 3D reconstruction is challenging due to limited information and traditional methods struggle with artifacts and missing details. SparseSplat360 employs a two-step process using 2D diffusion models for in-painting missing regions and removing artifacts in rendered novel views. These improved views iteratively refine a 3D Gaussian representation of the scene. Outperforms existing sparse-view reconstruction methods in 360° scene reconstruction. Generates entire 360° scenes from as few as 9 input views with high detail. Significantly faster and more data-efficient than methods relying on large-scale 3D datasets. Limited by the accuracy of the initial sparse point cloud from SfM. Future work includes incorporating stronger geometry cues from 3D vision foundation models. 3d reconstruction, sparse view synthesis, diffusion models, generative priors, 3d gaussian splatting
2405.16504 Report A Unified Implicit Attention Formulation for Gated-Linear Recurrent Sequence Models Itamar Zimerman, Ameen Ali, Lior Wolf Recent advances in efficient sequence modeling have led to attention-free layers, such as Mamba, RWKV, and various gated RNNs, all featuring sub-quadratic complexity in sequence length and excellent scaling properties, enabling the construction of a new type of foundation models. In this paper, we present a unified view of these models, formulating such layers as implicit causal self-attention layers. The formulation includes most of their sub-components and is not limited to a specific part of the architecture. The framework compares the underlying mechanisms on similar grounds for different layers and provides a direct means for applying explainability methods. Our experiments show that our attention matrices and attribution method outperform an alternative and a more limited formulation that was recently proposed for Mamba. For the other architectures for which our method is the first to provide such a view, our method is effective and competitive in the relevant metrics compared to the results obtained by state-of-the-art transformer explainability methods. Our code is publicly available. This paper presents a unified view of attention-free sequence models like Mamba, RWKV, and Griffin as implicit causal self-attention layers, enabling explainability methods for these architectures. This unified view facilitates comparisons between transformer and non-transformer architectures and enables the development of new explainability and interpretability techniques for non-transformer models, crucial for understanding aspects like robustness, bias, and fairness. The authors mathematically formulate the layers of these models (Mamba, RWKV, Griffin) as data-controlled linear operators, effectively representing them as implicit attention mechanisms. This approach involves analyzing the token mixing components, incorporating elements like gate branches and convolutional layers. The implicit attention matrices derived from Mamba, Griffin, and RWKV exhibit patterns similar to traditional transformers, particularly in capturing long-range dependencies. The proposed attention representation leads to more accurate and interpretable attention maps compared to previous formulations, as demonstrated by visualization and superior performance in segmentation tests. Ablation studies confirm the importance of incorporating all architectural components (e.g., gate branches, convolutional layers) in the unified attention representation for optimal performance. The paper primarily focuses on Mamba, RWKV, and Griffin, with potential to extend the framework to other architectures like Hyena and HGRN2. Future work could explore how differences in these architectures are reflected in their self-attention matrices to reveal more about their inductive biases. self-attention, explainable ai (xai), sequence modeling, non-transformer architectures, mamba, rwkv, griffin
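The unified view rewrites a gated linear recurrence as implicit causal attention. The sketch below materializes that matrix for a single channel, A[t, j] = c_t * (prod_{k=j+1..t} a_k) * b_j, and checks it against the recurrence; it is a generic illustration of the formulation rather than a reproduction of any specific Mamba/RWKV/Griffin layer.

```python
import torch

def implicit_attention(a, b, c):
    """Materialize the causal attention matrix hidden in a gated linear
    recurrence  h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t  (one
    channel). Unrolling gives y_t = sum_{j<=t} c_t * (prod_{k=j+1..t} a_k)
    * b_j * x_j, so A[t, j] is that scalar weight.
    a, b, c: (T,) gate, input and output coefficients."""
    log_cum = torch.cumsum(torch.log(a), dim=0)           # log prod_{k<=t} a_k
    decay = torch.exp(log_cum[:, None] - log_cum[None, :])  # prod_{k=j+1..t} a_k
    A = c[:, None] * decay * b[None, :]
    return torch.tril(A)                                  # causal mask

T = 6
a = torch.sigmoid(torch.randn(T))      # forget gates in (0, 1)
b, c = torch.randn(T), torch.randn(T)
x = torch.randn(T)
A = implicit_attention(a, b, c)

# Check against the recurrence itself.
h, ys = torch.zeros(()), []
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    ys.append(c[t] * h)
print(torch.allclose(A @ x, torch.stack(ys), atol=1e-5))   # True
```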
2405.16501 Report User-Friendly Customized Generation with Multi-Modal Prompts Linhao Zhong, Yan Hong, Wentao Chen, Binglin Zhou, Yiyi Zhang, Jianfu Zhang, Liqing Zhang Text-to-image generation models have seen considerable advancement, catering to the increasing interest in personalized image creation. Current customization techniques often necessitate users to provide multiple images (typically 3-5) for each customized object, along with the classification of these objects and descriptive textual prompts for scenes. This paper questions whether the process can be made more user-friendly and the customization more intricate. We propose a method where users need only provide images along with text for each customization topic, and necessitates only a single image per visual concept. We introduce the concept of a "multi-modal prompt", a novel integration of text and images tailored to each customization concept, which simplifies user interaction and facilitates precise customization of both objects and scenes. Our proposed paradigm for customized text-to-image generation surpasses existing finetune-based methods in user-friendliness and the ability to customize complex objects with user-friendly inputs. Our code is available at https://github.com/zhongzero/Multi-Modal-Prompt. This paper proposes a user-friendly paradigm for customized text-to-image generation that simplifies user interaction by requiring only a single image per visual concept and accompanying text. Existing methods often need multiple images per concept and struggle to capture intricate details of complex objects. This paradigm addresses these limitations by enhancing user-friendliness and customization granularity. The method leverages a two-stage process: 1) extracting descriptions of main objects from user-provided images using BLIP for image captioning and ChatGPT for semantic analysis, and 2) finetuning a diffusion model with these descriptions to enable customized image generation based on user prompts. The proposed paradigm outperforms existing methods in detailed customization of complex objects, as evidenced by qualitative comparisons. Quantitative evaluations using DINO score, CLIP-I score, and CLIP-T score demonstrate the superior performance of the paradigm in both image and text alignment. Human preference studies confirm that users prefer the proposed method over traditional approaches for both image and text alignment. The current implementation shows limitations in handling multi-image scenarios due to constraints of existing stable diffusion models. The current definition of multi-modal prompts is restricted to customizing main objects, limiting broader semantic understanding and customization. text-to-image generation, image customization, multi-modal prompts, diffusion models, user-friendly interface
2405.16470 Report Image Deraining with Frequency-Enhanced State Space Model Shugo Yamashita, Masaaki Ikehara Removing rain artifacts in images is recognized as a significant issue. In this field, deep learning-based approaches, such as convolutional neural networks (CNNs) and Transformers, have succeeded. Recently, State Space Models (SSMs) have exhibited superior performance across various tasks in both natural language processing and image processing due to their ability to model long-range dependencies. This study introduces SSM to rain removal and proposes a Deraining Frequency-Enhanced State Space Model (DFSSM). To effectively remove rain streaks, which produce high-intensity frequency components in specific directions, we employ frequency domain processing concurrently with SSM. Additionally, we develop a novel mixed-scale gated-convolutional block, which uses convolutions with multiple kernel sizes to capture various scale degradations effectively and integrates a gating mechanism to manage the flow of information. Finally, experiments on synthetic and real-world rainy image datasets show that our method surpasses state-of-the-art methods. This paper proposes DFSSM, a novel deraining model based on State Space Models (SSMs) that effectively removes rain artifacts from images by incorporating frequency domain processing. Rain artifacts in images can severely degrade the performance of vision-based systems. Removing these artifacts is crucial for improving the quality and reliability of such systems. The DFSSM leverages SSMs to capture long-range dependencies and employs a Frequency-Enhanced State Space Block (FSSB) for efficient rain streak removal. It also introduces a Mixed-Scale Gated-Convolutional Block (MGCB) to handle various scales of rain degradations and manage the flow of information within the network. The model is trained with L1 loss and Frequency Reconstruction loss. DFSSM outperforms state-of-the-art deraining methods on both synthetic (Rain200H, Rain200L) and real-world (SPA-Data) datasets. Frequency domain processing through FFTM and the use of MGCB are shown to be effective for rain removal. Ablation studies demonstrate the contribution of each component in DFSSM to the overall performance gain. The inference time of DFSSM is currently slower than some compared Transformer-based methods, potentially due to the lack of optimized implementation for SSMs. Future work could focus on further improving the model efficiency and exploring the application of DFSSM in video deraining tasks. image deraining, state space models, frequency domain processing, deep learning, computer vision
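A minimal frequency-domain branch of the kind the deraining model combines with its state-space path: FFT the features, reweight the spectrum with learnable per-frequency filters (rain streaks concentrate energy in oriented high-frequency bands), inverse FFT, residual add. The channel count and purely multiplicative filter are illustrative choices, not the paper's FSSB/FFTM layout.

```python
import torch
import torch.nn as nn

class FrequencyEnhancedBlock(nn.Module):
    """Minimal frequency-domain branch for deraining-style models: real FFT,
    learnable per-frequency complex modulation, inverse FFT, residual add.
    Illustrative stand-in; the paper pairs such a branch with a state space
    model branch inside its Frequency-Enhanced State Space Block."""
    def __init__(self, channels, h, w):
        super().__init__()
        # complex-valued multiplicative filter, stored as two real channels
        self.weight = nn.Parameter(torch.ones(channels, h, w // 2 + 1, 2))

    def forward(self, x):                      # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        w = torch.view_as_complex(self.weight)
        out = torch.fft.irfft2(spec * w, s=x.shape[-2:], norm="ortho")
        return out + x                         # residual connection

block = FrequencyEnhancedBlock(channels=32, h=64, w=64)
print(block(torch.randn(2, 32, 64, 64)).shape)   # torch.Size([2, 32, 64, 64])
```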
2405.16401 Report Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning Neha Kalibhat, Priyatham Kattakinda, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%). This paper proposes using semantically meaningful visual tokens, extracted from off-the-shelf segmentation and scene-graph models, to improve representation learning in vision transformers. The authors hypothesize that the standard practice of patchifying images into uniformly-sized chunks limits the model's ability to learn comprehensive and compositional representations. The authors extract tangible tokens (instance segmentation masks) and intangible tokens (relationships and actions) using SEEM and RAM. They pre-train a vision transformer by incorporating these tokens and aligning the resulting embeddings with caption embeddings from a text-side encoder. Additive attention weights, based on structural and semantic relationships, are introduced to enhance representation learning. The proposed method achieves a 47% improvement in text-to-image retrieval accuracy over a standard ViT and 9% over a fine-tuned CLIP model on COCO. The learned representations show improved compositional reasoning capabilities, outperforming a ViT by 18% on the ARO benchmark and 10% on the Winoground benchmark. Using additive attention based on semantic relationships and relative positions further enhances performance on compositionality benchmarks. Pre-processing images to extract tokens introduces computational and memory overhead. The scalability of the approach to larger datasets and more complex scenes needs further investigation. vision transformers, tokenization, semantic segmentation, scene graphs, compositional reasoning
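The additive attention weights amount to adding a pairwise bias to the pre-softmax scores, softmax(Q K^T / sqrt(d) + B) V. A minimal sketch with random placeholder bias values; in the paper the bias is derived from token-to-token structural and semantic relations among the tangible and intangible tokens.

```python
import torch
import torch.nn.functional as F

def attention_with_additive_bias(q, k, v, bias):
    """Self-attention over semantically meaningful visual tokens with an
    additive bias encoding pairwise structure (e.g., relative position of
    segmentation masks or scene-graph relations)."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias
    return F.softmax(scores, dim=-1) @ v

n_tokens, dim = 12, 64                      # e.g., instance masks + relation tokens
q = k = v = torch.randn(1, n_tokens, dim)
bias = torch.randn(1, n_tokens, n_tokens)   # placeholder pairwise weights
print(attention_with_additive_bias(q, k, v, bias).shape)  # torch.Size([1, 12, 64])
```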
2405.16393 Report Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often dynamically adjust in harmony with foreground movements, eschewing stagnancy. We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. Human figures are animated leveraging pose-based motion, capturing intricate actions. Conversely, for backgrounds, we employ sparse tracking points to model motion, thereby reflecting the natural interaction between foreground activity and environmental changes. Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy, introducing global features at each step. To ensure seamless continuity across these segments, we ingeniously link the final frame of a produced clip with input noise to spawn the succeeding one, maintaining narrative flow. Throughout the sequential generation process, we infuse the feature representation of the initial reference image into the network, effectively curtailing any cumulative color inconsistencies that may otherwise arise. Empirical evaluations attest to the superiority of our method in producing videos that exhibit harmonious interplay between foreground actions and responsive background dynamics, surpassing prior methodologies in this regard. This paper proposes a novel video generation method that decouples foreground and background motion representation, enabling the generation of videos with dynamic backgrounds, unlike previous methods that mainly focused on animating foreground figures against static backgrounds. Most existing human video synthesis methods generate videos with static backgrounds, which contradicts real-world scenarios where backgrounds are often dynamic. This limits the realism of generated videos. The proposed method utilizes pose estimation to capture foreground (human) motion and sparse tracking points to model background motion. It employs a clip-by-clip generation strategy with condition concatenation and global feature extraction to generate longer videos without accumulating errors. The method successfully generates realistic human videos with natural foreground motion and believable background dynamics, outperforming previous state-of-the-art methods on benchmark datasets. Qualitative and quantitative evaluations demonstrate the superior performance of the proposed method in terms of visual quality, motion fidelity, and temporal coherence. Ablation studies confirm the effectiveness of each proposed component, including foreground and background motion representation, condition concatenation, and global feature extraction. The method's performance depends on the accuracy of the pose estimation and tracking point extraction techniques used. The use of sparse tracking points may not capture the full complexity of background motion. Increasing the number of tracking points could improve this but at a computational cost. video generation, diffusion models, motion representation, dynamic backgrounds, long video synthesis
2405.16341 Report R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model Changhoon Kim, Kyle Min, Yezhou Yang In the evolving landscape of text-to-image (T2I) diffusion models, the remarkable capability to generate high-quality images from textual descriptions faces challenges with the potential misuse of reproducing sensitive content. To address this critical issue, we introduce Robust Adversarial Concept Erase (RACE), a novel approach designed to mitigate these risks by enhancing the robustness of concept erasure method for T2I models. RACE utilizes a sophisticated adversarial training framework to identify and mitigate adversarial text embeddings, significantly reducing the Attack Success Rate (ASR). Impressively, RACE achieves a 30 percentage point reduction in ASR for the ``nudity'' concept against the leading white-box attack method. Our extensive evaluations demonstrate RACE's effectiveness in defending against both white-box and black-box attacks, marking a significant advancement in protecting T2I diffusion models from generating inappropriate or misleading imagery. This work underlines the essential need for proactive defense measures in adapting to the rapidly advancing field of adversarial challenges. The paper introduces RACE (Robust Adversarial Concept Erase), a novel method to enhance the robustness of concept erasure in text-to-image diffusion models against adversarial attacks aiming to regenerate erased content. Existing concept erasure techniques, while effective in removing sensitive content, are vulnerable to red-teaming attacks that can reconstruct the erased concepts using cleverly designed prompts. This poses risks of misuse and necessitates more robust erasure methods. RACE leverages an adversarial training framework that identifies adversarial text embeddings capable of reconstructing erased concepts. It efficiently uncovers these embeddings within a single timestep of the diffusion process and integrates them into the concept erasure workflow, enhancing the model's resilience against attacks. RACE significantly reduces the Attack Success Rate (ASR) against both white-box and black-box attacks targeting various concepts, including artistic styles, explicit content, and objects. For instance, RACE achieves over a 30% reduction in ASR for the 'nudity' concept against the leading white-box attack method. RACE exhibits disentanglement capabilities, effectively erasing target concepts while minimizing the impact on the generation of other unrelated concepts. There's a trade-off observed between enhancing robustness and maintaining image quality, particularly noticeable when erasing concepts beyond artistic styles. The selection of representative keywords for concept erasure significantly influences the effectiveness of the method, as highlighted by the challenges in erasing 'violence' and 'illegal act' content. text-to-image synthesis, concept erasure, adversarial training, diffusion models, robustness
2405.16287 Report LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters Xinyu Zhou, Boris Knyazev, Alexia Jolicoeur-Martineau, Jie Fu A good initialization of deep learning models is essential since it can help them converge better and faster. However, pretraining large models is unaffordable for many researchers, which makes a desired prediction for initial parameters more necessary nowadays. Graph HyperNetworks (GHNs), one approach to predicting model parameters, have recently shown strong performance in initializing large vision models. Unfortunately, predicting parameters of very wide networks relies on copying small chunks of parameters multiple times and requires an extremely large number of parameters to support full prediction, which greatly hinders its adoption in practice. To address this limitation, we propose LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that expands to significantly wider networks without requiring as excessive increase of parameters as in previous attempts. LoGAH allows us to predict the parameters of 774-million large neural networks in a memory-efficient manner. We show that vision and language models (i.e., ViT and GPT-2) initialized with LoGAH achieve better performance than those initialized randomly or using existing hypernetworks. Furthermore, we show promising transfer learning results w.r.t. training LoGAH on small datasets and using the predicted parameters to initialize for larger tasks. We provide the codes in https://github.com/Blackzxy/LoGAH . This paper proposes LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that expands to significantly wider networks without requiring as excessive increase of parameters as in previous attempts. A good initialization of deep learning models is essential, but pretraining large models is unaffordable for many researchers. Existing GHNs have limitations in predicting parameters of very wide networks. The paper introduces LoGAH, a novel low-rank parameter decoder that reduces the number of parameters required for prediction. It also creates new datasets, ViTs-1K and GPTs-1K, containing diverse ViT-style and GPT-2-style computational graphs, respectively. LoGAH outperforms GHN-3 and random initialization in initializing ViT and GPT-2 models, achieving better performance on CIFAR, ImageNet, and WikiText datasets. Increasing the meta-batch size during training can improve LoGAH performance significantly. LoGAH demonstrates promising transfer learning ability, showing good performance when trained on a smaller dataset and used for initializing larger tasks. The GPT-2 experiments are limited to the WikiText dataset and smaller LoGAH models due to time and resource constraints. Training on larger datasets and exploring LoGAH's capability on modern LLMs is left for future work. graph hypernetworks, parameter prediction, model initialization, vision transformers, gpt-2
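The key idea of the low-rank decoder can be sketched as predicting each target weight matrix as a product of two thin factors, so decoder size scales with rank * (out + in) rather than out * in. The interface below (per-node embedding in, sliced low-rank factors out) is a guess at a minimal version of that idea, not the LoGAH decoder itself.

```python
import torch
import torch.nn as nn

class LowRankParamDecoder(nn.Module):
    """Decode a target layer's weight matrix from a graph-node embedding as
    a low-rank product W = A @ B, so the decoder head grows with
    rank * (max_out + max_in) instead of max_out * max_in. Minimal sketch;
    the real model predicts parameters for every node of a ViT/GPT-2
    computation graph produced by a graph hypernetwork."""
    def __init__(self, embed_dim, max_out, max_in, rank=32):
        super().__init__()
        self.rank = rank
        self.max_out, self.max_in = max_out, max_in
        self.to_A = nn.Linear(embed_dim, max_out * rank)
        self.to_B = nn.Linear(embed_dim, rank * max_in)

    def forward(self, node_emb, out_dim, in_dim):
        A = self.to_A(node_emb).view(self.max_out, self.rank)[:out_dim]
        B = self.to_B(node_emb).view(self.rank, self.max_in)[:, :in_dim]
        return A @ B                       # (out_dim, in_dim) predicted weight

decoder = LowRankParamDecoder(embed_dim=128, max_out=4096, max_in=4096, rank=32)
w = decoder(torch.randn(128), out_dim=3072, in_dim=768)   # e.g., a GPT-2 MLP weight
print(w.shape)   # torch.Size([3072, 768])
```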
2405.16260 Report Enhancing Consistency-Based Image Generation via Adversarialy-Trained Classification and Energy-Based Discrimination Shelly Golan, Roy Ganz, Michael Elad The recently introduced Consistency models pose an efficient alternative to diffusion algorithms, enabling rapid and good quality image synthesis. These methods overcome the slowness of diffusion models by directly mapping noise to data, while maintaining a (relatively) simpler training. Consistency models enable a fast one- or few-step generation, but they typically fall somewhat short in sample quality when compared to their diffusion origins. In this work we propose a novel and highly effective technique for post-processing Consistency-based generated images, enhancing their perceptual quality. Our approach utilizes a joint classifier-discriminator model, in which both portions are trained adversarially. While the classifier aims to grade an image based on its assignment to a designated class, the discriminator portion of the very same network leverages the softmax values to assess the proximity of the input image to the targeted data manifold, thereby serving as an Energy-based Model. By employing example-specific projected gradient iterations under the guidance of this joint machine, we refine synthesized images and achieve an improved FID scores on the ImageNet 64x64 dataset for both Consistency-Training and Consistency-Distillation techniques. This paper introduces a novel post-processing technique to enhance the perceptual quality of images generated by Consistency models using a joint classifier-discriminator network. Consistency models offer fast image synthesis but often lack the quality of diffusion models. This method bridges this quality gap without extensive retraining. A joint classifier-discriminator is adversarially trained on both real and synthetic images. This model then guides the refinement of generated images using projected gradient iterations, aiming to align them with both a target class and the real data manifold. The method significantly improves FID scores on ImageNet 64x64 for both Consistency-Training (27.48% boost) and Consistency-Distillation (20.96% boost). The joint classifier-discriminator proves more effective than using a robust classifier alone (BIGROC), showing an additional 11.2% FID improvement. Preliminary results suggest the method's generalizability to other generative models beyond Consistency models. The study is limited by the capabilities of the chosen RN50 architecture for the joint model. Training relies solely on Consistency-generated images, limiting its generalization potential. Future work could explore diverse datasets with different generative models. image synthesis, consistency models, perceptual quality, adversarial training, energy-based models
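A hedged sketch of the refinement procedure described above: the softmax normalizer (logsumexp of the logits) of a joint classifier-discriminator is treated as a negative energy, and generated images are nudged by a few projected gradient steps toward the target class and lower energy. The toy classifier, step sizes, and L_inf projection radius below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

clf = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
                    nn.Linear(256, 10))       # toy joint classifier-discriminator

x = torch.rand(2, 3, 64, 64)                  # images from a consistency model
x0 = x.clone()
target = torch.tensor([3, 7])                 # desired class labels
eps, step, n_iter = 8 / 255, 2 / 255, 10      # PGD-style refinement budget

for _ in range(n_iter):
    x = x.detach().requires_grad_(True)
    logits = clf(x)
    cls_term = logits.gather(1, target[:, None]).sum()   # push toward the target class
    energy = -torch.logsumexp(logits, dim=1).sum()        # low energy ~ near the data manifold
    (cls_term - energy).backward()                        # maximize class logit, minimize energy
    with torch.no_grad():
        x = x + step * x.grad.sign()
        x = x0 + (x - x0).clamp(-eps, eps)                # project back to an L_inf ball
        x = x.clamp(0, 1)

print("refined batch:", x.shape)
```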
2405.16098 Report Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion Zizhao Hu, Mohammad Rostami The Transformer architecture has dominated machine learning in a wide range of tasks. The specific characteristic of this architecture is an expensive scaled dot-product attention mechanism that models the inter-token interactions, which is known to be the reason behind its success. However, such a mechanism does not have a direct parallel to the human brain which brings the question if the scaled-dot product is necessary for intelligence with strong expressive power. Inspired by the lateralization of the human brain, we propose a new simple but effective architecture called the Lateralization MLP (L-MLP). Stacking L-MLP blocks can generate complex architectures. Each L-MLP block is based on a multi-layer perceptron (MLP) that permutes data dimensions, processes each dimension in parallel, merges them, and finally passes through a joint MLP. We discover that this specific design outperforms other MLP variants and performs comparably to a transformer-based architecture in the challenging diffusion task while being highly efficient. We conduct experiments using text-to-image generation tasks to demonstrate the effectiveness and efficiency of L-MLP. Further, we look into the model behavior and discover a connection to the function of the human brain. Our code is publicly available: \url{https://github.com/zizhao-hu/L-MLP} This paper proposes L-MLP, a novel MLP-based architecture for vision tasks inspired by the functional lateralization of the human brain. The dominant Transformer architecture, while effective, lacks a direct parallel in the human brain and relies on computationally expensive attention mechanisms. L-MLP offers a simpler, more brain-inspired, and computationally efficient alternative. L-MLP leverages a two-stage processing approach with dimension permutation, separate normalization and transformations for different dimensions, merging of processed features, and residual connections. The authors demonstrate the architecture's effectiveness on a challenging text-to-image diffusion task. L-MLP achieves comparable image generation quality to Transformer-based models on the MS-COCO dataset, achieving an FID score of 8.62. The architecture demonstrates superior computational efficiency, with faster training and inference speeds compared to Transformers. Analysis of L-MLP reveals functional lateralization within the network during training, mimicking the behavior of the human brain. L-MLP still exhibits an expressive gap compared to Transformer-based models, potentially due to the absence of higher-order interactions present in attention mechanisms. The current design's quadratic scaling to sequence length limits its application in natural language processing tasks requiring the handling of long sequences. mlp, vision transformer, diffusion models, text-to-image generation, brain-inspired ai
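The block structure described in the How field above — permute dimensions, process the token and channel axes with parallel MLPs, merge, then apply a joint MLP — can be sketched roughly as below (PyTorch; layer sizes, normalization placement, and residual wiring are guesses rather than the paper's exact design).

```python
import torch
import torch.nn as nn

class LMLPBlock(nn.Module):
    """Sketch of an L-MLP-style block: two parallel 'hemispheres' -- one MLP acting along
    the token axis (after a permute) and one along the channel axis -- whose outputs are
    merged and passed through a joint MLP."""
    def __init__(self, n_tokens, dim, hidden=256):
        super().__init__()
        self.norm_tok = nn.LayerNorm(n_tokens)
        self.norm_ch = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(n_tokens, hidden), nn.GELU(),
                                       nn.Linear(hidden, n_tokens))
        self.channel_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                         nn.Linear(hidden, dim))
        self.joint = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                   nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                       # x: (batch, n_tokens, dim)
        tok = self.token_mlp(self.norm_tok(x.transpose(1, 2))).transpose(1, 2)
        ch = self.channel_mlp(self.norm_ch(x))
        x = x + tok + ch                        # merge the two branches (residual)
        return x + self.joint(x)                # joint MLP with residual

x = torch.randn(2, 64, 128)                     # e.g. 64 latent patches, 128 channels
print(LMLPBlock(n_tokens=64, dim=128)(x).shape)
```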
2405.16034 Report DiffuBox: Refining 3D Object Detection with Point Diffusion Xiangyu Chen, Zhenzhen Liu, Katie Z Luo, Siddhartha Datta, Adhitya Polavaram, Yan Wang, Yurong You, Boyi Li, Marco Pavone, Wei-Lun Chao, Mark Campbell, Bharath Hariharan, Kilian Q. Weinberger Ensuring robust 3D object detection and localization is crucial for many applications in robotics and autonomous driving. Recent models, however, face difficulties in maintaining high performance when applied to domains with differing sensor setups or geographic locations, often resulting in poor localization accuracy due to domain shift. To overcome this challenge, we introduce a novel diffusion-based box refinement approach. This method employs a domain-agnostic diffusion model, conditioned on the LiDAR points surrounding a coarse bounding box, to simultaneously refine the box's location, size, and orientation. We evaluate this approach under various domain adaptation settings, and our results reveal significant improvements across different datasets, object classes and detectors. This paper introduces a novel diffusion-based box refinement approach for domain adaptation in 3D object detection, which refines bounding box location, size, and orientation using a domain-agnostic diffusion model conditioned on LiDAR points. Robust 3D object detection is crucial for robotics and autonomous driving, but existing models struggle with domain shift. This method addresses the challenge of maintaining high performance across domains with different sensor setups or geographic locations. The method leverages a point cloud diffusion model trained on a normalized box view (NBV) to learn the scale-invariant distribution of points relative to object bounding boxes. This allows for refining noisy bounding box proposals from object detectors without retraining. The method significantly improves mAP performance (up to 24 mAP) across different datasets, object classes, and detectors. It shows particularly strong improvements in near-range box refinement where point density is higher. The approach complements existing domain adaptation methods and further improves their performance. The method currently doesn't address false negatives in object detection. Future work could explore incorporating exploration strategies or distilling detectors to handle false negatives. 3d object detection, domain adaptation, diffusion models, lidar point clouds, autonomous driving
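A small NumPy sketch of the normalized box view idea: LiDAR points around a coarse proposal are expressed in the box's own frame and divided by the box dimensions, giving the diffusion model a scale- and domain-invariant input. The function and variable names are hypothetical.

```python
import numpy as np

def to_normalized_box_view(points, center, size, yaw):
    """Express LiDAR points in the frame of a (coarse) box proposal: translate to the
    box center, rotate by -yaw, and divide by the box dimensions so points on the box
    surface land near +/-0.5 on each axis."""
    c, s = np.cos(-yaw), np.sin(-yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = (points - center) @ R.T
    return local / np.asarray(size)              # scale-invariant coordinates

# Toy example: points around a 4m x 2m x 1.5m box at (10, 5, 0) rotated by 30 degrees.
rng = np.random.default_rng(0)
center, size, yaw = np.array([10.0, 5.0, 0.0]), np.array([4.0, 2.0, 1.5]), np.deg2rad(30)
pts = center + rng.normal(scale=1.0, size=(100, 3))
nbv = to_normalized_box_view(pts, center, size, yaw)
print(nbv.mean(axis=0), nbv.std(axis=0))         # roughly centred, unit-free coordinates

# A diffusion model trained on such normalized views can then denoise the box
# parameters (offset, size, yaw) regardless of the source domain's object scale.
```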
2405.16009 Report Streaming Long Video Understanding with Large Language Models Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding, that capably understands arbitrary-length video with a constant number of video tokens streamingly encoded and adaptively selected. The challenge of video understanding in the vision language area mainly lies in the significant computational burden caused by the great number of tokens extracted from long videos. Previous works rely on sparse sampling or frame compression to reduce tokens. However, such approaches either disregard temporal information in a long time span or sacrifice spatial details, resulting in flawed compression. To address these limitations, our VideoStreaming has two core designs: Memory-Propagated Streaming Encoding and Adaptive Memory Selection. The Memory-Propagated Streaming Encoding architecture segments long videos into short clips and sequentially encodes each clip with a propagated memory. In each iteration, we utilize the encoded results of the preceding clip as historical memory, which is integrated with the current clip to distill a condensed representation that encapsulates the video content up to the current timestamp. After the encoding process, the Adaptive Memory Selection strategy selects a constant number of question-related memories from all the historical memories and feeds them into the LLM to generate informative responses. The question-related selection reduces redundancy within the memories, enabling efficient and precise video understanding. Meanwhile, the disentangled video extraction and reasoning design allows the LLM to answer different questions about a video by directly selecting corresponding memories, without the need to encode the whole video for each question. Our model achieves superior performance and higher efficiency on long video benchmarks, showcasing precise temporal comprehension for detailed question answering. This paper proposes VideoStreaming, a novel Vision-Language Large Model (VLLM) that understands arbitrarily long videos efficiently using a fixed number of video tokens by streamingly encoding and adaptively selecting memories. Understanding long videos is a challenge for VLLMs due to the high computational cost and potential information loss from large token sequences extracted from videos. Existing methods based on sparse sampling or frame compression fail to fully capture temporal dynamics or require recomputation for different queries. VideoStreaming introduces two core designs: (1) Memory-Propagated Streaming Encoding divides a long video into short clips and encodes each clip sequentially using a small language model, incorporating historical information from the previous clip. (2) Adaptive Memory Selection uses a question-related indicator to select a fixed number of most relevant memories from all encoded memories, reducing redundancy and enabling precise video understanding. VideoStreaming outperforms existing methods on various long video benchmarks, including VideoChatGPT, EgoSchema, Next-QA, Next-GQA, MovieChat-1K, and MovieNet-QA. The model exhibits superior temporal understanding, evidenced by its high performance on tasks requiring precise temporal grounding. VideoStreaming achieves high inference efficiency by significantly reducing the number of tokens fed into the LLM compared to existing methods. The current uniform frame sampling strategy could be improved by considering the information density of different video segments. Exploration of adaptive segmentation techniques that dynamically adjust clip lengths based on video content complexity is a promising direction for future work. vision-language model, long video understanding, video question answering, temporal grounding, memory-propagated streaming encoding, adaptive memory selection
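A toy sketch of the two designs summarized above, with a GRU standing in for the small clip encoder: each clip is encoded together with the previous clip's condensed memory, and a fixed number of question-related memories is then selected by similarity. All sizes and names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, mem_len = 64, 4                            # feature dim, condensed tokens per clip
clip_encoder = torch.nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)

def encode_clip(clip_feats, prev_memory):
    """Memory-propagated streaming encoding (sketch): prepend the previous clip's
    condensed memory to the current clip's frame features, run a small sequence model,
    and keep the last `mem_len` states as the new memory."""
    seq = torch.cat([prev_memory, clip_feats], dim=1)      # (1, mem_len + T, dim)
    out, _ = clip_encoder(seq)
    return out[:, -mem_len:, :]                            # condensed memory for this clip

video = torch.randn(1, 80, dim)                            # 80 frames of visual features
memory = torch.zeros(1, mem_len, dim)
memories = []
for clip in video.split(16, dim=1):                        # stream 16-frame clips
    memory = encode_clip(clip, memory)
    memories.append(memory)
memories = torch.cat(memories, dim=1)                      # (1, 5 * mem_len, dim)

# Adaptive memory selection: keep a constant number of question-related memories.
question = torch.randn(1, dim)
scores = F.cosine_similarity(memories, question[:, None, :], dim=-1)   # (1, n_mem)
topk = scores.topk(k=8, dim=1).indices
selected = memories.gather(1, topk[..., None].expand(-1, -1, dim))     # fed to the LLM
print(selected.shape)                                                  # torch.Size([1, 8, 64])
```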
2405.16005 Report PTQ4DiT: Post-training Quantization for Diffusion Transformers Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, Yan Yan The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's $\rho$-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time. This paper proposes PTQ4DiT, a novel post-training quantization method specifically designed for Diffusion Transformers (DiTs) that effectively reduces their computational complexity while maintaining high-quality image generation. Diffusion Transformers (DiTs) have shown exceptional image generation capabilities but their high computational cost at inference hinders their deployment in real-time applications. PTQ4DiT addresses this by enabling efficient inference through quantization without the need for costly retraining. PTQ4DiT tackles the challenges of salient channels and temporal variation in DiTs by introducing: (1) Channel-wise Salience Balancing (CSB) to redistribute extreme magnitudes in activation and weight channels, and (2) Spearman's rho-guided Salience Calibration (SSC) to dynamically adjust salience evaluations across different timesteps. Additionally, a re-parameterization scheme ensures efficient inference by pre-integrating balancing matrices. PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving generation quality comparable to the full-precision models. The method enables effective quantization to 4-bit weight precision (W4A8) for the first time, achieving significantly better performance than existing PTQ methods. Ablation studies validate the effectiveness of the proposed CSB and SSC components in improving the quantization performance. The research currently focuses on visual generation. The ethical considerations of potential misuse of generative models are acknowledged but not fully addressed in the scope of this work. diffusion models, transformers, model quantization, image generation, real-time applications
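In the spirit of channel-wise salience balancing, the sketch below rescales each input channel so that extreme activation magnitudes are partially migrated into the weights — an exactly equivalent reparameterization of the linear layer that can be folded in offline — and shows its effect on fake-quantization error. This is a minimal illustration of the general idea, not the paper's formulation (which additionally calibrates the balancing across timesteps).

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 64)
X[:, 7] *= 30.0                               # a salient activation channel with extreme magnitude
W = torch.randn(128, 64) * 0.05               # linear layer weight (out_features x in_features)

# Per-input-channel salience: balance activation vs. weight magnitudes.
a_max = X.abs().amax(dim=0)                   # (64,) activation magnitude per channel
w_max = W.abs().amax(dim=0)                   # (64,) weight magnitude per input channel
s = (a_max / w_max).sqrt().clamp(min=1e-5)    # balancing factors

X_bal, W_bal = X / s, W * s                   # exactly preserves X @ W.T

def fake_quant(t, bits=8):
    """Symmetric per-tensor fake quantization."""
    scale = t.abs().max() / (2 ** (bits - 1) - 1)
    return (t / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

ref = X @ W.T
err_plain = (fake_quant(X) @ fake_quant(W).T - ref).abs().mean()
err_bal = (fake_quant(X_bal) @ fake_quant(W_bal).T - ref).abs().mean()
print(f"W8A8 error without balancing: {err_plain:.4f}, with balancing: {err_bal:.4f}")
# The scaling can be absorbed into adjacent layers offline, so inference cost is
# unchanged -- the role played by the re-parameterization step described above.
```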
2405.15914 Report ExactDreamer: High-Fidelity Text-to-3D Content Creation via Exact Score Matching Yumin Zhang, Xingyu Miao, Haoran Duan, Bo Wei, Tejal Shah, Yang Long, Rajiv Ranjan Text-to-3D content creation is a rapidly evolving research area. Given the scarcity of 3D data, current approaches often adapt pre-trained 2D diffusion models for 3D synthesis. Among these approaches, Score Distillation Sampling (SDS) has been widely adopted. However, the issue of over-smoothing poses a significant limitation on the high-fidelity generation of 3D models. To address this challenge, LucidDreamer replaces the Denoising Diffusion Probabilistic Model (DDPM) in SDS with the Denoising Diffusion Implicit Model (DDIM) to construct Interval Score Matching (ISM). However, ISM inevitably inherits inconsistencies from DDIM, causing reconstruction errors during the DDIM inversion process. This results in poor performance in the detailed generation of 3D objects and loss of content. To alleviate these problems, we propose a novel method named Exact Score Matching (ESM). Specifically, ESM leverages auxiliary variables to mathematically guarantee exact recovery in the DDIM reverse process. Furthermore, to effectively capture the dynamic changes of the original and auxiliary variables, the LoRA of a pre-trained diffusion model implements these exact paths. Extensive experiments demonstrate the effectiveness of ESM in text-to-3D generation, particularly highlighting its superiority in detailed generation. This paper proposes Exact Score Matching (ESM), a novel text-to-3D generation method that improves consistency and detail by addressing limitations of the DDIM inversion process in Interval Score Matching (ISM). Over-smoothing in existing text-to-3D methods hinders the generation of detailed, high-fidelity 3D models. This paper addresses this by mitigating inconsistencies in the DDIM inversion process used in ISM. ESM introduces auxiliary noise variables to construct an exact recovery path during DDIM inversion. It leverages LoRA to adapt a pre-trained 2D diffusion model, effectively capturing the dynamic changes of original and auxiliary noise variables. ESM generates high-fidelity 3D models consistent with given text prompts. Qualitative comparisons show ESM surpasses existing methods in detail generation, particularly in complex geometries and textures. Experiments demonstrate the impact of hyperparameters like mixture ratio and step sizes on generation quality. The method can exhibit unstable generation in some cases. Generation quality is sensitive to hyperparameter tuning. text-to-3d generation, diffusion models, score distillation sampling, denoising diffusion implicit models, exact score matching
2405.15891 Report Score Distillation via Reparametrized DDIM Artem Lukoianov, Haitz Sáez de Ocáriz Borde, Kristjan Greenewald, Vitor Campagnolo Guizilini, Timur Bagautdinov, Vincent Sitzmann, Justin Solomon While 2D diffusion models generate realistic, high-detail images, 3D shape generation methods like Score Distillation Sampling (SDS) built on these 2D diffusion models produce cartoon-like, over-smoothed shapes. To help explain this discrepancy, we show that the image guidance used in Score Distillation can be understood as the velocity field of a 2D denoising generative process, up to the choice of a noise term. In particular, after a change of variables, SDS resembles a high-variance version of Denoising Diffusion Implicit Models (DDIM) with a differently-sampled noise term: SDS introduces noise i.i.d. randomly at each step, while DDIM infers it from the previous noise predictions. This excessive variance can lead to over-smoothing and unrealistic outputs. We show that a better noise approximation can be recovered by inverting DDIM in each SDS update step. This modification makes SDS's generative process for 2D images almost identical to DDIM. In 3D, it removes over-smoothing, preserves higher-frequency detail, and brings the generation quality closer to that of 2D samplers. Experimentally, our method achieves better or similar 3D generation quality compared to other state-of-the-art Score Distillation methods, all without training additional neural networks or multi-view supervision, and providing useful insights into relationship between 2D and 3D asset generation with diffusion models. This paper proposes Score Distillation via Inversion (SDI), a method for 3D shape generation that improves upon Score Distillation Sampling (SDS) by addressing the discrepancy in quality between 2D and 3D generation with diffusion models. While 2D diffusion models excel at generating realistic images, 3D shape generation methods like SDS often produce over-smoothed and less detailed results. This paper aims to bridge this quality gap. The paper analyzes the SDS algorithm and reveals its connection to DDIM. It then proposes replacing the random noise sampling in SDS with DDIM inversion to improve noise estimation and generation quality. SDI generates 3D objects with significantly higher fidelity and detail compared to SDS, closing the quality gap to 2D diffusion models. The paper provides theoretical insights into the relationship between SDS and DDIM, showing that SDS can be interpreted as a high-variance version of DDIM. Through experiments and ablations, the authors demonstrate the effectiveness of DDIM inversion for noise estimation in SDS and show SDI achieves comparable or better results to state-of-the-art methods without additional training or complex pipelines. The paper identifies limitations related to 3D consistency and content drift between views, suggesting future work on incorporating depth or normal estimation and stronger view conditioning. Another limitation stems from the algorithm inheriting biases and limitations present in the underlying 2D diffusion model, such as generating unrealistic features or skewed distributions. 3d shape generation, diffusion models, score distillation, ddim inversion, nerf
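The substitution at the heart of the method — inferring the noise for the score-distillation update via DDIM inversion of the current rendering instead of drawing it i.i.d. — can be sketched as below. The noise predictor, schedule, and flattened "rendering" are toy stand-ins; only the inversion-then-gradient structure is the point.

```python
import torch

torch.manual_seed(0)

# Dummy conditional noise predictor standing in for a pretrained 2D diffusion model.
def eps_pred(x_t, t):
    return 0.1 * x_t + 0.01 * t                 # placeholder; any callable works here

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
abar = torch.cumprod(1.0 - betas, dim=0)        # cumulative alpha_bar schedule

def ddim_invert(x0, t_target, n_steps=10):
    """Deterministically map a clean rendering x0 to x_{t_target} by running the
    DDIM update in reverse, re-estimating the noise at each sub-step."""
    ts = torch.linspace(0, t_target, n_steps + 1).long()
    x = x0.clone()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a_cur, a_next = abar[t_cur], abar[t_next]
        eps = eps_pred(x, t_cur.float())
        x0_hat = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
    return x

x0 = torch.randn(4, 16, requires_grad=True)     # differentiable rendering (flattened)
t = 400
x_t = ddim_invert(x0.detach(), t)

# Noise implied by the inversion, used in place of the i.i.d. noise of vanilla SDS.
eps_inv = (x_t - abar[t].sqrt() * x0.detach()) / (1 - abar[t]).sqrt()
grad = eps_pred(x_t, float(t)) - eps_inv        # SDS-style gradient with inverted noise
sds_loss = (grad.detach() * x0).sum()           # standard trick: d(sds_loss)/d(x0) == grad
sds_loss.backward()                             # would flow into the 3D parameters
print(x0.grad.abs().mean())
```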
2405.15885 Report Diffusion Bridge Implicit Models Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions given as endpoints. Despite their promising performance in tasks like image translation, DDBMs require a computationally intensive sampling process that involves the simulation of a (stochastic) differential equation through hundreds of network evaluations. In this work, we present diffusion bridge implicit models (DBIMs) for accelerated sampling of diffusion bridges without extra training. We generalize DDBMs via a class of non-Markovian diffusion bridges defined on the discretized timesteps concerning sampling, which share the same training objective as DDBMs. These generalized diffusion bridges give rise to generative processes ranging from stochastic to deterministic (i.e., an implicit probabilistic model) while being up to 25$\times$ faster than the vanilla sampler of DDBMs. Moreover, the deterministic sampling procedure yielded by DBIMs enables faithful encoding and reconstruction by a booting noise used in the initial sampling step, and allows us to perform semantically meaningful interpolation in image translation tasks by regarding the booting noise as the latent variable. This paper proposes Diffusion Bridge Implicit Models (DBIMs) for accelerated sampling of Denoising Diffusion Bridge Models (DDBMs). DDBMs are powerful for interpolating paired distributions but suffer from slow sampling, DBIMs aim to address this limitation. The paper generalizes DDBMs to non-Markovian diffusion bridges on discretized timesteps, enabling deterministic sampling akin to implicit probabilistic models. DBIMs achieve up to 25x faster sampling than DDBMs without extra training. DBIMs achieve state-of-the-art FID scores in image translation and restoration tasks with 100 sampling steps. Deterministic DBIMs enable faithful encoding and semantically meaningful interpolation. DBIMs, while faster than DDBMs, are still slower than GAN-based methods for one-step generation. Future work includes developing dedicated ODE solvers for DDBMs and exploring bridge distillation methods. diffusion bridge models, implicit models, accelerated sampling, image translation, image restoration
2405.15881 Report Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation Shentong Mo, Yapeng Tian In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images. To address this challenge, we introduce a novel diffusion architecture, Diffusion Mamba (DiM), which foregoes traditional attention mechanisms in favor of a scalable alternative. By harnessing the inherent efficiency of the Mamba architecture, DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length. Our architecture not only scales effectively but also outperforms existing diffusion transformers in both image and video generation tasks. The results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques. This work advances the field of generative models and paves the way for further applications of scalable architectures. This paper proposes Diffusion Mamba (DiM), a novel diffusion model architecture for image and video generation that leverages the efficiency of the Mamba architecture, replacing traditional attention mechanisms with state space models to reduce computational complexity. Existing diffusion models, particularly diffusion transformers (DiT), face scalability limitations due to the quadratic complexity of attention mechanisms, hindering their application to high-resolution image and video generation tasks. DiM addresses this challenge by integrating the Mamba architecture's linear complexity for efficient sequence processing. DiM adapts the Mamba architecture to handle 2D image data by transforming latent image representations into sequences of patches processed by DiM blocks. Each DiM block employs bidirectional state space models to capture spatial dependencies within and across frames, ensuring temporal coherence in video generation. DiM consistently outperforms DiT across various model sizes and training steps on image generation benchmarks like ImageNet, demonstrating faster convergence and better FID-50K scores. The architecture exhibits significant computational efficiency, achieving comparable or better image generation quality with notably lower Gflops compared to DiT, particularly at higher resolutions. DiM effectively extends to video generation, achieving competitive Frechet Video Distance (FVD) scores on the UCF-101 dataset, demonstrating its ability to generate high-fidelity and temporally coherent video clips. The performance of DiM in video generation, specifically for scenarios involving highly dynamic content or low visibility, requires further investigation. The current DiM implementation's capacity to capture long-term dependencies in extended video sequences, essential for long-form video generation, needs further exploration and potential enhancements. diffusion models, image generation, video generation, state space models, mamba architecture
2405.15769 Report FastDrag: Manipulate Anything in One Step Xuanjia Zhao, Jian Guan, Congyi Fan, Dongli Xu, Youtian Lin, Haiwei Pan, Pengming Feng Drag-based image editing using generative models provides precise control over image contents, enabling users to manipulate anything in an image with a few clicks. However, prevailing methods typically adopt $n$-step iterations for latent semantic optimization to achieve drag-based image editing, which is time-consuming and limits practical applications. In this paper, we introduce a novel one-step drag-based image editing method, i.e., FastDrag, to accelerate the editing process. Central to our approach is a latent warpage function (LWF), which simulates the behavior of a stretched material to adjust the location of individual pixels within the latent space. This innovation achieves one-step latent semantic optimization and hence significantly promotes editing speeds. Meanwhile, null regions emerging after applying LWF are addressed by our proposed bilateral nearest neighbor interpolation (BNNI) strategy. This strategy interpolates these regions using similar features from neighboring areas, thus enhancing semantic integrity. Additionally, a consistency-preserving strategy is introduced to maintain the consistency between the edited and original images by adopting semantic information from the original image, saved as key and value pairs in self-attention module during diffusion inversion, to guide the diffusion sampling. Our FastDrag is validated on the DragBench dataset, demonstrating substantial improvements in processing time over existing methods, while achieving enhanced editing performance. This paper introduces FastDrag, a novel one-step drag-based image editing method that significantly accelerates the editing process while maintaining high quality. Existing drag-based image editing methods rely on time-consuming n-step iterative optimizations, limiting their practicality. FastDrag addresses this limitation by enabling one-step optimization. FastDrag employs a latent warpage function (LWF) to simulate the behavior of stretched materials, enabling one-step adjustment of pixel locations in the latent space. It also utilizes bilateral nearest neighbor interpolation (BNNI) to fill null regions and a consistency-preserving strategy to maintain image coherence. FastDrag is significantly faster than state-of-the-art methods, achieving up to 700% speed improvement. It achieves comparable, if not better, editing performance compared to existing techniques. FastDrag maintains high image quality even in complex textures and multi-point dragging scenarios. The paper focuses on drag-based editing and may not generalize to other editing paradigms. Future work could explore extending FastDrag to handle more complex editing tasks. image editing, drag-based editing, diffusion models, latent space manipulation, one-step optimization
2405.15758 Report InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we design an automatic annotation pipeline to construct an instruction-video paired training dataset, equipped with a novel two-branch diffusion-based generator to predict avatars with audio and text instructions at the same time. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. Our project page is https://wangyuchi369.github.io/InstructAvatar/. Introduces InstructAvatar, a text-guided diffusion-based model for generating expressive 2D talking avatars with fine-grained control over emotions and facial motions. Existing talking avatar generation models struggle with conveying and controlling detailed expressions and motions, resulting in less vivid and controllable videos. Leverages a natural language interface to control avatar expressions and motions. Employs an automatic annotation pipeline with GPT-4V to generate fine-grained text instructions from videos. Uses a two-branch diffusion model with cross-attention to incorporate emotion and motion instructions during video generation. InstructAvatar significantly improves emotion control, lip-sync quality, and naturalness compared to existing methods. Enables control over a wider range of instructions due to its natural language interface. Successfully animates avatars directly from text instructions without relying on audio cues. Limited ability to control disentangled single action units due to training on combined action units. Modest training dataset size may hinder robustness in handling out-of-domain instructions or appearances. Inability to simultaneously control emotion and motion due to training data limitations. talking avatar generation, emotion control, facial motion control, text-guided generation, diffusion models
2405.15757 Report Looking Backward: Streaming Video-to-Video Translation with Feature Banks Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods using batches to process limited frames, we opt to process frames in a streaming fashion, to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by maintaining a feature bank, which archives information from past frames. For incoming frames, StreamV2V extends self-attention to include banked keys and values and directly fuses similar past features into the output. The feature bank is continually updated by merging stored and new features, making it compact but informative. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without fine-tuning. It can run 20 FPS on one A100 GPU, being 15x, 46x, 108x, and 158x faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency. StreamV2V is a novel diffusion model that performs real-time video-to-video translation on streaming video inputs, unlike previous batch-based methods limited to short clips. Existing video-to-video translation methods are constrained by batch processing, limiting their ability to handle long or streaming videos in real-time applications. StreamV2V processes frames sequentially, maintaining a compact feature bank of past frames. It leverages extended self-attention and direct feature fusion to ensure temporal consistency during generation, building upon the StreamDiffusion framework. StreamV2V achieves real-time performance (20 FPS on a single A100 GPU), significantly outpacing prior V2V methods. A dynamic merging strategy for the feature bank balances compactness with informativeness, enabling efficient processing without sacrificing consistency. Quantitative metrics and user studies demonstrate StreamV2V's effectiveness in maintaining temporal consistency while enabling real-time video editing. StreamV2V's editing capability is currently limited by the underlying image editing method (SDEdit). While generally consistent, the model can produce artifacts, especially in videos with rapid camera or object movement, leaving room for further improvement. video-to-video translation, diffusion models, real-time processing, streaming video, feature banks
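A rough sketch of the backward-looking mechanism: self-attention is extended with banked keys/values from past frames, and the bank is kept compact by merging its most similar entries. The merging rule, bank size, and shapes below are simplifications assumed for illustration, not StreamV2V's exact procedure.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_tokens, bank_size = 64, 16, 32

def extended_self_attention(q, k, v, bank_k, bank_v):
    """Attend over the current frame's tokens plus banked keys/values from past frames."""
    k_all = torch.cat([k, bank_k], dim=1)
    v_all = torch.cat([v, bank_v], dim=1)
    return F.scaled_dot_product_attention(q.unsqueeze(1), k_all.unsqueeze(1),
                                          v_all.unsqueeze(1)).squeeze(1)

def update_bank(bank, new_feats, max_size=bank_size):
    """Keep the bank compact: append new features, then merge the most similar pair of
    entries until the size budget is met (a simple stand-in for dynamic merging)."""
    bank = torch.cat([bank, new_feats], dim=1)
    while bank.shape[1] > max_size:
        sim = F.cosine_similarity(bank[:, :, None, :], bank[:, None, :, :], dim=-1)
        sim.diagonal(dim1=1, dim2=2).fill_(-1.0)
        i, j = divmod(int(sim[0].argmax()), bank.shape[1])
        merged = (bank[:, i] + bank[:, j]) / 2
        keep = [t for t in range(bank.shape[1]) if t not in (i, j)]
        bank = torch.cat([bank[:, keep], merged[:, None, :]], dim=1)
    return bank

bank_k = torch.zeros(1, bank_size, dim)
bank_v = torch.zeros(1, bank_size, dim)
for frame in range(5):                                   # streaming frames
    feat = torch.randn(1, n_tokens, dim)                 # frame features in a diffusion block
    q, k, v = feat, feat, feat                           # projections omitted in this sketch
    out = extended_self_attention(q, k, v, bank_k, bank_v)
    bank_k, bank_v = update_bank(bank_k, k), update_bank(bank_v, v)
print(out.shape, bank_k.shape)                           # (1, 16, 64) (1, 32, 64)
```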
2405.15738 Report ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution generating only 576 visual tokens, capable of handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at https://github.com/alibaba/conv-llava. This paper introduces ConvLLaVA, a large multimodal model that utilizes a five-stage ConvNeXt as its visual encoder to address the challenges of excessive visual tokens and quadratic visual complexity in high-resolution images. Existing high-resolution large multimodal models (LMMs) often generate excessive visual tokens, leading to high computational costs and hindering efficient visual information extraction. ConvLLaVA tackles this issue by effectively compressing high-resolution images into information-rich visual features. The authors propose ConvLLaVA, which replaces the traditional Vision Transformer (ViT) with a hierarchical ConvNeXt backbone as the visual encoder. They introduce two key optimizations: (1) updating the pretrained ConvNeXt for better performance on high-resolution images and (2) adding a fifth stage to ConvNeXt to further compress visual tokens, reducing redundancy. ConvLLaVA with a five-stage ConvNeXt successfully compresses visual information, generating fewer visual tokens than ViT-based models at the same resolution. Updating the pretrained ConvNeXt for high-resolution inputs is crucial for achieving competitive performance on general capability benchmarks. Higher-resolution ConvLLaVA models consistently outperform lower-resolution counterparts on fine-grained tasks, indicating the effectiveness of compressing high-resolution images into information-rich visual tokens. The relatively small kernel size of the current ConvNeXt architecture, optimized for low-resolution images, may limit capacity at extremely high resolutions. The optimal balance between visual information compression and retrieval capabilities for high-resolution understanding requires further investigation. large multimodal models, convnext, visual token compression, high-resolution image understanding, vision-language models
2405.15734 Report LM4LV: A Frozen Large Language Model for Low-level Vision Tasks Boyang Zheng, Jinjin Gu, Shijun Li, Chao Dong The success of large language models (LLMs) has fostered a new research trend of multi-modality large language models (MLLMs), which changes the paradigm of various fields in computer vision. Though MLLMs have shown promising results in numerous high-level vision and vision-language tasks such as VQA and text-to-image, no works have demonstrated how low-level vision tasks can benefit from MLLMs. We find that most current MLLMs are blind to low-level features due to their design of vision modules, thus are inherently incapable of solving low-level vision tasks. In this work, we propose $\textbf{LM4LV}$, a framework that enables a FROZEN LLM to solve a range of low-level vision tasks without any multi-modal data or prior. This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks. We hope this work can inspire new perspectives on LLMs and deeper understanding of their mechanisms. This paper investigates the capability of a frozen Large Language Model (LLM) to process and generate low-level visual features, demonstrating its potential in solving low-level vision tasks like image denoising and deraining without multi-modal data or prior. Bridging the gap between MLLMs, which excel in high-level vision tasks, and low-level vision tasks is crucial for leveraging LLMs' reasoning and text generation abilities for better user interaction and interpretability in low-level vision. The paper proposes LM4LV, a framework that integrates a fine-tuned Masked Autoencoder (MAE) with a frozen LLM. This framework uses linear layers to adapt between visual and text features. The LLM is trained to autoregressively generate visual features conditioned on degraded images and task tokens. LM4LV successfully performs various low-level vision tasks like denoising, deblurring, and deraining, showcasing the LLM's ability to process low-level features. The choice of the vision module is crucial, with MAE outperforming VQGAN and BEiT due to its superior image reconstruction ability and potential alignment with the LLM's representation space. Auto-regressive generation is essential for LM4LV's success, as a more straightforward ViT-LLM generation scheme fails to produce high-quality results. LM4LV struggles to restore high-frequency details due to the lack of image prior. There is a performance gap between LM4LV and a one-layer Transformer in denoising, indicating room for improvement. large language models, low-level vision, multi-modality, image restoration, auto-regressive generation
2405.15688 Report UNION: Unsupervised 3D Object Detection using Object Appearance-based Pseudo-Classes Ted Lentsch, Holger Caesar, Dariu M. Gavrila Unsupervised 3D object detection methods have emerged to leverage vast amounts of data efficiently without requiring manual labels for training. Recent approaches rely on dynamic objects for learning to detect objects but penalize the detections of static instances during training. Multiple rounds of (self) training are used in which detected static instances are added to the set of training targets; this procedure to improve performance is computationally expensive. To address this, we propose the method UNION. We use spatial clustering and self-supervised scene flow to obtain a set of static and dynamic object proposals from LiDAR. Subsequently, object proposals' visual appearances are encoded to distinguish static objects in the foreground and background by selecting static instances that are visually similar to dynamic objects. As a result, static and dynamic foreground objects are obtained together, and existing detectors can be trained with a single training. In addition, we extend 3D object discovery to detection by using object appearance-based cluster labels as pseudo-class labels for training object classification. We conduct extensive experiments on the nuScenes dataset and increase the state-of-the-art performance for unsupervised object discovery, i.e. UNION more than doubles the average precision to 33.9. The code will be made publicly available. UNION, a novel framework for unsupervised 3D object detection that leverages LiDAR, camera, and temporal information jointly to generate pseudo-labels for training existing object detectors without manual annotations. Unsupervised 3D object detection reduces the dependency on expensive manual labeling, making it important for leveraging large-scale datasets. Existing methods suffer from limitations like iterative self-training and difficulty in detecting static foreground objects. UNION first generates object proposals by clustering LiDAR points and estimating their motion using scene flow. Then, it encodes the visual appearance of these proposals using a pre-trained vision foundation model and clusters them based on appearance similarity. Finally, it identifies and leverages the clusters containing dynamic objects to discover both static and dynamic mobile objects, generating pseudo-labels for training. UNION significantly outperforms existing unsupervised 3D object discovery methods on the nuScenes dataset, achieving more than double the average precision of the best baseline. Appearance-based clustering is identified as the key component driving UNION's performance improvement. UNION demonstrates the feasibility of unsupervised multi-class 3D object detection by generating pseudo-class labels from appearance clusters. The performance of UNION on rare object classes may be limited due to assumptions made about object frequency during appearance clustering. Future work includes extending UNION to handle rare classes more effectively and incorporating radar data for improved motion estimation. unsupervised learning, 3d object detection, lidar, camera, scene flow
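A schematic version of the proposal pipeline using scikit-learn: spatial clustering of LiDAR points, a scene-flow split into static and dynamic proposals, and appearance clustering that keeps clusters containing enough dynamic instances. The synthetic data, thresholds, and random "appearance embeddings" below are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
# Synthetic scene: 8 compact objects (120 LiDAR points each) plus sparse background.
centers = rng.uniform(-20, 20, size=(8, 3))
obj_pts = (centers[:, None, :] + rng.normal(scale=0.4, size=(8, 120, 3))).reshape(-1, 3)
points = np.vstack([obj_pts, rng.uniform(-20, 20, size=(400, 3))])
scene_flow = rng.normal(scale=0.05, size=(len(points), 3))   # self-supervised flow (m/frame)
scene_flow[:3 * 120] += np.array([1.0, 0.0, 0.0])            # first three objects are moving

# 1) Spatial clustering of (non-ground) points -> object proposals.
labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(points)
proposals = [np.where(labels == k)[0] for k in sorted(set(labels)) if k != -1]

# 2) Split proposals into static / dynamic by mean flow magnitude (threshold is a guess).
dyn_mask = np.array([np.linalg.norm(scene_flow[idx], axis=1).mean() > 0.5 for idx in proposals])

# 3) Appearance clustering: embed each proposal (random stand-ins here for image-crop
#    features from a vision foundation model) and keep clusters with enough dynamic
#    members; cluster ids can double as pseudo-class labels for training a detector.
appearance = rng.normal(size=(len(proposals), 256))
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(appearance)
keep = [c for c in range(4) if dyn_mask[clusters == c].mean() > 0.3]
mobile = [(p, c) for p, c in zip(proposals, clusters) if c in keep]
print(f"{len(proposals)} proposals, {int(dyn_mask.sum())} dynamic, {len(mobile)} kept as mobile")
```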
2405.15622 Report LAM3D: Large Image-Point-Cloud Alignment Model for 3D Reconstruction from Single Image Ruikai Cui, Xibin Song, Weixuan Sun, Senbo Wang, Weizhe Liu, Shenzhou Chen, Taizhang Shang, Yang Li, Nick Barnes, Hongdong Li, Pan Ji Large Reconstruction Models have made significant strides in the realm of automated 3D content generation from single or multiple input images. Despite their success, these models often produce 3D meshes with geometric inaccuracies, stemming from the inherent challenges of deducing 3D shapes solely from image data. In this work, we introduce a novel framework, the Large Image and Point Cloud Alignment Model (LAM3D), which utilizes 3D point cloud data to enhance the fidelity of generated 3D meshes. Our methodology begins with the development of a point-cloud-based network that effectively generates precise and meaningful latent tri-planes, laying the groundwork for accurate 3D mesh reconstruction. Building upon this, our Image-Point-Cloud Feature Alignment technique processes a single input image, aligning to the latent tri-planes to imbue image features with robust 3D information. This process not only enriches the image features but also facilitates the production of high-fidelity 3D meshes without the need for multi-view input, significantly reducing geometric distortions. Our approach achieves state-of-the-art high-fidelity 3D mesh reconstruction from a single image in just 6 seconds, and experiments on various datasets demonstrate its effectiveness. The paper introduces LAM3D, a Large Image and Point Cloud Alignment Model for enhancing the fidelity of 3D meshes generated from single images by utilizing 3D point cloud data as priors. Existing large reconstruction models often generate inaccurate 3D meshes from single or few-shot images due to the difficulty of deducing 3D shapes solely from 2D data. The method involves two stages: 1) compressing point clouds into latent tri-plane representations, and 2) aligning single image features to these tri-planes using a diffusion-based approach. LAM3D achieves state-of-the-art high-fidelity 3D mesh reconstruction from single images in 6 seconds. The use of point cloud priors significantly reduces geometric distortions compared to models relying solely on images. Independent diffusion processes for each tri-plane (XY, XZ, YZ) improve preservation of 3D structural information. The current model focuses on geometry reconstruction and lacks texture reconstruction capabilities. Future work will explore extending LAM3D for geometric and texture reconstruction. 3d reconstruction, point cloud, diffusion models, feature alignment, single image
2405.15619 Report DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, Dongyan Guo Monocular camera calibration is a key precondition for numerous 3D vision applications. Despite considerable advancements, existing methods often hinge on specific assumptions and struggle to generalize across varied real-world scenarios, and the performance is limited by insufficient training data. Recently, diffusion models trained on expansive datasets have been confirmed to maintain the capability to generate diverse, high-quality images. This success suggests a strong potential of the models to effectively understand varied visual information. In this work, we leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation. Specifically, we reformulate the problem of estimating the four degrees of freedom (4-DoF) of camera intrinsic parameters as a dense incident map generation task. The map details the angle of incidence for each pixel in the RGB image, and its format aligns well with the paradigm of diffusion models. The camera intrinsic then can be derived from the incident map with a simple non-learning RANSAC algorithm during inference. Moreover, to further enhance the performance, we jointly estimate a depth map to provide extra geometric information for the incident map estimation. Extensive experiments on multiple testing datasets demonstrate that our model achieves state-of-the-art performance, gaining up to a 40% reduction in prediction errors. Besides, the experiments also show that the precise camera intrinsic and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image. Presents DiffCalib, a novel diffusion-based approach for robust and accurate monocular camera calibration from single in-the-wild images by reformulating intrinsic parameter estimation as a dense incident map generation task. Existing methods struggle to generalize across diverse real-world scenarios due to reliance on specific geometric assumptions or objects. DiffCalib leverages the rich visual knowledge of pre-trained diffusion models to overcome this limitation, improving calibration accuracy and robustness. Utilizes a pre-trained Stable Diffusion model with a frozen VAE encoder/decoder. Fine-tunes the U-Net to generate incident maps, representing the angle of incidence for each pixel. Optionally jointly estimates depth maps to enhance performance. Employs a non-learning RANSAC algorithm to derive intrinsic parameters from the generated incident map. Achieves state-of-the-art performance on both seen and unseen datasets, demonstrating superior generalization ability. Jointly estimating incident and depth maps mutually benefits both tasks, resulting in more accurate predictions. Enables high-quality 3D reconstruction from single in-the-wild images, surpassing existing methods in detail preservation and geometric accuracy. Limited exploration of the impact of different diffusion model architectures and training strategies on calibration performance. Reliance on a pre-trained Stable Diffusion model necessitates significant computational resources for training and inference. monocular camera calibration, diffusion models, incident map generation, depth estimation, 3d reconstruction
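The non-learning recovery step can be illustrated directly: for a pinhole camera, a pixel's incident direction d satisfies u = f_x * (d_x / d_z) + c_x and v = f_y * (d_y / d_z) + c_y, so the 4-DoF intrinsics follow from per-axis linear fits (RANSAC simply repeats such fits on random subsets to reject outliers). The plain least-squares variant and function names below are a simplified stand-in, not the paper's implementation.

```python
import numpy as np

def incident_map_from_K(K, H, W):
    """Synthesize a dense incident map: one unit ray direction K^-1 [u, v, 1] per pixel."""
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3) @ np.linalg.inv(K).T
    return (rays / np.linalg.norm(rays, axis=1, keepdims=True)).reshape(H, W, 3)

def intrinsics_from_incident_map(inc):
    """Recover (fx, fy, cx, cy) by linear least squares from u = fx*(dx/dz) + cx, etc."""
    H, W, _ = inc.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rx = inc[..., 0] / inc[..., 2]
    ry = inc[..., 1] / inc[..., 2]
    A = np.stack([rx.ravel(), np.ones(rx.size)], axis=1)
    fx, cx = np.linalg.lstsq(A, u.ravel(), rcond=None)[0]
    B = np.stack([ry.ravel(), np.ones(ry.size)], axis=1)
    fy, cy = np.linalg.lstsq(B, v.ravel(), rcond=None)[0]
    return fx, fy, cx, cy

K_true = np.array([[800.0, 0.0, 320.0], [0.0, 780.0, 240.0], [0.0, 0.0, 1.0]])
inc = incident_map_from_K(K_true, H=480, W=640)
inc += np.random.default_rng(0).normal(scale=1e-3, size=inc.shape)   # mimic prediction noise
print(intrinsics_from_incident_map(inc))   # approximately (800, 780, 320, 240)
```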
2405.15580 Report Open-Vocabulary SAM3D: Understand Any 3D Scene Hanchen Tai, Qingdong He, Jiangning Zhang, Yijie Qian, Zhenyu Zhang, Xiaobin Hu, Yabiao Wang, Yong Liu Open-vocabulary 3D scene understanding presents a significant challenge in the field. Recent advancements have sought to transfer knowledge embedded in vision language models from the 2D domain to 3D domain. However, these approaches often require learning prior knowledge from specific 3D scene datasets, which limits their applicability in open-world scenarios. The Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities, prompting us to investigate its potential for comprehending 3D scenes without the need for training. In this paper, we introduce OV-SAM3D, a universal framework for open-vocabulary 3D scene understanding. This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene. Specifically, our method is composed of two key sub-modules: First, we initiate the process by generating superpoints as the initial 3D prompts and refine these prompts using segment masks derived from SAM. Moreover, we then integrate a specially designed overlapping score table with open tags from the Recognize Anything Model (RAM) to produce final 3D instances with open-world label. Empirical evaluations conducted on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments. Presents OV-SAM3D, a universal open-vocabulary 3D scene understanding framework capable of interpreting any 3D scene without prior knowledge. Addresses the challenge of open-vocabulary 3D scene understanding where models must locate and recognize objects in 3D scenes from text guidance, even for unseen objects, without relying on specific 3D scene dataset knowledge. Leverages superpoints as initial 3D prompts, refines them using SAM-derived segmentation masks, and employs an overlapping score table with RAM-recognized open tags to produce final 3D instances with open-world labels. Surpasses existing open-vocabulary methods in unknown open-world environments on ScanNet200 and nuScenes datasets. Demonstrates the effectiveness of combining multiple foundation models (SAM, RAM, CLIP) for open-vocabulary 3D scene understanding. Highlights the potential of transferring knowledge from 2D foundation models to the 3D domain for zero-shot learning. Current limitations in vision foundation models' ability to handle complex scenes with zero-shot learning. Dependence on the performance of underlying foundation models (SAM, RAM, CLIP). open-vocabulary learning, 3d scene understanding, zero-shot learning, foundation models, segment anything model (sam)
2405.15574 Report Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro The rapid development of large language and vision models (LLVMs) has been driven by advances in visual instruction tuning. Recently, open-source LLVMs have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models in order to narrow the performance gap with powerful closed-source LLVMs. These advancements are attributed to multifaceted information required for diverse capabilities, including fundamental image understanding, real-world knowledge about common-sense and non-object concepts (e.g., charts, diagrams, symbols, signs, and math problems), and step-by-step procedures for solving complex questions. Drawing from the multifaceted information, we present a new efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages multifaceted rationale to enhance understanding and answering capabilities. To embed lengthy rationales containing abundant information, we employ the Mamba architecture, capable of processing sequential data with linear time complexity. We introduce a new concept of traversal of rationale that facilitates efficient embedding of rationale. Subsequently, the backbone multimodal language model (MLM) is trained to generate answers with the aid of rationale. Through these steps, Meteor achieves significant improvements in vision language performances across multiple evaluation benchmarks requiring diverse capabilities, without scaling up the model size or employing additional vision encoders and computer vision models. Introduces Meteor, an efficient large language and vision model (LLVM) that leverages the Mamba architecture and a novel "traversal of rationale" concept to embed and utilize multifaceted rationales for enhanced understanding and answering capabilities in vision-language tasks. Addresses the need for efficient LLVMs that can implicitly embed multifaceted information (image understanding, common-sense knowledge, non-object concept comprehension, etc.) without relying on model scaling or additional vision encoders/models during inference. Combines a Mamba architecture for embedding lengthy rationales with a pretrained multimodal language model (MLM) trained on a curated dataset of 1.1M question-rationale-answer triples. Introduces "traversal of rationale" using special tokens to effectively convey rationale information to the MLM without explicit rationale access during inference. Meteor significantly outperforms existing open- and closed-source LLVMs on various benchmarks requiring diverse capabilities, including MME, MMB, and MM-Vet. Ablation studies confirm the effectiveness of the Mamba architecture, rationale embedding, traversal of rationale, and the curated dataset in achieving superior performance. Analysis of Meteor-Mamba reveals its ability to effectively embed rationales, enabling the model to leverage multifaceted information even without explicit rationale access during inference. Model size, while smaller than many large LLVMs, could still be prohibitive for users without high-end GPU resources. Future work includes exploring layer-analyzing approaches like mixture of depths to further reduce model size while maintaining performance. large language and vision models, rationale-guided prediction, multifaceted rationale, mamba architecture, traversal of rationale
2405.15491 Report GSDeformer: Direct Cage-based Deformation for 3D Gaussian Splatting Jiajun Huang, Hongchuan Yu We present GSDeformer, a method that achieves free-form deformation on 3D Gaussian Splatting(3DGS) without requiring any architectural changes. Our method extends cage-based deformation, a traditional mesh deformation method, to 3DGS. This is done by converting 3DGS into a novel proxy point cloud representation, where its deformation can be used to infer the transformations to apply on the 3D gaussians making up 3DGS. We also propose an automatic cage construction algorithm for 3DGS to minimize manual work. Our method does not modify the underlying architecture of 3DGS. Therefore, any existing trained vanilla 3DGS can be easily edited by our method. We compare the deformation capability of our method against other existing methods, demonstrating the ease of use and comparable quality of our method, despite being more direct and thus easier to integrate with other concurrent developments on 3DGS. GSDeformer: the first method for free-form deformation of 3D Gaussian Splatting scenes without modifying the underlying architecture. Existing 3DGS deformation methods require architecture changes, limiting their use for editing pre-trained scenes or integration with other 3DGS techniques. 1. Convert 3DGS to a proxy point cloud representation. 2. Deform the point cloud using cage-based deformation with user-defined cages. 3. Infer transformations from deformed points and apply them to original 3D Gaussians. Achieves high-quality deformation on synthetic and real-world 3DGS captures. Produces comparable deformation quality to state-of-the-art methods like DeformingNeRF, SuGaR, and GaMeS. Offers advantages by directly editing 3DGS without architecture modification, enabling easier integration and application to pre-trained models. Current implementation lacks real-time performance. Future work includes exploring faster deformation schemes and incorporating color parameter transformations. 3d gaussian splatting, deformation, cage-based deformation, scene manipulation, 3d scene editing
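The cage-based step above can be pictured with a minimal sketch, assuming an axis-aligned trilinear cage (a simplification of the general cages and automatic cage construction used in the paper): proxy points are deformed by the cage, and a per-Gaussian affine transform is then fit from the displaced proxy points so it could be applied to the Gaussian's mean and covariance.

```python
import numpy as np

def trilinear_ffd(points, cage_min, cage_max, deformed_corners):
    """Deform points inside an axis-aligned box cage by trilinear interpolation
    of its 8 (possibly displaced) corners. deformed_corners is (2, 2, 2, 3),
    indexed by (ix, iy, iz)."""
    t = (points - cage_min) / (cage_max - cage_min)      # normalized coords in [0, 1]
    t = np.clip(t, 0.0, 1.0)
    out = np.zeros_like(points)
    for ix in (0, 1):
        for iy in (0, 1):
            for iz in (0, 1):
                w = ((t[:, 0] if ix else 1 - t[:, 0]) *
                     (t[:, 1] if iy else 1 - t[:, 1]) *
                     (t[:, 2] if iz else 1 - t[:, 2]))
                out += w[:, None] * deformed_corners[ix, iy, iz]
    return out

def fit_affine(src, dst):
    """Least-squares affine transform (A, b) with dst ≈ src @ A.T + b,
    used to carry the proxy-point deformation back onto a Gaussian
    (A would update the covariance, b + A @ mean the center)."""
    src_h = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(src_h, dst, rcond=None)      # (4, 3)
    return M[:3].T, M[3]

# Usage: lift one face of the cage and deform one Gaussian's proxy points.
cage_min, cage_max = np.zeros(3), np.ones(3)
corners = np.array([[[[x, y, z] for z in (0., 1.)] for y in (0., 1.)] for x in (0., 1.)])
corners[1, :, :, 2] += 0.4                               # lift the x = 1 face
proxy = np.random.default_rng(0).random((16, 3)) * 0.2 + 0.4
deformed = trilinear_ffd(proxy, cage_min, cage_max, corners)
A, b = fit_affine(proxy, deformed)
```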
2405.15475 Report Efficient Degradation-aware Any Image Restoration Eduard Zamfir, Zongwei Wu, Nancy Mehta, Danda Dani Paudel, Yulun Zhang, Radu Timofte Reconstructing missing details from degraded low-quality inputs poses a significant challenge. Recent progress in image restoration has demonstrated the efficacy of learning large models capable of addressing various degradations simultaneously. Nonetheless, these approaches introduce considerable computational overhead and complex learning paradigms, limiting their practical utility. In response, we propose \textit{DaAIR}, an efficient All-in-One image restorer employing a Degradation-aware Learner (DaLe) in the low-rank regime to collaboratively mine shared aspects and subtle nuances across diverse degradations, generating a degradation-aware embedding. By dynamically allocating model capacity to input degradations, we realize an efficient restorer integrating holistic and specific learning within a unified model. Furthermore, DaAIR introduces a cost-efficient parameter update mechanism that enhances degradation awareness while maintaining computational efficiency. Extensive comparisons across five image degradations demonstrate that our DaAIR outperforms both state-of-the-art All-in-One models and degradation-specific counterparts, affirming our efficacy and practicality. The source will be publicly made available at \url{https://eduardzamfir.github.io/daair/} This paper proposes DaAIR, an efficient and accurate All-in-One image restoration model leveraging a novel Degradation-aware Learner (DaLe) to dynamically route model capacity to specific degradation experts while concurrently modeling shared information across degradation types within a low-rank framework. Existing image restoration methods often lack practicality due to their specialization in addressing a single degradation at a time, while recent All-in-One models suffer from high computational costs and complex learning paradigms. DaAIR utilizes a U-shaped architecture with DaLe integrated into each encoder block. DaLe comprises degradation-specific and agnostic experts, employing a routing mechanism to associate input features with their corresponding degradation experts. A self-learnable control mechanism, leveraging encoder knowledge, guides parameter updates in the decoder, enhancing restoration quality. DaAIR outperforms state-of-the-art All-in-One models on three degradation types (dehazing, deraining, and denoising), achieving an average improvement of 0.45 dB PSNR while being significantly more efficient. The method also excels in a five degradation setting, including deblurring and low-light enhancement, surpassing previous approaches in both performance and efficiency. Ablation studies confirm the efficacy of individual components, highlighting the importance of expert specialization, routing strategy, and self-learnable control for achieving superior restoration quality. The model currently relies on synthetically degraded images, which might limit its performance on realistic degradation scenarios. Incorporating external inductive biases, like edge information or frequency constraints, could further enhance the model's ability to handle multiple degradation types simultaneously. image restoration, all-in-one restoration, degradation-aware learning, low-rank representation, self-learnable control
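A minimal sketch of the degradation-aware, low-rank expert idea described above, assuming a soft router over per-image features; the layer sizes, number of experts, and routing rule are illustrative assumptions rather than DaAIR's actual architecture.

```python
import torch
import torch.nn as nn

class LowRankExpertLayer(nn.Module):
    """Toy degradation-aware layer: a shared (degradation-agnostic) low-rank branch
    plus K degradation-specific low-rank experts, mixed by a soft router."""
    def __init__(self, dim=64, rank=4, num_experts=3):
        super().__init__()
        self.shared_down = nn.Linear(dim, rank, bias=False)
        self.shared_up = nn.Linear(rank, dim, bias=False)
        self.expert_down = nn.ModuleList(nn.Linear(dim, rank, bias=False) for _ in range(num_experts))
        self.expert_up = nn.ModuleList(nn.Linear(rank, dim, bias=False) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):                                          # x: (B, N, dim) token features
        gate = torch.softmax(self.router(x.mean(dim=1)), dim=-1)   # (B, K) per-image routing
        out = x + self.shared_up(self.shared_down(x))              # holistic (shared) branch
        for k, (down, up) in enumerate(zip(self.expert_down, self.expert_up)):
            out = out + gate[:, k, None, None] * up(down(x))       # degradation-specific branch
        return out

# Usage on dummy features extracted from a degraded image.
feats = torch.randn(2, 256, 64)
print(LowRankExpertLayer()(feats).shape)        # torch.Size([2, 256, 64])
```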
2405.15463 Report PoinTramba: A Hybrid Transformer-Mamba Framework for Point Cloud Analysis Zicheng Wang, Zhenghao Chen, Yiming Wu, Zhen Zhao, Luping Zhou, Dong Xu Point cloud analysis has seen substantial advancements due to deep learning, although previous Transformer-based methods excel at modeling long-range dependencies on this task, their computational demands are substantial. Conversely, the Mamba offers greater efficiency but shows limited potential compared with Transformer-based methods. In this study, we introduce PoinTramba, a pioneering hybrid framework that synergies the analytical power of Transformer with the remarkable computational efficiency of Mamba for enhanced point cloud analysis. Specifically, our approach first segments point clouds into groups, where the Transformer meticulously captures intricate intra-group dependencies and produces group embeddings, whose inter-group relationships will be simultaneously and adeptly captured by efficient Mamba architecture, ensuring comprehensive analysis. Unlike previous Mamba approaches, we introduce a bi-directional importance-aware ordering (BIO) strategy to tackle the challenges of random ordering effects. This innovative strategy intelligently reorders group embeddings based on their calculated importance scores, significantly enhancing Mamba's performance and optimizing the overall analytical process. Our framework achieves a superior balance between computational efficiency and analytical performance by seamlessly integrating these advanced techniques, marking a substantial leap forward in point cloud analysis. Extensive experiments on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our approach, establishing a new state-of-the-art analysis benchmark on point cloud recognition. For the first time, this paradigm leverages the combined strengths of both Transformer and Mamba architectures, facilitating a new standard in the field. The code is available at https://github.com/xiaoyao3302/PoinTramba. This paper presents PoinTramba, a novel hybrid framework for point cloud analysis that combines the strengths of Transformer and Mamba architectures. Existing Transformer-based methods, while effective for point cloud analysis, are computationally demanding. Conversely, Mamba offers efficiency but lags in performance. PoinTramba addresses these limitations by leveraging the strengths of both architectures. PoinTramba segments point clouds into groups and uses Transformer to capture intra-group dependencies, generating group embeddings. Then, a bi-directional importance-aware ordering (BIO) strategy is introduced to reorder group embeddings, followed by a Mamba encoder to capture inter-group relationships efficiently. Finally, importance-aware pooling extracts global features for analysis. PoinTramba achieves state-of-the-art performance on real-world object classification (ScanObjectNN) and synthetic object classification (ModelNet40). The BIO strategy proves crucial for improving Mamba's performance on unordered point cloud data. Ablation studies validate the effectiveness of each component in PoinTramba, including the hybrid architecture, BIO, and importance-aware pooling. The study primarily focuses on importance-aware ordering, leaving room to explore alternative sorting algorithms to further optimize Mamba's potential. Further evaluation on a wider range of point cloud tasks is needed to comprehensively assess PoinTramba's capabilities. 
point cloud analysis, transformer, mamba, hybrid architecture, importance-aware ordering
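The BIO idea in the entry above can be sketched as scoring group embeddings and feeding them to a sequence model in ascending and descending importance order; since the actual Mamba block is not assumed here, a GRU stands in for it, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class BiImportanceOrdering(nn.Module):
    """Toy bi-directional importance-aware ordering (BIO): score each group
    embedding, then process the groups in ascending and descending score order
    with a sequence model (a GRU here as a stand-in for a Mamba block)."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.seq = nn.GRU(dim, dim, batch_first=True)

    def forward(self, groups):                       # groups: (B, G, dim) from the Transformer stage
        s = self.score(groups).squeeze(-1)           # (B, G) importance scores
        order = s.argsort(dim=1)                     # ascending importance
        idx = order.unsqueeze(-1).expand_as(groups)
        fwd, _ = self.seq(torch.gather(groups, 1, idx))
        bwd, _ = self.seq(torch.gather(groups, 1, idx.flip(1)))
        pooled = (torch.softmax(s, dim=1).unsqueeze(-1) * groups).sum(dim=1)   # importance-aware pooling
        return fwd.mean(dim=1) + bwd.mean(dim=1) + pooled                      # (B, dim) global feature

print(BiImportanceOrdering()(torch.randn(2, 32, 128)).shape)   # torch.Size([2, 128])
```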
2405.15425 Report Volumetric Primitives for Modeling and Rendering Scattering and Emissive Media Jorge Condor, Sebastien Speierer, Lukas Bode, Aljaz Bozic, Simon Green, Piotr Didyk, Adrian Jarabo We propose a volumetric representation based on primitives to model scattering and emissive media. Accurate scene representations enabling efficient rendering are essential for many computer graphics applications. General and unified representations that can handle surface and volume-based representations simultaneously, allowing for physically accurate modeling, remain a research challenge. Inspired by recent methods for scene reconstruction that leverage mixtures of 3D Gaussians to model radiance fields, we formalize and generalize the modeling of scattering and emissive media using mixtures of simple kernel-based volumetric primitives. We introduce closed-form solutions for transmittance and free-flight distance sampling for 3D Gaussian kernels, and propose several optimizations to use our method efficiently within any off-the-shelf volumetric path tracer by leveraging ray tracing for efficiently querying the medium. We demonstrate our method as an alternative to other forms of volume modeling (e.g. voxel grid-based representations) for forward and inverse rendering of scattering media. Furthermore, we adapt our method to the problem of radiance field optimization and rendering, and demonstrate comparable performance to the state of the art, while providing additional flexibility in terms of performance and usability. This paper introduces a novel volumetric representation for scattering and emissive media based on mixtures of kernel-based volumetric primitives, enabling efficient rendering and optimization within the radiative transfer framework. Current methods for representing volumetric media, like voxel grids, struggle with memory scalability and efficient light transport calculations. This new approach offers a compact representation and enables closed-form solutions for transmittance and emission, leading to faster rendering and easier integration into physics-based renderers. The authors leverage Gaussian kernels as their primitives, deriving closed-form expressions for transmittance, emission, and distance sampling. They utilize ray tracing to efficiently query these primitives and integrate their contributions along a ray, implementing their approach within a physics-based renderer. They also develop the adjoint of their method for inverse rendering applications. The Gaussian primitive-based representation significantly reduces memory consumption compared to voxel grids, while enabling efficient transmittance computations and achieving comparable rendering quality. The method is applicable to inverse rendering, demonstrated through the reconstruction of a scattering smoke plume with significantly less memory than the reference grid. For radiance field rendering, the approach achieves comparable quality to existing methods while allowing for greater control over rendering speed by leveraging techniques like early ray termination. The current implementation requires a GPU with hardware-accelerated ray tracing for optimal performance. Future work includes exploring more general media types (e.g., anisotropic media), refining the optimization pipeline for complex scenes, and improving the color model for highly anisotropic kernels. volumetric rendering, radiance fields, gaussian primitives, inverse rendering, light transport
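For the isotropic special case, the transmittance through a single Gaussian kernel along a ray has a closed form via the error function, which the sketch below evaluates; the paper's formulation covers general anisotropic 3D Gaussians plus free-flight distance sampling, which this toy version does not.

```python
import numpy as np
from scipy.special import erf

def gaussian_ray_optical_depth(o, d, mu, s, density, t0=0.0, t1=np.inf):
    """Optical depth of an isotropic Gaussian density
    sigma(x) = density * exp(-||x - mu||^2 / (2 s^2)) along the ray o + t d
    (d unit-length), integrated over t in [t0, t1]. Closed form via erf."""
    p = o - mu
    b = float(np.dot(p, d))                       # signed offset of the ray origin along d
    perp2 = float(np.dot(p, p)) - b * b           # squared perpendicular distance to the center
    amp = density * np.exp(-perp2 / (2 * s * s)) * s * np.sqrt(np.pi / 2)
    return amp * (erf((t1 + b) / (s * np.sqrt(2))) - erf((t0 + b) / (s * np.sqrt(2))))

def transmittance(o, d, mu, s, density, t):
    """Beer-Lambert transmittance up to distance t through the Gaussian."""
    return np.exp(-gaussian_ray_optical_depth(o, d, mu, s, density, 0.0, t))

# Ray passing through the center of a unit-scale Gaussian.
o, d, mu = np.array([-5., 0., 0.]), np.array([1., 0., 0.]), np.zeros(3)
print(transmittance(o, d, mu, s=1.0, density=0.5, t=10.0))
```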
2405.15364 Report NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer Meng You, Zhiyu Zhu, Hui Liu, Junhui Hou By harnessing the potent generative capabilities of pre-trained large video diffusion models, we propose NVS-Solver, a new novel view synthesis (NVS) paradigm that operates \textit{without} the need for training. NVS-Solver adaptively modulates the diffusion sampling process with the given views to enable the creation of remarkable visual experiences from single or multiple views of static scenes or monocular videos of dynamic scenes. Specifically, built upon our theoretical modeling, we iteratively modulate the score function with the given scene priors represented with warped input views to control the video diffusion process. Moreover, by theoretically exploring the boundary of the estimation error, we achieve the modulation in an adaptive fashion according to the view pose and the number of diffusion steps. Extensive evaluations on both static and dynamic scenes substantiate the significant superiority of our NVS-Solver over state-of-the-art methods both quantitatively and qualitatively. \textit{ Source code in } \href{https://github.com/ZHU-Zhiyu/NVS_Solver}{https://github.com/ZHU-Zhiyu/NVS$\_$Solver}. This paper introduces NVS-Solver, a novel training-free approach for novel view synthesis leveraging pre-trained large video diffusion models. NVS-Solver addresses limitations of existing methods in handling complex scene dynamics and generalizing to new scenes, while offering high-quality results without the need for extensive training. NVS-Solver modulates the reverse diffusion sampling process using prior information from given views. This modulation is performed adaptively based on an analysis of diffusion estimation error and intensity truncation error, ensuring accurate and visually pleasing novel view generation. NVS-Solver outperforms state-of-the-art NVS methods both qualitatively and quantitatively in single-view, multi-view, and dynamic scene scenarios. The adaptive modulation of the score function is crucial for accurate view synthesis, correcting warping errors and non-Lambert reflections effectively. Sufficient diffusion reverse steps are essential for accurate view pose estimation in the synthesized novel views. The current implementation of NVS-Solver requires longer processing time compared to existing methods. Future work will focus on improving the processing speed and exploring the potential of NVS-Solver for pose controllable video diffusion model distillation. novel view synthesis, video diffusion models, training-free, adaptive modulation, score-based diffusion
2405.15330 Report Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model Mingyang Yi, Aoxue Li, Yi Xin, Zhenguo Li Recently, the strong latent Diffusion Probabilistic Model (DPM) has been applied to high-quality Text-to-Image (T2I) generation (e.g., Stable Diffusion), by injecting the encoded target text prompt into the gradually denoised diffusion image generator. Despite the success of DPM in practice, the mechanism behind it remains to be explored. To fill this blank, we begin by examining the intermediate statuses during the gradual denoising generation process in DPM. The empirical observations indicate, the shape of image is reconstructed after the first few denoising steps, and then the image is filled with details (e.g., texture). The phenomenon is because the low-frequency signal (shape relevant) of the noisy image is not corrupted until the final stage in the forward process (initial stage of generation) of adding noise in DPM. Inspired by the observations, we proceed to explore the influence of each token in the text prompt during the two stages. After a series of experiments of T2I generations conditioned on a set of text prompts. We conclude that in the earlier generation stage, the image is mostly decided by the special token [\texttt{EOS}] in the text prompt, and the information in the text prompt is already conveyed in this stage. After that, the diffusion model completes the details of generated images by information from themselves. Finally, we propose to apply this observation to accelerate the process of T2I generation by properly removing text guidance, which finally accelerates the sampling up to 25\%+. This paper investigates the working mechanism of text-to-image diffusion models, particularly focusing on how text prompts influence the image generation process. Understanding this mechanism can lead to improvements in text-to-image generation techniques, such as accelerating the sampling process without sacrificing image quality. The authors analyze the image reconstruction process in stable diffusion models, examining the role of frequency signals and the influence of different text prompt components (semantic tokens and the special token [EOS]). They conduct experiments by switching [EOS] tokens between prompts and varying the strength of text guidance during different stages of the denoising process. The image generation process exhibits a 'first overall shape then details' pattern, where the overall shape is reconstructed in the early stages of denoising and details are filled in later. The special token [EOS] in the text prompt plays a dominant role in determining the overall shape of the generated image, conveying more information than semantic tokens. The information from the text prompt is primarily conveyed in the early shape reconstruction stage of the denoising process. Later stages mainly refine details based on the established shape. The study primarily focuses on a single text-to-image model, Stable Diffusion, and further investigation is needed to generalize the findings to other diffusion-based models. The impact of varying the number of [EOS] tokens and their positions within the prompt requires further exploration to fully understand their role. text-to-image generation, diffusion models, stable diffusion, text prompt engineering, frequency analysis
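The "shape first, then details" observation above rests on low-frequency content surviving forward noising better than high-frequency content; the small demo below makes that concrete by adding DDPM-style noise to a synthetic image and comparing per-band SNR (the image, noise level, and band cutoff are arbitrary choices, not the paper's setup).

```python
import numpy as np

def low_high_split(img, keep_ratio=0.1):
    """Split a grayscale image into low- and high-frequency parts by masking
    the centered FFT spectrum with a square of relative size keep_ratio."""
    F = np.fft.fftshift(np.fft.fft2(img))
    H, W = img.shape
    r = int(min(H, W) * keep_ratio / 2)
    mask = np.zeros_like(F, dtype=bool)
    mask[H // 2 - r:H // 2 + r, W // 2 - r:W // 2 + r] = True
    low = np.fft.ifft2(np.fft.ifftshift(F * mask)).real
    return low, img - low

def snr_db(signal, noisy):
    err = noisy - signal
    return 10 * np.log10((signal ** 2).mean() / (err ** 2).mean())

# Add DDPM-style noise x_t = sqrt(a) x0 + sqrt(1 - a) eps at a mid-range noise
# level and compare how well the low- vs high-frequency content survives.
rng = np.random.default_rng(0)
x0 = np.kron(rng.random((8, 8)), np.ones((16, 16)))   # blocky "shape" image
x0 += 0.05 * rng.standard_normal(x0.shape)            # plus fine "texture"
a = 0.5
xt = np.sqrt(a) * x0 + np.sqrt(1 - a) * rng.standard_normal(x0.shape)
low0, high0 = low_high_split(x0)
lowt, hight = low_high_split(xt / np.sqrt(a))         # rescale back to x0's range
print("low-frequency SNR (dB): ", round(snr_db(low0, lowt), 2))
print("high-frequency SNR (dB):", round(snr_db(high0, hight), 2))
```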
2405.15321 Report SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance Guibao Shen, Luozhou Wang, Jiantao Lin, Wenhang Ge, Chaozhe Zhang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Guangyong Chen, Yijun Li, Ying-Cong Chen Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. So the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter(SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit and non-fully connected graph representation greatly improves the fully connected, transformer-based text representations. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple relationships. To address the challenges posed by low-quality annotated datasets like Visual Genome, we have manually curated a highly clean, multi-relational scene graph-image paired dataset MultiRels. Furthermore, we design three metrics derived from GPT-4V to effectively and thoroughly measure the correspondence between images and scene graphs. Both qualitative and quantitative results validate the efficacy of our approach in controlling the correspondence in multiple relationships. This paper introduces the Scene Graph Adapter (SG-Adapter), which leverages scene graph knowledge to improve the contextual understanding and accuracy of text-to-image generation models, addressing the limitations of sequential text embeddings. Existing text-to-image generation models often misinterpret relationships between entities due to the sequential nature of text processing. This work aims to enhance the control and accuracy of these models by incorporating structured scene graph information. The SG-Adapter, designed as a transformer module, refines text embeddings using a novel triplet-token attention mechanism. This allows for precise mapping between textual elements and their visual representations in generated images. The SG-Adapter outperforms existing text-to-image generation methods in accurately depicting complex relationships between multiple entities, as demonstrated by both qualitative and quantitative evaluations. The research contributes a new dataset, MultiRels, featuring multiple relations and high-quality annotations, crucial for training and evaluating multi-relational learning models. The paper introduces three novel metrics derived from GPT-4V to effectively measure the correspondence between generated images and scene graphs, enabling a more accurate assessment of relation generation. The anonymization of human faces in the MultiRels dataset, while necessary for privacy, might introduce artifacts that could potentially impact image quality. Future work could focus on exploring more sophisticated anonymization techniques and expanding the MultiRels dataset to encompass a wider range of relations and scenarios. text-to-image generation, scene graph, attention mechanism, relation correspondence, multi-relational learning
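One plausible reading of the triplet-token attention above is a block mask that lets tokens attend only within their own scene-graph triplet; the sketch below builds such a mask and applies it in a bare-bones self-attention step, with projections omitted and all names assumed for illustration.

```python
import torch
import torch.nn.functional as F

def triplet_attention_mask(triplet_ids):
    """Boolean mask (True = may attend) restricting attention to tokens that
    belong to the same scene-graph triplet. triplet_ids: (N,) int tensor."""
    return triplet_ids.unsqueeze(0) == triplet_ids.unsqueeze(1)

def masked_self_attention(tokens, triplet_ids):
    """Single-head self-attention over text embeddings with the triplet mask,
    a toy stand-in for the adapter's triplet-token attention."""
    d = tokens.shape[-1]
    q = k = v = tokens                                   # (N, d); learned projections omitted
    scores = q @ k.T / d ** 0.5
    scores = scores.masked_fill(~triplet_attention_mask(triplet_ids), float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# "a man riding a horse" + "a horse eating grass": two triplets of 3 tokens each.
tokens = torch.randn(6, 32)
triplet_ids = torch.tensor([0, 0, 0, 1, 1, 1])
print(masked_self_attention(tokens, triplet_ids).shape)   # torch.Size([6, 32])
```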
2405.15313 Report Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion Aoxue Li, Mingyang Yi, Zhenguo Li Recently, text-to-image (T2I) editing has been greatly pushed forward by applying diffusion models. Despite the visual promise of the generated images, inconsistencies with the expected textual prompt remain prevalent. This paper aims to systematically improve the text-guided image editing techniques based on diffusion models, by addressing their limitations. Notably, the common idea in diffusion-based editing firstly reconstructs the source image via inversion techniques e.g., DDIM Inversion. Then following a fusion process that carefully integrates the source intermediate (hidden) states (obtained by inversion) with the ones of the target image. Unfortunately, such a standard pipeline fails in many cases due to the interference of texture retention and the new characters creation in some regions. To mitigate this, we incorporate human annotation as an external knowledge to confine editing within a ``Mask-informed'' region. Then we carefully Fuse the edited image with the source image and a constructed intermediate image within the model's Self-Attention module. Extensive empirical results demonstrate the proposed ``MaSaFusion'' significantly improves the existing T2I editing techniques. This paper proposes MaSaFusion, a novel training-free method for enhancing text-to-image editing using diffusion models by incorporating human annotations to improve feature preservation and new feature generation. Existing diffusion-based text-to-image editing methods struggle with inconsistencies between the generated image and the expected textual prompt, especially when object shapes vary. MaSaFusion leverages human annotations (sketch and editing region) to construct an intermediate image with desired shape via T2I Adapter and fuses its self-attention maps with source and target images during the generation process. MaSaFusion outperforms existing training-free methods on MagicBrush dataset in terms of image-text and image-image alignment. Using external knowledge like sketch maps and editing regions significantly improves editing quality. Direct Inversion, though slightly less accurate, offers a significant speedup over Null-text Inversion for practical applications. The performance of MaSaFusion depends on the accuracy of human annotations (sketch & editing region). MaSaFusion inherits limitations of Stable Diffusion, such as generating inconsistent facial features. text-to-image editing, diffusion models, human annotation, self-attention, t2i adapter
2405.15305 Report Diff3DS: Generating View-Consistent 3D Sketch via Differentiable Curve Rendering Yibo Zhang, Lihong Wang, Changqing Zou, Tieru Wu, Rui Ma 3D sketches are widely used for visually representing the 3D shape and structure of objects or scenes. However, the creation of 3D sketch often requires users to possess professional artistic skills. Existing research efforts primarily focus on enhancing the ability of interactive sketch generation in 3D virtual systems. In this work, we propose Diff3DS, a novel differentiable rendering framework for generating view-consistent 3D sketch by optimizing 3D parametric curves under various supervisions. Specifically, we perform perspective projection to render the 3D rational B\'ezier curves into 2D curves, which are subsequently converted to a 2D raster image via our customized differentiable rasterizer. Our framework bridges the domains of 3D sketch and raster image, achieving end-toend optimization of 3D sketch through gradients computed in the 2D image domain. Our Diff3DS can enable a series of novel 3D sketch generation tasks, including textto-3D sketch and image-to-3D sketch, supported by the popular distillation-based supervision, such as Score Distillation Sampling (SDS). Extensive experiments have yielded promising results and demonstrated the potential of our framework. This paper presents Diff3DS, a novel differentiable rendering framework for generating view-consistent 3D sketches from diverse inputs like text or single images. Existing methods for 3D sketch creation are primarily interactive and require professional skills, limiting their accessibility. This work introduces a user-friendly approach for generating view-consistent 3D sketches from commonly available inputs. The framework represents 3D sketches as a set of 3D rational Bézier curves. It uses perspective projection to obtain 2D curves, then utilizes a customized differentiable rasterizer to render these curves while preserving depth order. It employs Score Distillation Sampling (SDS) to leverage pre-trained 2D image generation models, enabling text or single image guided 3D sketch generation. Diff3DS is the first to achieve text-to-3D sketch generation, outperforming existing text-to-3D methods adapted for this task. The framework successfully generates 3D sketches from single images, surpassing the performance of a multiview reconstruction-based baseline. Ablation studies validate the contribution of key components like Time Annealing Schedule and Dynamic Noise Deletion. The framework inherits the sparse gradient issue from DiffVG, limiting its ability to optimize non-continuous parameters. The current approach doesn't differentiate between view-independent and view-dependent curves, potentially limiting the expressiveness of generated 3D shapes. Future work could incorporate diverse curve representations. 3d sketch generation, differentiable rendering, rational bézier curves, score distillation sampling, text-to-3d, image-to-3d
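A minimal sketch of the curve representation above: evaluate a 3D rational Bézier curve and project sample points with a pinhole camera; the differentiable rasterizer and depth ordering that the paper contributes are not reproduced here, and the camera intrinsics are placeholder values.

```python
import numpy as np
from math import comb

def rational_bezier(ctrl, weights, ts):
    """Evaluate a 3D rational Bézier curve with control points ctrl (n+1, 3)
    and scalar weights (n+1,) at parameters ts (T,)."""
    n = len(ctrl) - 1
    basis = np.stack([comb(n, i) * ts ** i * (1 - ts) ** (n - i) for i in range(n + 1)], axis=1)
    wb = basis * weights                                  # (T, n+1) weighted Bernstein basis
    return (wb @ ctrl) / wb.sum(axis=1, keepdims=True)    # (T, 3)

def project_pinhole(points, f=500.0, cx=256.0, cy=256.0):
    """Perspective projection of camera-space points (z > 0) to pixel coordinates."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

ctrl = np.array([[-1., 0., 4.], [-0.3, 1., 5.], [0.3, -1., 6.], [1., 0., 7.]])
weights = np.array([1.0, 2.0, 2.0, 1.0])
curve_2d = project_pinhole(rational_bezier(ctrl, weights, np.linspace(0, 1, 64)))
print(curve_2d.shape)                                     # (64, 2)
```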
2405.15304 Report Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient Yongliang Wu, Shiji Zhou, Mingzhuo Yang, Lianzhe Wang, Wenbo Zhu, Heng Chang, Xiao Zhou, Xu Yang Current text-to-image diffusion models have achieved groundbreaking results in image generation tasks. However, the unavoidable inclusion of sensitive information during pre-training introduces significant risks such as copyright infringement and privacy violations in the generated images. Machine Unlearning (MU), which provides an effective way to remove the sensitive concepts captured by the model, has been shown to be a promising approach to addressing these issues. Nonetheless, existing MU methods for concept erasure encounter two primary bottlenecks: 1) generalization issues, where concept erasure is effective only for the data within the unlearn set, and prompts outside the unlearn set often still result in the generation of sensitive concepts; and 2) utility drop, where erasing target concepts significantly degrades the model's performance. To this end, this paper first proposes a concept domain correction framework for unlearning concepts in diffusion models. By aligning the output domains of sensitive concepts and anchor concepts through adversarial training, we enhance the generalizability of the unlearning results. Secondly, we devise a concept-preserving scheme based on gradient surgery. This approach alleviates the parts of the unlearning gradient that contradict the relearning gradient, ensuring that the process of unlearning minimally disrupts the model's performance. Finally, extensive experiments validate the effectiveness of our model, demonstrating our method's capability to address the challenges of concept unlearning in diffusion models while preserving model utility. This paper introduces a novel approach for unlearning concepts in text-to-image diffusion models, tackling the limitations of existing methods in terms of generalizability and utility drop. Unlearning concepts in diffusion models is crucial for addressing copyright infringement, privacy violations, and the generation of inappropriate content, which are significant concerns associated with pre-trained models. The proposed method utilizes a concept domain correction framework with adversarial training to align the output domains of target and anchor concepts, enhancing generalizability. It also employs a concept-preserving gradient strategy based on gradient surgery to minimize the impact of unlearning on the model's performance on other concepts. The method effectively unlearns specific instances, styles, and inappropriate content while preserving the integrity of other elements and concepts in the generated images. Quantitative evaluations using CLIP Score, CLIP Accuracy, and FID demonstrate superior performance compared to existing methods, striking a balance between unlearning and retaining non-target concepts. Experiments with multiple instance removal and the I2P benchmark showcase the method's capability to handle complex unlearning scenarios and effectively reduce the generation of inappropriate content. The method still relies on an anchor-based approach, demanding considerable computational overhead for data preparation. Future work could explore the integration of the Latent Anchor method to optimize or bypass the data preparation process. machine unlearning, diffusion models, text-to-image synthesis, concept erasure, adversarial training, gradient surgery
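The concept-preserving gradient above can be illustrated with a PCGrad-style projection: when the unlearning gradient conflicts with the retaining/relearning gradient, the conflicting component is removed before the update. The toy losses and the exact projection rule below are assumptions for illustration, not the paper's implementation.

```python
import torch

def concept_preserving_step(unlearn_grads, retain_grads):
    """Per-parameter gradient surgery: if the unlearning gradient conflicts with
    the retaining gradient (negative inner product), project out the conflicting
    component so unlearning does not push the model against retained concepts."""
    surgered = []
    for g_u, g_r in zip(unlearn_grads, retain_grads):
        dot = torch.sum(g_u * g_r)
        if dot < 0:
            g_u = g_u - dot / (g_r.norm() ** 2 + 1e-12) * g_r
        surgered.append(g_u)
    return surgered

# Usage with a toy model and two stand-in losses (erase vs. retain).
model = torch.nn.Linear(8, 8)
params = list(model.parameters())
x = torch.randn(4, 8)
loss_unlearn = model(x).pow(2).mean()          # stand-in for the concept-erasure loss
loss_retain = (model(x) - x).pow(2).mean()     # stand-in for the retention loss
g_u = torch.autograd.grad(loss_unlearn, params, retain_graph=True)
g_r = torch.autograd.grad(loss_retain, params)
with torch.no_grad():
    for p, g in zip(params, concept_preserving_step(g_u, g_r)):
        p -= 0.01 * g
```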
2405.15287 Report StyleMaster: Towards Flexible Stylized Image Generation with Diffusion Models Chengming Xu, Kai Hu, Donghao Luo, Jiangning Zhang, Wei Li, Yanhao Ge, Chengjie Wang Stylized Text-to-Image Generation (STIG) aims to generate images based on text prompts and style reference images. In this paper, we propose a novel framework dubbed StyleMaster for this task by leveraging pretrained Stable Diffusion (SD), which tries to solve the previous problems such as insufficient style and inconsistent semantics. The enhancement lies in two novel modules, namely a multi-source style embedder and a dynamic attention adapter. In order to provide SD with better style embeddings, we propose the multi-source style embedder, which considers both global- and local-level visual information along with textual information, providing complementary style-related and semantic-related knowledge. Additionally, aiming for a better balance between adapter capacity and semantic control, the proposed dynamic attention adapter is applied to the diffusion UNet, in which adaptation weights are dynamically calculated based on the style embeddings. Two objective functions are introduced to optimize the model together with the denoising loss, which can further enhance semantic and style consistency. Extensive experiments demonstrate the superiority of StyleMaster over existing methods, rendering images with variable target styles while successfully maintaining the semantic information from the text prompts. This paper proposes StyleMaster, a novel framework for Stylized Text-to-Image Generation (STIG) that addresses limitations of existing methods in achieving sufficient style and semantic consistency. STIG is crucial for applications like art creation and movie editing, offering greater flexibility and applicability than traditional style transfer methods. StyleMaster leverages a multi-source style embedder to capture comprehensive style information from reference images while mitigating semantic leakage. It also employs a dynamic attention adapter to effectively integrate style embeddings into the diffusion process without compromising semantic fidelity. StyleMaster significantly outperforms existing STIG methods in both one-shot and multi-shot settings, demonstrating superior style similarity and semantic consistency. The multi-source style embedder effectively captures diverse style patterns while minimizing semantic leakage from reference images. The dynamic attention adapter successfully balances style influence and semantic fidelity, ensuring that generated images adhere to both style and text prompts. The patch-level transformer used in the multi-source style embedder limits its efficiency in handling a large number of reference images. The current method solely focuses on image-based style conditions, neglecting other potential modalities like text, videos, or 3D data. stylized text-to-image generation, stable diffusion, style embedding, dynamic attention adaptation, semantic consistency
2405.15234 Report Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu Diffusion models (DMs) have achieved remarkable success in text-to-image generation, but they also pose safety risks, such as the potential generation of harmful content and copyright violations. The techniques of machine unlearning, also known as concept erasing, have been developed to address these risks. However, these techniques remain vulnerable to adversarial prompt attacks, which can prompt DMs post-unlearning to regenerate undesired images containing concepts (such as nudity) meant to be erased. This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, resulting in the robust unlearning framework referred to as AdvUnlearn. However, achieving this effectively and efficiently is highly nontrivial. First, we find that a straightforward implementation of AT compromises DMs' image generation quality post-unlearning. To address this, we develop a utility-retaining regularization on an additional retain set, optimizing the trade-off between concept erasure robustness and model utility in AdvUnlearn. Moreover, we identify the text encoder as a more suitable module for robustification compared to UNet, ensuring unlearning effectiveness. And the acquired text encoder can serve as a plug-and-play robust unlearner for various DM types. Empirically, we perform extensive experiments to demonstrate the robustness advantage of AdvUnlearn across various DM unlearning scenarios, including the erasure of nudity, objects, and style concepts. In addition to robustness, AdvUnlearn also achieves a balanced tradeoff with model utility. To our knowledge, this is the first work to systematically explore robust DM unlearning through AT, setting it apart from existing methods that overlook robustness in concept erasing. Codes are available at: https://github.com/OPTML-Group/AdvUnlearn This paper presents AdvUnlearn, a novel framework integrating adversarial training (AT) into diffusion model (DM) unlearning, enhancing the robustness of concept erasure against adversarial prompt attacks while preserving image generation quality. Existing concept erasure techniques in DMs are vulnerable to adversarial attacks, which can prompt the regeneration of undesired content. This underscores the need for more robust unlearning methods to ensure the safe and ethical deployment of DMs. The paper proposes a bi-level optimization approach for AdvUnlearn, addressing effectiveness and efficiency challenges. It introduces a utility-retaining regularization using an external retain prompt set to balance robustness and utility. It also identifies the text encoder as a more suitable module for robustification than UNet, allowing for plug-and-play robust unlearning across different DM types. AdvUnlearn significantly improves the robustness of concept-erased DMs against adversarial attacks, evidenced by reduced attack success rates across various unlearning scenarios (nudity, objects, style). The framework effectively balances robustness with image generation quality, demonstrated by comparable FID and CLIP scores to the original DM, unlike some baselines that sacrifice utility for robustness. AdvUnlearn's text encoder, finetuned for unlearning on one DM, demonstrates promising plug-and-play capability, transferring robustness to other DM types without additional finetuning. The computational cost of AdvUnlearn is high due to adversarial training and utility-retaining regularization, requiring further research into optimization for practical deployment. While AdvUnlearn exhibits effectiveness in the studied scenarios, exploring its generalization to a wider range of concepts and attack strategies is crucial for future work. diffusion models, machine unlearning, concept erasing, adversarial prompt attacks, robustness
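A toy bi-level loop conveys the AdvUnlearn recipe sketched above: an inner PGD attack on the prompt embedding tries to re-elicit the erased concept, and the outer step unlearns on that adversarial embedding while a retain-set term preserves utility. Every module here (linear text encoder, concept-score head, step sizes) is a stand-in for illustration, not a Stable Diffusion component.

```python
import torch
import torch.nn as nn

text_encoder = nn.Linear(32, 32)                     # stand-in for the trainable text encoder
concept_head = nn.Linear(32, 1)                      # frozen proxy scoring "concept present"
for p in concept_head.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(text_encoder.parameters(), lr=1e-3)

def concept_score(prompt_emb):
    return concept_head(text_encoder(prompt_emb)).mean()

erase_prompt = torch.randn(4, 32)                    # embeddings of prompts to erase
retain_prompt = torch.randn(16, 32)                  # retain-set prompt embeddings
with torch.no_grad():
    retain_target = text_encoder(retain_prompt).clone()   # behaviour to preserve

for step in range(100):
    # Inner maximization: PGD on the prompt embedding to re-elicit the concept.
    delta = torch.zeros_like(erase_prompt, requires_grad=True)
    for _ in range(5):
        adv_loss = concept_score(erase_prompt + delta)
        grad, = torch.autograd.grad(adv_loss, delta)
        delta = (delta + 0.05 * grad.sign()).clamp(-0.2, 0.2).detach().requires_grad_(True)
    # Outer minimization: unlearn on the adversarial prompt + utility-retaining term.
    unlearn = concept_score(erase_prompt + delta.detach())
    utility = (text_encoder(retain_prompt) - retain_target).pow(2).mean()
    loss = unlearn + 1.0 * utility
    opt.zero_grad()
    loss.backward()
    opt.step()
```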
2405.15232 Report DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation, they often face challenges when confronted with out-of-distribution data. This is primarily due to their reliance on image encoders trained to encode images into task-relevant features, which may lead them to disregard irrelevant details. Delving into the modeling capabilities of diffusion models for images naturally prompts the question: Can diffusion models serve as the eyes of large language models for image perception? In this paper, we propose DEEM, a simple and effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder. This addresses the drawbacks of previous methods that solely relied on image encoders like ViT, thereby enhancing the model's resilience against out-of-distribution samples and reducing visual hallucinations. Importantly, this is achieved without requiring additional training modules and with fewer training parameters. We extensively evaluated DEEM on both our newly constructed RobustVQA benchmark and another well-known benchmark, POPE, for object hallucination. Compared to the state-of-the-art interleaved content generation models, DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data (10%), and a smaller base model size. This paper presents DEEM, a novel method that leverages diffusion models to improve the robustness and hallucination recognition abilities of large multimodal models (LMMs) by aligning the semantic distributions of image encoders. Existing LMMs often struggle with out-of-distribution data due to their reliance on image encoders that disregard irrelevant details. DEEM addresses this limitation by using diffusion models as an additional "eye" for LMMs to correct potential semantic bias in image encoding. DEEM uses a three-stage training process: image-text alignment pre-training, image-text instruction fine-tuning, and mask-text instruction fine-tuning. It leverages a VFM-based image encoder, an LLM-based multimodal decoder, and a DM-based image decoder. A consistency semantic regularization term ensures the alignment between the image encoder's semantic information and the diffusion model's generative feedback. DEEM demonstrates enhanced robustness and a superior capacity to alleviate model hallucinations compared to state-of-the-art interleaved image-text modeling models. DEEM achieves these improvements while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size. After supervised fine-tuning, DEEM achieves competitive performance on various multimodal tasks, including visual question-answering, region-level image captioning, and text-to-image generation. While DEEM improves robustness, it cannot completely eliminate the robustness knowledge forgetting issue caused by subsequent fine-tuning. Updating larger image encoders with DEEM can increase the training overhead, potentially limiting its applicability in certain scenarios. 
large multimodal models, diffusion models, robustness, hallucination recognition, semantic alignment
2405.15223 Report iVideoGPT: Interactive VideoGPTs are Scalable World Models Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer architecture for building interactive world models by integrating visual observations, actions, and rewards into a token sequence, featuring a novel compressive tokenization technique for efficiency. Addresses the limitations of existing world models that struggle to balance interactivity and scalability, bridging the gap between generative video models and practical model-based reinforcement learning. Presents a two-phase approach: 1) pre-training iVideoGPT on a massive dataset of human and robotic manipulation trajectories for action-free video prediction, and 2) adapting the pre-trained model to downstream tasks like action-conditioned video prediction, visual planning, and model-based RL. Achieves competitive performance in video prediction on BAIR and RoboNet datasets compared to state-of-the-art methods. Demonstrates effective visual planning capabilities, outperforming baselines in certain RoboDesk tasks and showing comparable performance to top models in the VP$^2$ benchmark. Shows significant improvement in sample efficiency for visual model-based RL on Meta-World tasks, matching or surpassing DreamerV3 performance and showcasing the potential of decoupling model and policy learning with powerful world models. Limited diversity in current pre-training data, particularly in publicly available robotic datasets, calls for incorporating more extensive and diverse data sources. Compressive tokenization's assumption that initial frames provide sufficient context for future predictions may not hold for scenarios with long videos and significant camera movements, suggesting further exploration of keyframe extraction techniques. world models, video prediction, reinforcement learning, transformers, computer vision
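One way to picture the multimodal sequence described above is a simple interleaving of observation codes, action tokens, and a binned reward token per step; the special-token ids and layout below are illustrative assumptions, not iVideoGPT's actual vocabulary.

```python
from typing import List

BOS, OBS_SEP, ACT_SEP, REW_SEP = 0, 1, 2, 3      # illustrative special-token ids

def interleave_trajectory(obs_tokens: List[List[int]],
                          action_tokens: List[List[int]],
                          reward_tokens: List[int]) -> List[int]:
    """Flatten a trajectory into one sequence for next-token prediction:
    [BOS, <obs_0>, ACT, <act_0>, REW, <rew_0>, OBS, <obs_1>, ...].
    obs_tokens[t] are the discrete codes of frame t (e.g. from a compressive
    tokenizer), action_tokens[t] the discretized action, reward_tokens[t] a
    binned reward token."""
    seq = [BOS]
    for t, obs in enumerate(obs_tokens):
        seq += obs
        if t < len(action_tokens):
            seq += [ACT_SEP] + action_tokens[t]
            seq += [REW_SEP, reward_tokens[t]]
            seq.append(OBS_SEP)
    return seq

# Two frames of 4 codes each, with one action/reward step in between.
seq = interleave_trajectory([[10, 11, 12, 13], [14, 15, 16, 17]],
                            [[20, 21]], [30])
print(seq)
```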
2405.15217 Report NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation Vikas Thamizharasan, Difan Liu, Matthew Fisher, Nanxuan Zhao, Evangelos Kalogerakis, Michal Lukac The success of denoising diffusion models in representing rich data distributions over 2D raster images has prompted research on extending them to other data representations, such as vector graphics. Unfortunately due to their variable structure and scarcity of vector training data, directly applying diffusion models on this domain remains a challenging problem. Using workarounds like optimization via Score Distillation Sampling (SDS) is also fraught with difficulty, as vector representations are non trivial to directly optimize and tend to result in implausible geometries such as redundant or self-intersecting shapes. NIVeL addresses these challenges by reinterpreting the problem on an alternative, intermediate domain which preserves the desirable properties of vector graphics -- mainly sparsity of representation and resolution-independence. This alternative domain is based on neural implicit fields expressed in a set of decomposable, editable layers. Based on our experiments, NIVeL produces text-to-vector graphics results of significantly better quality than the state-of-the-art. This paper introduces NIVeL, a novel method for text-to-vector graphics generation that uses neural implicit vector layers. Existing methods struggle to generate high-quality vector graphics from text due to the variable structure of vector representations and the lack of large-scale training data. NIVeL addresses these challenges by using an intermediate, vector-like representation based on neural implicit fields. NIVeL represents shapes as 2D continuous implicit functions, organized in a layered structure, and leverages score distillation sampling (SDS) from a pre-trained image-based diffusion model to optimize the parameters of these implicit functions. A key innovation is the use of a low-frequency implicit RGB image generator for initialization, leading to more semantically meaningful layers and improved final results. NIVeL outperforms state-of-the-art methods like VectorFusion in terms of CLIP-based metrics and perceptual quality, as demonstrated by user studies. The method effectively generates clean, editable, and semantically meaningful vector graphics from text prompts, even with a low parameter count. Ablation studies highlight the importance of the proposed initialization strategy for achieving high-quality results and avoiding common failure modes. The representation is currently limited by a fixed upper bound on the number of layers. Future work could explore a differentiable implicit-to-vector module for converting the implicit field to parametric curves. vector graphics, text-to-image synthesis, diffusion models, neural implicit fields, score distillation sampling
2405.15176 Report MonoDETRNext: Next-generation Accurate and Efficient Monocular 3D Object Detection Method Pan Liao, Feng Yang, Di Wu, Liu Bo Monocular vision-based 3D object detection is crucial in various sectors, yet existing methods face significant challenges in terms of accuracy and computational efficiency. Building on the successful strategies in 2D detection and depth estimation, we propose MonoDETRNext, which seeks to optimally balance precision and processing speed. Our methodology includes the development of an efficient hybrid visual encoder, enhancement of depth prediction mechanisms, and introduction of an innovative query generation strategy, augmented by an advanced depth predictor. Building on MonoDETR, MonoDETRNext introduces two variants: MonoDETRNext-F, which emphasizes speed, and MonoDETRNext-A, which focuses on precision. We posit that MonoDETRNext establishes a new benchmark in monocular 3D object detection and opens avenues for future research. We conducted an exhaustive evaluation demonstrating the model's superior performance against existing solutions. Notably, MonoDETRNext-A demonstrated a 4.60% improvement in the AP3D metric on the KITTI test benchmark over MonoDETR, while MonoDETRNext-F showed a 2.21% increase. Additionally, the computational efficiency of MonoDETRNext-F slightly exceeds that of its predecessor. Proposes MonoDETRNext, a monocular 3D object detection model with two variants: MonoDETRNext-F (speed-focused) and MonoDETRNext-A (accuracy-focused), improving upon MonoDETR. Monocular 3D object detection is crucial for applications with limited resources, but existing methods struggle with accuracy and efficiency. Develops an efficient hybrid visual encoder, enhances depth prediction mechanisms, and introduces a novel query generation strategy augmented by an advanced depth predictor. MonoDETRNext-A shows 4.60% improvement in AP3D on KITTI over MonoDETR. MonoDETRNext-F shows 2.21% improvement in AP3D on KITTI over MonoDETR. MonoDETRNext-F slightly surpasses MonoDETR in computational efficiency. Accuracy gap persists compared to multi-view or sensor fusion methods. Limited dataset availability for evaluation and comparison with other monocular methods. 3d object detection, monocular vision, depth prediction, efficient encoder, query generation
2405.15125 Report HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting Yuanhao Cai, Zihao Xiao, Yixun Liang, Yulun Zhang, Xiaokang Yang, Yaoyao Liu, Alan Yuille High dynamic range (HDR) novel view synthesis (NVS) aims to create photorealistic images from novel viewpoints using HDR imaging techniques. The rendered HDR images capture a wider range of brightness levels containing more details of the scene than normal low dynamic range (LDR) images. Existing HDR NVS methods are mainly based on NeRF. They suffer from long training time and slow inference speed. In this paper, we propose a new framework, High Dynamic Range Gaussian Splatting (HDR-GS), which can efficiently render novel HDR views and reconstruct LDR images with a user input exposure time. Specifically, we design a Dual Dynamic Range (DDR) Gaussian point cloud model that uses spherical harmonics to fit HDR color and employs an MLP-based tone-mapper to render LDR color. The HDR and LDR colors are then fed into two Parallel Differentiable Rasterization (PDR) processes to reconstruct HDR and LDR views. To establish the data foundation for the research of 3D Gaussian splatting-based methods in HDR NVS, we recalibrate the camera parameters and compute the initial positions for Gaussian point clouds. Experiments demonstrate that our HDR-GS surpasses the state-of-the-art NeRF-based method by 3.84 and 1.91 dB on LDR and HDR NVS while enjoying 1000x inference speed and only requiring 6.3% training time. This paper introduces HDR-GS, the first Gaussian Splatting-based framework for efficient high dynamic range (HDR) novel view synthesis. Existing HDR novel view synthesis methods, primarily based on NeRF, suffer from long training times and slow inference speeds, limiting their practical applications. HDR-GS utilizes a Dual Dynamic Range (DDR) Gaussian point cloud model to jointly represent HDR and LDR colors. It employs spherical harmonics for HDR color and an MLP-based tone-mapper for LDR color rendering. Two parallel differentiable rasterization processes then generate HDR and LDR views from these colors. Additionally, the paper recalibrates camera parameters and utilizes SfM points for initializing 3D Gaussians, addressing limitations of previous datasets. HDR-GS outperforms state-of-the-art NeRF-based methods by 1.91 dB on HDR novel view synthesis. HDR-GS achieves a 1000x faster inference speed compared to NeRF-based counterparts. HDR-GS significantly reduces training time, requiring only 6.3% of the time needed for SOTA methods. The paper mainly focuses on static scenes. Future work could explore the application of HDR-GS in dynamic scene modeling. novel view synthesis, high dynamic range imaging, gaussian splatting, 3d reconstruction, computer vision
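The LDR branch above can be read as a small tone-mapping MLP conditioned on exposure time; the sketch below maps log-HDR color plus log exposure to an LDR color in [0, 1], with layer sizes and the exact input parameterization assumed for illustration.

```python
import math
import torch
import torch.nn as nn

class ToneMapper(nn.Module):
    """Toy MLP tone-mapper: per-point HDR color + user exposure time -> LDR color."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),       # LDR output in [0, 1]
        )

    def forward(self, hdr_rgb, exposure):
        # Work in the log domain, mirroring how exposure time scales radiance.
        log_hdr = torch.log(hdr_rgb.clamp_min(1e-6))
        log_t = torch.full_like(hdr_rgb[:, :1], math.log(exposure))
        return self.mlp(torch.cat([log_hdr, log_t], dim=1))

hdr = torch.rand(1024, 3) * 10.0                      # HDR colors, e.g. from SH evaluation
print(ToneMapper()(hdr, exposure=0.25).shape)         # torch.Size([1024, 3])
```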
2405.15118 Report GS-Hider: Hiding Messages into 3D Gaussian Splatting Xuanyu Zhang, Jiarui Meng, Runyi Li, Zhipei Xu, Yongbing Zhang, Jian Zhang 3D Gaussian Splatting (3DGS) has already become the emerging research focus in the fields of 3D scene reconstruction and novel view synthesis. Given that training a 3DGS requires a significant amount of time and computational cost, it is crucial to protect the copyright, integrity, and privacy of such 3D assets. Steganography, as a crucial technique for encrypted transmission and copyright protection, has been extensively studied. However, it still lacks profound exploration targeted at 3DGS. Unlike its predecessor NeRF, 3DGS possesses two distinct features: 1) explicit 3D representation; and 2) real-time rendering speeds. These characteristics result in the 3DGS point cloud files being public and transparent, with each Gaussian point having a clear physical significance. Therefore, ensuring the security and fidelity of the original 3D scene while embedding information into the 3DGS point cloud files is an extremely challenging task. To solve the above-mentioned issue, we first propose a steganography framework for 3DGS, dubbed GS-Hider, which can embed 3D scenes and images into original GS point clouds in an invisible manner and accurately extract the hidden messages. Specifically, we design a coupled secured feature attribute to replace the original 3DGS's spherical harmonics coefficients and then use a scene decoder and a message decoder to disentangle the original RGB scene and the hidden message. Extensive experiments demonstrated that the proposed GS-Hider can effectively conceal multimodal messages without compromising rendering quality and possesses exceptional security, robustness, capacity, and flexibility. Our project is available at: https://xuanyuzhang21.github.io/project/gshider. This paper introduces GS-Hider, a novel steganography framework for 3D Gaussian Splatting (3DGS) capable of concealing 3D scenes or images within other 3D scenes. Protecting the copyright and privacy of 3D assets is crucial due to the high cost of rendering 3DGS. This method offers a solution for secure communication and copyright protection of 3D scenes. GS-Hider replaces the original 3DGS spherical harmonics coefficients with a coupled secured feature attribute. It then utilizes a scene decoder and a private message decoder to disentangle the original and hidden content from the coupled features. GS-Hider achieves high fidelity, with minimal degradation in rendering quality compared to the original 3DGS. The method ensures robust security, making it difficult for unauthorized users to extract the hidden content. GS-Hider demonstrates large capacity and versatility by enabling the hiding of single images or even multiple 3D scenes within a single 3D scene. The current approach does not consider view dependency, potentially impacting rendering quality. Rendering speed is slightly reduced due to high-dimensional feature rasterization and multi-layer convolutional decoding, though it remains within real-time requirements. 3d gaussian splatting, steganography, copyright protection, secure communication, 3d scene reconstruction
2405.15056 Report ElastoGen: 4D Generative Elastodynamics Yutao Feng, Yintong Shang, Xiang Feng, Lei Lan, Shandian Zhe, Tianjia Shao, Hongzhi Wu, Kun Zhou, Hao Su, Chenfanfu Jiang, Yin Yang We present ElastoGen, a knowledge-driven model that generates physically accurate and coherent 4D elastodynamics. Instead of relying on petabyte-scale data-driven learning, ElastoGen leverages the principles of physics-in-the-loop and learns from established physical knowledge, such as partial differential equations and their numerical solutions. The core idea of ElastoGen is converting the global differential operator, corresponding to the nonlinear elastodynamic equations, into iterative local convolution-like operations, which naturally fit modern neural networks. Each network module is specifically designed to support this goal rather than functioning as a black box. As a result, ElastoGen is exceptionally lightweight in terms of both training requirements and network scale. Additionally, due to its alignment with physical procedures, ElastoGen efficiently generates accurate dynamics for a wide range of hyperelastic materials and can be easily integrated with upstream and downstream deep modules to enable end-to-end 4D generation. ElastoGen is a knowledge-driven model for generating physically accurate and coherent 4D elastodynamics, leveraging physical laws and principles instead of petabyte-scale data. Learning physical dynamics from observable data is challenging due to noise and agnostic underlying coherence. Existing deep models struggle with temporal consistency and require vast data. ElastoGen addresses these issues by incorporating established physical knowledge. ElastoGen converts the global differential operator of elastodynamic equations into iterative local convolution-like operations. It utilizes a neural metric with diffusion-based parameterization and a general subspace method for efficient matrix-free computation. ElastoGen generates accurate elastodynamics for various shapes and hyperelastic materials (Neo-Hookean, StVK, etc.) with minimal parameterization and lightweight training. The model is compatible with different geometric representations, including voxels, implicit NeRFs, and complex explicit meshes. Experiments demonstrate that ElastoGen produces results comparable to traditional FEM simulations while offering greater efficiency. ElastoGen currently lacks support for collisions, limiting its applicability in scenarios involving interacting objects. The computational efficiency can be further improved, especially for large, sparse models where convolutions over empty voxels are computationally expensive. physics-based simulation, 4d generation, elastodynamics, diffusion models, neural networks
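The "global operator as local convolutions" idea above can be seen in miniature on a Poisson problem: each Jacobi relaxation step of the 5-point Laplacian is a fixed 3x3 convolution. This is only an analogy for how ElastoGen treats the elastodynamic operator, not the paper's stencil or material model.

```python
import torch
import torch.nn.functional as F

def jacobi_poisson(rhs, iters=200, h=1.0):
    """Solve -Laplace(u) = rhs on a grid with zero boundary by Jacobi iteration,
    where each step is a fixed 3x3 convolution (the 'local operator' view)."""
    neighbor_sum = torch.tensor([[[[0., 1., 0.],
                                   [1., 0., 1.],
                                   [0., 1., 0.]]]])
    u = torch.zeros_like(rhs)
    for _ in range(iters):
        s = F.conv2d(F.pad(u, (1, 1, 1, 1)), neighbor_sum)   # sum of 4 neighbors
        u = (s + h * h * rhs) / 4.0                          # Jacobi update for the 5-point Laplacian
    return u

rhs = torch.zeros(1, 1, 33, 33)
rhs[0, 0, 16, 16] = 1.0                              # point load in the middle of the grid
u = jacobi_poisson(rhs)
print(float(u.max()))
```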
2405.15020 Report AdjointDEIS: Efficient Gradients for Diffusion Models Zander W. Blasingame, Chen Liu The optimization of the latents and parameters of diffusion models with respect to some differentiable metric defined on the output of the model is a challenging and complex problem. The sampling for diffusion models is done by solving either the probability flow ODE or diffusion SDE wherein a neural network approximates the score function or related quantity, allowing a numerical ODE/SDE solver to be used. However, naïve backpropagation techniques are memory intensive, requiring the storage of all intermediate states, and face additional complexity in handling the injected noise from the diffusion term of the diffusion SDE. We propose a novel method based on the stochastic adjoint sensitivity method to calculate the gradient with respect to the initial noise, conditional information, and model parameters by solving an additional SDE whose solution is the gradient of the diffusion SDE. We exploit the unique construction of diffusion SDEs to further simplify the formulation of the adjoint diffusion SDE and use a change-of-variables to simplify the solution to an exponentially weighted integral. Using this formulation we derive a custom solver for the adjoint SDE as well as the simpler adjoint ODE. The proposed adjoint diffusion solvers can efficiently compute the gradients for both the probability flow ODE and diffusion SDE for latents and parameters of the model. Lastly, we demonstrate the effectiveness of the adjoint diffusion solvers on the face morphing problem. The paper introduces AdjointDEIS, a novel method for calculating gradients of diffusion models with respect to latents and parameters by solving an adjoint diffusion SDE, enabling efficient optimization of diffusion models. Optimizing diffusion models for specific tasks is challenging due to memory-intensive backpropagation and complexities in handling diffusion noise. AdjointDEIS addresses these challenges, enabling guided generation and adaptation of pre-trained models. The authors leverage the stochastic adjoint sensitivity method to derive an adjoint probability flow ODE and its simplified formulation using exponential integrators. They propose custom first and second-order solvers for both ODE and SDE settings. AdjointDEIS is the first general backpropagation technique for diffusion models using SDE solvers, providing gradients for network weights, conditional information, and noisy states. Custom solvers for the adjoint ODE/SDE demonstrate efficient computation of gradients. AdjointDEIS shows effectiveness in guided generation, specifically for face morphing attacks, outperforming existing methods in visual quality and attack efficacy. Further analysis is needed to evaluate AdjointDEIS on diverse guided generation tasks. Theoretical convergence rates for the proposed solvers are not yet established. diffusion models, adjoint sensitivity method, guided generation, face morphing attack, exponential integrators
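As background for the adjoint-sensitivity mechanism the paper builds on, here is a minimal continuous-adjoint sketch for a plain ODE: the gradient of a terminal loss is obtained by integrating a second, backward equation instead of backpropagating through every solver step. The drift f, the terminal loss, and the explicit-Euler discretization are placeholders, not the paper's exponential-integrator solvers for diffusion ODEs/SDEs.

```python
import torch

def f(x, t):
    # toy drift; in a diffusion model this would involve the learned score network
    return -x * torch.sin(t)

def adjoint_grad(x0, t0=0.0, t1=1.0, steps=100):
    dt = (t1 - t0) / steps
    # forward pass: store states only to re-evaluate f during the backward sweep
    xs, x, t = [x0], x0, t0
    for _ in range(steps):
        x = x + dt * f(x, torch.tensor(t))
        t += dt
        xs.append(x)
    loss = xs[-1].pow(2).sum()              # example terminal loss L(x(t1))
    a = 2 * xs[-1]                          # a(t1) = dL/dx(t1)
    # backward sweep: da/dt = -a^T df/dx, accumulated via vector-Jacobian products
    for i in range(steps, 0, -1):
        t -= dt
        x_i = xs[i - 1].detach().requires_grad_(True)
        fx = f(x_i, torch.tensor(t))
        vjp, = torch.autograd.grad(fx, x_i, grad_outputs=a)
        a = a + dt * vjp                    # ends as a(t0) = dL/dx0
    return loss, a

loss, grad_x0 = adjoint_grad(torch.randn(4))
```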
2405.14979 Report CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, Xiaoxiao Long We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite the significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, irregular mesh topologies, noisy surfaces, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we employ a 3D native diffusion model, which operates on latent space learned from latent set-based 3D representations, to generate coarse geometries with regular mesh topology in seconds. In particular, this process takes as input a text prompt or a reference image and leverages a powerful multi-view (MV) diffusion model to generate multiple views of the coarse geometry, which are fed into our MV-conditioned 3D diffusion model for generating the 3D geometry, significantly improving robustness and generalizability. Following that, a normal-based geometry refiner is used to significantly enhance the surface details. This refinement can be performed automatically, or interactively with user-supplied edits. Extensive experiments demonstrate that our method achieves high efficacy in producing superior-quality 3D assets compared to existing methods. HomePage: https://craftsman3d.github.io/, Code: https://github.com/wyysf-98/CraftsMan CraftsMan, a novel generative 3D modeling system that generates high-fidelity 3D geometries from a single image or text prompt, featuring regular mesh topologies, detailed surfaces, and interactive refinement capabilities. Existing 3D generation methods struggle with lengthy optimization processes, irregular mesh topologies, noisy surfaces, and difficulties in accommodating user edits, limiting their practical use. The system uses a two-stage process: (1) A 3D native diffusion model, conditioned on multi-view images from a multi-view diffusion model, generates coarse 3D geometries. (2) A normal-based geometry refiner, leveraging ControlNet-tile and surface normal map diffusion, enhances surface details either automatically or interactively based on user edits. Generates high-fidelity 3D geometries with regular mesh topologies and detailed surfaces in 30 seconds. Exhibits superior quality and detail richness compared to existing 3D generative and reconstruction models, as demonstrated by qualitative and quantitative evaluations. Offers interactive refinement tools, such as the Magic Normal Brush, allowing users to efficiently edit specific areas of the generated mesh. Limited controllability of the Latent Set Diffusion model. Future work includes exploring texture generation for 3D meshes. 3d generation, diffusion models, mesh refinement, interactive modeling, generative ai
2405.14874 Report Investigating Robustness of Open-Vocabulary Foundation Object Detectors under Distribution Shifts Prakash Chandra Chhipa, Kanjar De, Meenakshi Subhash Chippa, Rajkumar Saini, Marcus Liwicki The challenge of Out-Of-Distribution (OOD) robustness remains a critical hurdle towards deploying deep vision models. Open-vocabulary object detection extends the capabilities of traditional object detection frameworks to recognize and classify objects beyond predefined categories. Investigating OOD robustness in open-vocabulary object detection is essential to increase the trustworthiness of these models. This study presents a comprehensive robustness comparison of zero-shot capabilities of three recent open-vocabulary foundation object detection models, namely OWL-ViT, YOLO World, and Grounding DINO. Experiments carried out on the COCO-O and COCO-C benchmarks encompassing distribution shifts highlight the challenges of the models' robustness. Source code shall be made available to the research community on GitHub. This paper presents a comparative robustness analysis of three state-of-the-art open-vocabulary object detection models (OWL-ViT, YOLO World, and Grounding DINO) under out-of-distribution (OOD) conditions. Investigating OOD robustness in open-vocabulary object detection is crucial for increasing the trustworthiness and reliability of these models in real-world applications where they might encounter unseen data. The authors evaluate the zero-shot performance of the models on the COCO-O and COCO-C benchmarks, which introduce distribution shifts through various image degradations and corruptions. All three models exhibit significant performance drops on OOD data, indicating a need for improved robustness. Grounding DINO demonstrates the highest robustness, maintaining performance closer to its original COCO results compared to the other models. The study highlights the increasing susceptibility of the models to performance degradation as the severity of corruptions increases. The study primarily focuses on zero-shot evaluation and could be extended to include few-shot learning scenarios. Future work could explore techniques to enhance the robustness of open-vocabulary object detectors, such as prompt engineering or incorporating robustness-enhancing training strategies. open-vocabulary object detection, out-of-distribution robustness, zero-shot learning, distribution shift, computer vision
2405.14871 Report NeRF-Casting: Improved View-Dependent Appearance with Consistent Reflections Dor Verbin, Pratul P. Srinivasan, Peter Hedman, Ben Mildenhall, Benjamin Attal, Richard Szeliski, Jonathan T. Barron Neural Radiance Fields (NeRFs) typically struggle to reconstruct and render highly specular objects, whose appearance varies quickly with changes in viewpoint. Recent works have improved NeRF's ability to render detailed specular appearance of distant environment illumination, but are unable to synthesize consistent reflections of closer content. Moreover, these techniques rely on large computationally-expensive neural networks to model outgoing radiance, which severely limits optimization and rendering speed. We address these issues with an approach based on ray tracing: instead of querying an expensive neural network for the outgoing view-dependent radiance at points along each camera ray, our model casts reflection rays from these points and traces them through the NeRF representation to render feature vectors which are decoded into color using a small inexpensive network. We demonstrate that our model outperforms prior methods for view synthesis of scenes containing shiny objects, and that it is the only existing NeRF method that can synthesize photorealistic specular appearance and reflections in real-world scenes, while requiring comparable optimization time to current state-of-the-art view synthesis models. The paper introduces NeRF-Casting, a novel NeRF-based method that uses ray tracing to improve the rendering of specular reflections in 3D scenes. Existing NeRF models struggle to efficiently and accurately reconstruct and render scenes containing highly specular, glossy objects. Instead of relying on large, expensive MLPs, NeRF-Casting casts reflection rays from points along camera rays, tracing them through the learned NeRF representation. This allows for the synthesis of consistent reflections from both near-field and distant scene content. To enhance efficiency and prevent aliasing, the method employs directional sampling and feature downweighting techniques. Outperforms prior methods in view synthesis of scenes with shiny objects, particularly excelling in synthesizing high-quality reflections of nearby content. Demonstrates a qualitative improvement over existing techniques in achieving realistic and consistent motion of reflections as the camera moves through the scene. Achieves comparable optimization time to state-of-the-art view synthesis models while requiring less compute during inference. Limitations: Struggles to render semi-transparent surfaces due to reflecting from a single expected termination point per ray. Future work: Addressing the visibility of the camera in reflections, which is not currently accounted for by the model. neural radiance fields, view synthesis, reflections, ray tracing, specular appearance
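The core geometric step behind reflection-ray casting is simply bouncing the camera ray direction about the surface normal and tracing the reflected ray from the surface point. A short sketch of that step follows; variable names and shapes are illustrative, and everything around it (feature rendering, the small color decoder) is omitted.

```python
import torch
import torch.nn.functional as F

def reflect(d: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
    # d: incoming ray directions (..., 3); n: unit surface normals (..., 3)
    return d - 2.0 * (d * n).sum(-1, keepdim=True) * n

d = F.normalize(torch.randn(8, 3), dim=-1)   # camera ray directions
n = F.normalize(torch.randn(8, 3), dim=-1)   # surface normals at ray termination
r = reflect(d, n)                            # reflection rays traced through the field
```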
2405.14868 Report Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, Carl Vondrick Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups, and significantly restricting their utility in the wild as well as in terms of embodied AI applications. In this paper, we propose $\textbf{GCD}$, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input, and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained on synthetic multi-view video data only, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality. This paper presents Generative Camera Dolly (GCD), a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to generate synchronous videos from arbitrary perspectives given a single video of a scene and relative camera pose parameters. Accurate reconstruction of complex dynamic scenes from a single viewpoint is crucial for applications in robotics, autonomous driving, and immersive VR experiences. Existing methods often require multi-view videos or are limited to small viewpoint changes, restricting their practical utility. GCD leverages a pre-trained video diffusion model (Stable Video Diffusion) and fine-tunes it on paired videos from simulations. The model utilizes a novel micro-conditioning mechanism to control camera parameters and learns to generate videos from novel viewpoints by gradually interpolating between source and target camera poses. GCD achieves state-of-the-art results on monocular dynamic view synthesis, outperforming baselines by a large margin on Kubric-4D and ParallelDomain-4D datasets. The model demonstrates strong generalization capabilities, producing plausible novel views for various real-world videos, including driving, indoor, and robotic manipulation scenes. GCD effectively handles large camera viewpoint changes, revealing unseen portions of the scene and reconstructing occluded objects. GCD may struggle with out-of-distribution real-world videos, particularly those involving highly deformable objects or complex human motion. The model's performance can be sensitive to the choice of camera trajectory and interpolation method. dynamic view synthesis, video diffusion models, monocular depth estimation, camera pose control, scene understanding
2405.14866 Report Tele-Aloha: A Low-budget and High-authenticity Telepresence System Using Sparse RGB Cameras Hanzhang Tu, Ruizhi Shao, Xue Dong, Shunyuan Zheng, Hao Zhang, Lili Chen, Meili Wang, Wenyu Li, Siyan Ma, Shengping Zhang, Boyao Zhou, Yebin Liu In this paper, we present a low-budget and high-authenticity bidirectional telepresence system, Tele-Aloha, targeting peer-to-peer communication scenarios. Compared to previous systems, Tele-Aloha utilizes only four sparse RGB cameras, one consumer-grade GPU, and one autostereoscopic screen to achieve high-resolution (2048x2048), real-time (30 fps), low-latency (less than 150ms) and robust distant communication. As the core of Tele-Aloha, we propose an efficient novel view synthesis algorithm for upper-body. Firstly, we design a cascaded disparity estimator for obtaining a robust geometry cue. Additionally a neural rasterizer via Gaussian Splatting is introduced to project latent features onto target view and to decode them into a reduced resolution. Further, given the high-quality captured data, we leverage weighted blending mechanism to refine the decoded image into the final resolution of 2K. Exploiting world-leading autostereoscopic display and low-latency iris tracking, users are able to experience a strong three-dimensional sense even without any wearable head-mounted display device. Altogether, our telepresence system demonstrates the sense of co-presence in real-life experiments, inspiring the next generation of communication. Tele-Aloha, a low-budget and high-authenticity bidirectional telepresence system using sparse RGB cameras for peer-to-peer communication. Existing telepresence systems are often expensive, require complex hardware setups, and rely on depth sensors that can be sensitive to environmental factors. Utilizes four sparse RGB cameras, a consumer-grade GPU, and an autostereoscopic screen. Develops a novel view synthesis algorithm with a cascaded disparity estimator for robust geometry cues and a neural rasterizer based on 3D Gaussian Splatting for high-quality rendering. Achieves high-resolution (2048x2048), real-time (30 fps), low-latency (less than 150ms) performance. Produces competitive depth maps compared to TOF sensors using only RGB cameras. Outperforms other efficient novel view synthesis algorithms in terms of rendering quality on a synthetic dataset. System may fail on specular objects due to challenges in disparity estimation. Potential issues with inaccurate background segmentation can lead to artifacts. telepresence, videoconferencing, novel view synthesis, 3d gaussian splatting, rgb-only
2405.14858 Report Mamba-R: Vision Mamba ALSO Needs Registers Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba -- they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture Mamba-R. Qualitative observations suggest, compared to vanilla Vision Mamba, Mamba-R's feature maps appear cleaner and more focused on semantically meaningful regions. Quantitatively, Mamba-R attains stronger performance and scales better. For example, on the ImageNet benchmark, our base-size Mamba-R attains 82.9% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size (i.e., with 341M parameters), attaining a competitive accuracy of 83.2% (84.5% if finetuned with 384x384 inputs). Additional validation on the downstream semantic segmentation task also supports Mamba-R's efficacy. This paper identifies severe feature artifacts in Vision Mamba models, similar to but worse than those in ViTs, and proposes Mamba-R, a novel architecture that incorporates register tokens to mitigate this issue. Addressing the artifact issue in Vision Mamba is crucial as these artifacts hinder feature extraction, limit scalability, and negatively impact performance. The paper introduces Mamba-R, which builds upon Vision Mamba by incorporating two key modifications: 1) evenly inserting register tokens throughout the input token sequence and 2) recycling registers for final decision predictions. Mamba-R effectively suppresses artifacts, resulting in cleaner feature maps that focus on semantically meaningful image regions. Quantitatively, Mamba-R significantly outperforms vanilla Vision Mamba on ImageNet, achieving 82.9% accuracy for the Base model. Mamba-R exhibits superior scalability compared to previous Vision Mamba models, effectively scaling to a Large size with 341M parameters and achieving 83.2% accuracy on ImageNet. The paper primarily focuses on image classification, leaving exploration of Mamba-R in other vision tasks for future work. Further investigation into the interpretability of register tokens, particularly their potential for multi-head-like behavior, is warranted. vision mamba, state space models, feature artifacts, register tokens, image classification, semantic segmentation
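A small sketch of the two register tricks named above, with made-up sizes: interleave R learnable register tokens evenly among the N patch tokens, then gather the registers back after the Mamba blocks and concatenate them ("recycling") as the classification feature. The exact placement rule and head design in Mamba-R may differ.

```python
import torch

N, R, D = 196, 12, 192                       # patch tokens, registers, embed dim (arbitrary)
patches = torch.randn(1, N, D)
registers = torch.nn.Parameter(torch.zeros(1, R, D))

chunks, reg_pos, cursor = [], [], 0
step = N // R
for i in range(R):
    end = (i + 1) * step if i < R - 1 else N
    chunks.append(patches[:, i * step:end])   # a slice of patch tokens
    cursor += end - i * step
    reg_pos.append(cursor)                    # where this register sits in the sequence
    chunks.append(registers[:, i:i + 1])      # one register token
    cursor += 1
seq = torch.cat(chunks, dim=1)                # (1, N + R, D), registers spread evenly

# ... run the Mamba blocks on `seq` here ...

recycled = seq[:, reg_pos]                    # (1, R, D) register outputs
head_input = recycled.reshape(1, R * D)       # concatenated and fed to the linear head
```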
2405.14857 Report Semantica: An Adaptable Image-Conditioned Diffusion Model Manoj Kumar, Neil Houlsby, Emiel Hoogeboom We investigate the task of adapting image generative models to different datasets without finetuning. To this end, we introduce Semantica, an image-conditioned diffusion model capable of generating images based on the semantics of a conditioning image. Semantica is trained exclusively on web-scale image pairs, that is, it receives a random image from a webpage as conditional input and models another random image from the same webpage. Our experiments highlight the expressivity of pretrained image encoders and the necessity of semantic-based data filtering in achieving high-quality image generation. Once trained, it can adaptively generate new images from a dataset by simply using images from that dataset as input. We study the transfer properties of Semantica on ImageNet, LSUN Churches, LSUN Bedroom and SUN397. This paper introduces Semantica, an image-conditioned diffusion model that adapts to different datasets without finetuning by leveraging semantic information from a conditioning image. Adapting generative models to new datasets usually requires finetuning, which is impractical for large models and datasets. This paper explores an alternative based on in-context learning. Semantica consists of a pretrained image encoder (DINOv2) and a diffusion model. It is trained on image pairs from the same webpage, learning to generate images that share semantic content. Data filtering based on semantic similarity is used to improve performance. Token-level conditioning from the image encoder outperforms global feature conditioning. Semantic data filtering significantly improves generation quality. Semantica generalizes well to unseen datasets, outperforming a label-conditioned baseline on out-of-distribution datasets. Training Semantica requires significant computational resources. The model relies on a frozen encoder, which could limit performance. generative models, diffusion models, image generation, transfer learning, in-context learning
2405.14855 Report Synergistic Global-space Camera and Human Reconstruction from Videos Yizhou Zhao, Tuanfeng Y. Wang, Bhiksha Raj, Min Xu, Jimei Yang, Chun-Hao Paul Huang Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet, the two problems have largely been approached independently, without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale, while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning with cameras and scenes. This work introduces Synergistic Camera and Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically, we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and scene point clouds using camera-frame HMR as a strong prior, addressing depth, scale, and dynamic ambiguities. Conditioning on the dense scene recovered, we further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by incorporating spatio-temporal coherency and dynamic scene constraints. Together, they lead to consistent reconstructions of camera trajectories, human meshes, and dense scene point clouds in a common world frame. Project page: https://paulchhuang.github.io/synchmr This paper introduces SynCHMR, a novel pipeline that reconstructs metric-scale camera trajectories, human meshes, and dense scene point clouds from monocular videos by jointly optimizing human mesh recovery and SLAM. Existing methods for reconstructing humans and scenes from videos often treat these problems independently, leading to inconsistencies and ambiguities in scale, depth, and dynamic movements. The pipeline uses camera-frame human mesh estimates as a prior to disambiguate SLAM and calibrate depth. Subsequently, a Scene-aware SMPL Denoiser refines human mesh poses in the world frame by leveraging the reconstructed dynamic scene. SynCHMR outperforms state-of-the-art methods in global human motion estimation on EgoBody dataset. Human-aware Metric SLAM effectively calibrates monocular depth and improves camera pose estimation. Scene-aware SMPL Denoiser effectively leverages scene information to improve human mesh denoising. The method currently uses an approximated focal length, potentially limiting accuracy in cases with significant perspective distortion. Handling subjects with body shapes not well-represented in the SMPL model remains an open challenge. human mesh recovery, slam, 3d human reconstruction, scene reconstruction, monocular video
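One simple way to realize the metric-scale calibration idea described above is to compare the metric depth implied by the camera-frame SMPL body with the up-to-scale SLAM/monocular depth at pixels covered by the person, and take a robust ratio. The median estimator and all names here are illustrative choices, not SynCHMR's actual optimization.

```python
import torch

def calibrate_scale(slam_depth: torch.Tensor, smpl_depth: torch.Tensor,
                    human_mask: torch.Tensor) -> torch.Tensor:
    # slam_depth: up-to-scale depth (H, W); smpl_depth: metric depth rendered from the
    # SMPL body (H, W); human_mask: boolean (H, W) marking pixels covered by the person
    ratios = smpl_depth[human_mask] / slam_depth[human_mask].clamp(min=1e-6)
    return ratios.median()

slam_depth = torch.rand(480, 640) + 0.5
human_mask = torch.zeros(480, 640, dtype=torch.bool)
human_mask[200:300, 300:360] = True
smpl_depth = slam_depth * 2.7                 # pretend metric depth of the body region
scale = calibrate_scale(slam_depth, smpl_depth, human_mask)
metric_depth = slam_depth * scale             # depth lifted to (roughly) metric scale
```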
2405.14854 Report TerDiT: Ternary Diffusion Models with Transformers Xudong Lu, Aojun Zhou, Ziyi Lin, Qi Liu, Yuhui Xu, Renrui Zhang, Yafei Wen, Shuai Ren, Peng Gao, Junchi Yan, Hongsheng Li Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion models based on transformer architecture (DiTs). Among these diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boasting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their extensive parameter numbers. Although existing research has explored efficient deployment techniques for diffusion models such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, in this paper, we propose TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion models with transformers. We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B. Our work contributes to the exploration of efficient deployment strategies for large-scale DiT models, demonstrating the feasibility of training extremely low-bit diffusion transformer models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code will be available at https://github.com/Lucky-Lance/TerDiT. Presents TerDiT, a novel quantization-aware training and efficient deployment scheme for ternary diffusion transformer models, significantly reducing model size and memory consumption. Large-scale DiT models, while powerful, are expensive to deploy due to their extensive parameter count. Quantization offers a solution, but existing methods are either limited to U-Net models or rely on less effective post-training techniques. Leveraging quantization-aware training (QAT), the method ternarizes DiT network weights and introduces RMS normalization within the adaLN module to enhance training stability and performance. Deployment is achieved using a 2-bit implementation for practical efficiency. TerDiT-4.2B achieves comparable image generation quality to full-precision DiT-XL/2 on ImageNet 256x256 benchmark, even with fewer training images. Deployment efficiency is significantly improved, with over tenfold reduction in checkpoint size and about sixfold reduction in inference memory consumption compared to full-precision counterparts. The introduced RMS normalized adaLN module is shown to accelerate convergence and enhance performance compared to directly ternarizing the adaLN module. Training ternary DiT models remains less stable and more time-consuming than full-precision networks, demanding further research to improve training efficiency. Experiments are limited to ImageNet 256x256 resolution and label-conditioned generation due to computational resource constraints, leaving exploration of higher resolutions and text-to-image generation for future work. diffusion models, quantization, ternary networks, diffusion transformer (dit), efficient deployment
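For readers unfamiliar with QAT of ternary weights, the generic building block is ternarization with a straight-through estimator: the forward pass uses weights snapped to {-scale, 0, +scale}, while gradients flow to the latent full-precision weights. The absmean scaling used here follows common ternary-network practice and is only an assumption about TerDiT's exact scheme.

```python
import torch

def ternarize(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().mean().clamp(min=1e-8)
    w_t = torch.clamp(torch.round(w / scale), -1.0, 1.0) * scale
    # straight-through estimator: forward uses w_t, backward behaves like identity on w
    return w + (w_t - w).detach()

w = torch.randn(256, 256, requires_grad=True)   # latent full-precision weight
w_q = ternarize(w)                              # values in {-scale, 0, +scale}
(w_q ** 2).mean().backward()                    # gradients reach w despite quantization
print(w.grad is not None)                       # True
```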
2405.14832 Report Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model scalable to in-the-wild input images, without requiring a multiview diffusion model or SDS optimization. Our approach comprises two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes high-resolution 3D shapes into a compact and continuous latent triplane space. Notably, our method directly supervises the decoded geometry using a semi-continuous surface sampling strategy, diverging from previous methods relying on rendered images as supervision signals. D3D-DiT models the distribution of encoded 3D latents and is specifically designed to fuse positional information from the three feature maps of the triplane latent, enabling a native 3D generative model scalable to large-scale 3D datasets. Additionally, we introduce an innovative image-to-3D generation pipeline incorporating semantic and pixel-level image conditions, allowing the model to produce 3D shapes consistent with the provided conditional image input. Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://nju-3dv.github.io/projects/Direct3D/. This paper introduces Direct3D, a novel image-to-3D generation method leveraging a native 3D diffusion model directly trained on a large-scale 3D dataset, bypassing the need for multi-view diffusion or SDS optimization. Existing 3D generation methods struggle to achieve both high fidelity and generalizability due to limitations in 3D representations and reliance on indirect generation from multi-view images. Direct3D addresses these limitations, enabling high-quality 3D asset creation from in-the-wild images. Direct3D employs a two-stage approach: 1) D3D-VAE encodes 3D shapes into a compact triplane latent space with direct geometry supervision, and 2) D3D-DiT, a 3D diffusion transformer, generates 3D shapes from this latent space conditioned on input images, incorporating both pixel-level and semantic-level image information. Direct3D outperforms existing image-to-3D approaches in terms of generation quality and generalization ability on the GSO dataset. The method generalizes well to text-to-3D generation by utilizing text-to-image models, producing high-quality meshes consistent with the text prompts. Ablation studies demonstrate the effectiveness of the explicit triplane latent representation, semi-continuous surface sampling strategy, and the D3D-DiT architecture for achieving superior performance. Direct3D is currently limited to generating individual or multiple objects and cannot handle large-scale scene generation. Future research will explore extending the method to enable large-scale scene generation. 3d generation, image-to-3d, text-to-3d, diffusion models, variational autoencoder
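To ground the triplane latent mentioned above, here is how a triplane is typically queried: project a 3D point onto the XY, XZ, and YZ feature planes, bilinearly sample each, and aggregate the features before decoding to SDF/occupancy. Plane resolution, channel count, and the sum aggregation are illustrative; Direct3D's decoder details differ.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    # planes: (3, C, R, R) feature planes; pts: (N, 3) query points in [-1, 1]
    coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]   # XY, XZ, YZ projections
    feats = []
    for p, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                             # (1, N, 1, 2)
        f = F.grid_sample(p.unsqueeze(0), grid, align_corners=True)  # (1, C, N, 1)
        feats.append(f.view(p.shape[0], -1).t())                # (N, C)
    return sum(feats)                                           # aggregated (N, C)

planes = torch.randn(3, 32, 64, 64)
pts = torch.rand(1024, 3) * 2 - 1
feat = sample_triplane(planes, pts)   # would be fed to a small MLP predicting SDF values
```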
2405.14828 Report Good Seed Makes a Good Crop: Discovering Secret Seeds in Text-to-Image Diffusion Models Katherine Xu, Lingzhi Zhang, Jianbo Shi Recent advances in text-to-image (T2I) diffusion models have facilitated creative and photorealistic image synthesis. By varying the random seeds, we can generate various images for a fixed text prompt. Technically, the seed controls the initial noise and, in multi-step diffusion inference, the noise used for reparameterization at intermediate timesteps in the reverse diffusion process. However, the specific impact of the random seed on the generated images remains relatively unexplored. In this work, we conduct a large-scale scientific study into the impact of random seeds during diffusion inference. Remarkably, we reveal that the best 'golden' seed achieved an impressive FID of 21.60, compared to the worst 'inferior' seed's FID of 31.97. Additionally, a classifier can predict the seed number used to generate an image with over 99.9% accuracy in just a few epochs, establishing that seeds are highly distinguishable based on generated images. Encouraged by these findings, we examined the influence of seeds on interpretable visual dimensions. We find that certain seeds consistently produce grayscale images, prominent sky regions, or image borders. Seeds also affect image composition, including object location, size, and depth. Moreover, by leveraging these 'golden' seeds, we demonstrate improved image generation such as high-fidelity inference and diversified sampling. Our investigation extends to inpainting tasks, where we uncover some seeds that tend to insert unwanted text artifacts. Overall, our extensive analyses highlight the importance of selecting good seeds and offer practical utility for image generation. This paper presents the first large-scale analysis of random seeds in text-to-image diffusion models, revealing their significant impact on image quality, style, and composition. Understanding the role of seeds enables the development of simple yet effective techniques to enhance image generation during inference without requiring model retraining or fine-tuning. The authors generated over 46 million images using two diffusion models and various text prompts, then analyzed the impact of seeds on image features, style representations, and object compositions. Seeds are highly discriminative, with a classifier achieving over 99.9% accuracy in predicting the seed from generated images. Specific seeds consistently produce stylistic patterns (e.g., grayscale, sky regions) and compositional elements (e.g., object location, size). Leveraging ‘golden’ seeds significantly improves image quality and human preference scores compared to random sampling. The study primarily focuses on 1,024 seeds due to budget constraints, potentially limiting the generalizability of findings. The research relies on pretrained models trained on large-scale web data, which may contain biases and errors that could influence the results. text-to-image synthesis, diffusion models, random seeds, image quality, image composition
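The mechanism under study is simple to reproduce: the seed fixes the initial Gaussian latent (and any noise re-injected at intermediate steps), so a well-performing seed can be reused across prompts. A minimal sketch follows; the latent shape matches SD-style 512x512 generation and the seed value is arbitrary, not one of the paper's ranked seeds.

```python
import torch

def initial_latent(seed: int, shape=(1, 4, 64, 64), device="cpu") -> torch.Tensor:
    # A dedicated generator isolates the seed from global RNG state.
    gen = torch.Generator(device=device).manual_seed(seed)
    return torch.randn(shape, generator=gen, device=device)

SEED = 42                              # any fixed seed; the paper ranks 1,024 of them
z_a = initial_latent(SEED)
z_b = initial_latent(SEED)
assert torch.equal(z_a, z_b)           # same seed -> identical starting noise
```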
2405.14793 Report SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow Yihan Wang, Lahav Lipson, Jia Deng We introduce SEA-RAFT, a more simple, efficient, and accurate RAFT for optical flow. Compared with RAFT, SEA-RAFT is trained with a new loss (mixture of Laplace). It directly regresses an initial flow for faster convergence in iterative refinements and introduces rigid-motion pre-training to improve generalization. SEA-RAFT achieves state-of-the-art accuracy on the Spring benchmark with a 3.69 endpoint-error (EPE) and a 0.36 1-pixel outlier rate (1px), representing 22.9% and 17.8% error reduction from best published results. In addition, SEA-RAFT obtains the best cross-dataset generalization on KITTI and Spring. With its high efficiency, SEA-RAFT operates at least 2.3x faster than existing methods while maintaining competitive performance. The code is publicly available at https://github.com/princeton-vl/SEA-RAFT. SEA-RAFT, a simpler, more efficient, and accurate variant of RAFT for optical flow estimation. Achieves state-of-the-art accuracy and speed, making it useful for real-world high-resolution optical flow. Introduces a mixture of Laplace loss to handle ambiguous cases, directly regresses initial flow for faster convergence, employs rigid-flow pre-training on TartanAir for better generalization, and simplifies RAFT's architecture. Achieves state-of-the-art accuracy on Spring benchmark, outperforming the next best method by a large margin (18% error reduction on 1px outlier rate and 24% on endpoint error). Obtains the best cross-dataset generalization on KITTI and Spring. Operates at least 2.3x faster than existing methods while maintaining competitive performance. Zero-shot performance on Sintel's final pass is not as competitive. Explore the reasons behind the performance gap on Sintel's final pass and investigate further improvements. optical flow, raft, deep learning, computer vision, mixture of laplace loss
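As a concrete reading of the mixture-of-Laplace loss named above, here is the negative log-likelihood of a two-component Laplace mixture on the per-pixel flow error, with the mixture weight and log-scales predicted per pixel. This parameterization is one plausible sketch, not necessarily SEA-RAFT's exact head.

```python
import math
import torch
import torch.nn.functional as F

def mixture_laplace_nll(pred_flow, gt_flow, logit_alpha, log_b1, log_b2):
    # pred_flow, gt_flow: (B, 2, H, W); logit_alpha, log_b1, log_b2: (B, 1, H, W)
    err = (pred_flow - gt_flow).abs().sum(dim=1, keepdim=True)   # |e_u| + |e_v|
    log_w1 = F.logsigmoid(logit_alpha)                           # log mixture weight 1
    log_w2 = F.logsigmoid(-logit_alpha)                          # log mixture weight 2
    # factorized 2D Laplace: log p = -2 log(2b) - (|e_u| + |e_v|) / b
    logp1 = log_w1 - 2.0 * (math.log(2.0) + log_b1) - err * torch.exp(-log_b1)
    logp2 = log_w2 - 2.0 * (math.log(2.0) + log_b2) - err * torch.exp(-log_b2)
    return -torch.logsumexp(torch.stack([logp1, logp2]), dim=0).mean()

B, H, W = 2, 32, 32
loss = mixture_laplace_nll(
    torch.randn(B, 2, H, W), torch.randn(B, 2, H, W),
    torch.zeros(B, 1, H, W), torch.zeros(B, 1, H, W), torch.ones(B, 1, H, W))
```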
2405.14785 Report EditWorld: Simulating World Dynamics for Instruction-Following Image Editing Ling Yang, Bohan Zeng, Jiaming Liu, Hong Li, Minghao Xu, Wentao Zhang, Shuicheng Yan Diffusion models have significantly improved the performance of image editing. Existing methods realize various approaches to achieve high-quality image editing, including but not limited to text control, dragging operation, and mask-and-inpainting. Among these, instruction-based editing stands out for its convenience and effectiveness in following human instructions across diverse scenarios. However, it still focuses on simple editing operations like adding, replacing, or deleting, and falls short of understanding aspects of world dynamics that convey the realistic dynamic nature in the physical world. Therefore, this work, EditWorld, introduces a new editing task, namely world-instructed image editing, which defines and categorizes the instructions grounded by various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT-3.5, Video-LLava and SDXL). To enable sufficient simulation of world dynamics for image editing, our EditWorld trains the model on the curated dataset and improves instruction-following ability with a designed post-edit strategy. Extensive experiments demonstrate our method significantly outperforms existing editing methods in this new task. Our dataset and code will be available at https://github.com/YangLing0818/EditWorld This paper proposes "world-instructed image editing," a new image editing task focusing on real-world and virtual-world dynamics beyond simple object manipulation. Existing instruction-based image editing methods struggle to simulate realistic physical dynamics, limiting their ability to handle complex editing scenarios grounded in real-world logic. The authors curate a new dataset with world instructions, utilizing GPT-3.5, SDXL, ControlNet, Video-LLava, and human re-checking. They finetune an InstructPix2Pix model and propose a "post-edit" strategy to refine results and preserve non-edited areas. The proposed method significantly outperforms existing methods in CLIP score and a newly introduced "MLLM score" across various instruction categories. Qualitative analysis demonstrates the method's ability to handle complex edits grounded in world dynamics, surpassing baselines in visual quality and instruction following. Ablation study confirms the effectiveness of "post-edit" in preserving non-edited areas while maintaining editing quality. The current dataset, while diverse, is limited in size and lacks precise editing examples for complex scenarios. Accurately evaluating subtle differences in world-instructed edits remains challenging, requiring further research in multimodal difference recognition. image editing, diffusion models, world dynamics, instruction following, multimodal learning
2405.14739 Report FLoRA: Low-Rank Core Space for N-dimension Chongjie Si, Xuehui Wang, Xue Yang, Zhengqin Xu, Qingyun Li, Jifeng Dai, Yu Qiao, Xiaokang Yang, Wei Shen Adapting pre-trained foundation models for various downstream tasks has been prevalent in artificial intelligence. Due to the vast number of tasks and high costs, adjusting all parameters becomes unfeasible. To mitigate this, several fine-tuning techniques have been developed to update the pre-trained model weights in a more resource-efficient manner, such as through low-rank adjustments. Yet, almost all of these methods focus on linear weights, neglecting the intricacies of parameter spaces in higher dimensions like 4D. Alternatively, some methods can be adapted for high-dimensional parameter space by compressing changes in the original space into two dimensions and then employing low-rank matrix decomposition. However, these approaches destroy the structural integrity of the involved high-dimensional spaces. To tackle the diversity of dimensional spaces across different foundation models and provide a more precise representation of the changes within these spaces, this paper introduces a generalized parameter-efficient fine-tuning framework, FLoRA, designed for parameter spaces of various dimensions. Specifically, utilizing Tucker decomposition, FLoRA asserts that changes in each dimensional parameter space are based on a low-rank core space which maintains a topological structure consistent with the original space. It then models the changes through this core space alongside corresponding weights to reconstruct alterations in the original space. FLoRA effectively preserves the structural integrity of the change in the original N-dimensional parameter space while decomposing it via low-rank tensor decomposition. Extensive experiments on computer vision, natural language processing and multi-modal tasks validate FLoRA's effectiveness. Codes are available at https://github.com/SJTU-DeepVisionLab/FLoRA. This paper proposes FLoRA, a novel parameter-efficient fine-tuning framework that utilizes Tucker decomposition to adapt pre-trained foundation models for diverse downstream tasks while preserving the structural integrity of parameter spaces in various dimensions. Adapting large pre-trained models to various downstream tasks is computationally expensive. Existing fine-tuning techniques often focus on linear weights and neglect the structural intricacies of higher-dimensional parameter spaces, leading to suboptimal performance. FLoRA leverages Tucker decomposition to represent changes in parameter spaces using a low-rank core tensor and corresponding weights, effectively preserving the topological structure of the original parameter space. This approach is applied to both linear and convolutional layers, demonstrating its versatility across different model architectures. FLoRA consistently outperforms state-of-the-art parameter-efficient fine-tuning methods, including LoRA and DoRA, across computer vision, natural language processing, and multi-modal tasks. Empirical analysis reveals that FLoRA's low-rank representation captures task-specific information more effectively than competing methods, leading to improved performance. FLoRA demonstrates comparable training efficiency to existing methods while achieving superior results, highlighting its practicality for real-world applications. The scaling factor in FLoRA currently needs to be tuned for different model backbones. Future work could explore a unified scaling strategy across diverse architectures. parameter-efficient fine-tuning, foundation models, tucker decomposition, low-rank tensor decomposition, structural integrity preservation
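A compact sketch of a Tucker-style low-rank update for a 4D convolution weight, the kind of decomposition described above: a small core tensor plus one factor matrix per mode, combined with einsum and added to the frozen kernel. Ranks and layer sizes are arbitrary, and the initialization choices are assumptions rather than FLoRA's published settings.

```python
import torch

O, I, K = 64, 32, 3                     # out channels, in channels, kernel size
r = (8, 8, 2, 2)                        # Tucker ranks per mode

W0 = torch.randn(O, I, K, K)            # frozen pretrained conv weight
core = torch.zeros(*r)                  # trainable core tensor (zero init => zero update)
U_o, U_i = torch.randn(O, r[0]), torch.randn(I, r[1])   # trainable factor matrices
U_h, U_w = torch.randn(K, r[2]), torch.randn(K, r[3])

# delta_W[o, i, h, w] = sum_{a,b,c,d} core[a,b,c,d] * U_o[o,a] * U_i[i,b] * U_h[h,c] * U_w[w,d]
delta_W = torch.einsum("abcd,oa,ib,hc,wd->oihw", core, U_o, U_i, U_h, U_w)
W = W0 + delta_W                        # kernel used by the conv layer during fine-tuning

n_trainable = core.numel() + U_o.numel() + U_i.numel() + U_h.numel() + U_w.numel()
print(n_trainable, "vs", W0.numel())    # 1036 vs 18432 trainable parameters
```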
2405.14705 Report Learning Multi-dimensional Human Preference for Text-to-Image Generation Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang Current metrics for text-to-image models typically rely on statistical metrics which inadequately represent the real preference of humans. Although recent work attempts to learn these preferences via human annotated images, they reduce the rich tapestry of human preference to a single overall score. However, the preference results vary when humans evaluate images with different aspects. Therefore, to learn the multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces the preference condition module upon CLIP model to learn these diverse preferences. It is trained based on our Multi-dimensional Human Preference (MHP) Dataset, which comprises 918,315 human preference choices across four dimensions (i.e., aesthetics, semantic alignment, detail quality and overall assessment) on 607,541 images. The images are generated by a wide range of latest text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, enabling it a promising metric for evaluating and improving text-to-image generation. This paper introduces the Multi-dimensional Preference Score (MPS), a novel model for evaluating text-to-image models by considering multi-dimensional human preferences. Existing evaluation metrics for text-to-image models often rely on statistical measures that do not fully capture the diverse preferences of humans. The authors create the Multi-dimensional Human Preference (MHP) dataset, containing images annotated with preferences across aesthetics, detail quality, semantic alignment, and overall assessment. They then develop MPS, which leverages a condition mask to focus on prompt elements relevant to specific preference dimensions when predicting scores. MPS outperforms existing scoring methods on three datasets in predicting overall preference. MPS demonstrates superior performance in evaluating multi-dimensional preferences compared to methods primarily focused on overall scores. Visualization reveals that MPS attends to different image and prompt regions based on the given preference condition, highlighting its ability to capture diverse preferences. The current preference condition setting in MPS relies on predefined word sets, which might not encompass the full spectrum of human preferences. Future work can explore personalized preference learning, enabling MPS to adapt to individual user preferences. text-to-image generation, evaluation metrics, human preferences, multi-dimensional preference score, vision-language models
2405.14701 Report High Fidelity Scene Text Synthesis Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin Scene text synthesis involves rendering specified texts onto arbitrary images. Current methods typically formulate this task in an end-to-end manner but lack effective character-level guidance during training. Besides, their text encoders, pre-trained on a single font type, struggle to adapt to the diverse font styles encountered in practical applications. Consequently, these methods suffer from character distortion, repetition, and absence, particularly in polystylistic scenarios. To this end, this paper proposes DreamText for high-fidelity scene text synthesis. Our key idea is to reconstruct the diffusion training process, introducing more refined guidance tailored to this task, to expose and rectify the model's attention at the character level and strengthen its learning of text regions. This transformation poses a hybrid optimization challenge, involving both discrete and continuous variables. To effectively tackle this challenge, we employ a heuristic alternate optimization strategy. Meanwhile, we jointly train the text encoder and generator to comprehensively learn and utilize the diverse font present in the training dataset. This joint training is seamlessly integrated into the alternate optimization process, fostering a synergistic relationship between learning character embedding and re-estimating character attention. Specifically, in each step, we first encode potential character-generated position information from cross-attention maps into latent character masks. These masks are then utilized to update the representation of specific characters in the current step, which, in turn, enables the generator to correct the character's attention in the subsequent steps. Both qualitative and quantitative results demonstrate the superiority of our method to the state of the art. This paper proposes DreamText, a novel diffusion-based model for high-fidelity scene text synthesis that addresses the limitations of existing methods in accurately rendering text within complex scenes. Existing methods struggle with character distortion, repetition, and absence due to insufficient character-level guidance during training and a limited ability to adapt to diverse font styles. DreamText reconstructs the diffusion training process by introducing refined guidance through latent character masks. It employs a heuristic alternate optimization strategy to address the hybrid optimization problem and jointly trains the text encoder and generator to learn diverse font styles. DreamText effectively alleviates character repetition, absence, and distortion issues. The heuristic alternate optimization strategy fosters a synergistic relationship between learning character representation and re-estimating character attention. A balanced supervision strategy strikes a balance between constraining the model and allowing flexibility in estimating optimal character positions. DreamText currently lacks the capability to modify multiple regions within an image simultaneously. The generation of realistic text raises privacy concerns, demanding robust safeguards and ethical guidelines. scene text synthesis, diffusion models, character attention, font diversity, heuristic optimization
2405.14677 Report RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance Zhicheng Sun, Zhenhao Yang, Yang Jin, Haozhe Chi, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Di Zhang, Yang Song, Kun Gai, Yadong Mu Customizing diffusion models to generate identity-preserving images from user-provided reference images is an intriguing new problem. The prevalent approaches typically require training on extensive domain-specific images to achieve identity preservation, which lacks flexibility across different use cases. To address this issue, we exploit classifier guidance, a training-free technique that steers diffusion models using an existing classifier, for personalized image generation. Our study shows that based on a recent rectified flow framework, the major limitation of vanilla classifier guidance in requiring a special classifier can be resolved with a simple fixed-point solution, allowing flexible personalization with off-the-shelf image discriminators. Moreover, its solving procedure proves to be stable when anchored to a reference flow trajectory, with a convergence guarantee. The derived method is implemented on rectified flow with different off-the-shelf image discriminators, delivering advantageous personalization results for human faces, live subjects, and certain objects. Code is available at https://github.com/feifeiobama/RectifID. The paper proposes a training-free personalized image generation method called anchored classifier guidance, which customizes rectified flow using off-the-shelf image discriminators. The method addresses limitations of existing personalized image generation techniques that require extensive domain-specific training data or fine-tuning, enabling greater flexibility and identity consistency. The method approximates rectified flow as ideally straight, reformulating classifier guidance as a fixed-point problem solved iteratively. It anchors the flow trajectory to a reference trajectory for improved stability and convergence. The training-free method achieves state-of-the-art performance in face-centric personalization benchmarks, surpassing training-based methods in identity preservation. The approach demonstrates flexibility by effectively personalizing images with various subjects beyond human faces, including animals and regularly shaped objects. The method successfully extends to multi-subject personalization, composing multiple subjects into an image while maintaining identity and visual quality. Theoretical guarantees are limited to ideal rectified flow and may not generalize to complex flow-based models. The method's effectiveness is currently limited for objects with large structural variations, and its inference time is not yet as fast as some training-based methods. personalized image generation, rectified flow, classifier guidance, diffusion models, training-free
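A very schematic sketch of fixed-point-style guidance under the "ideally straight flow" assumption x1 ≈ x_t + (1 - t) v(x_t, t): the discriminator loss on the predicted endpoint is pulled back to the current state, and the update is damped toward an anchor trajectory for stability. The velocity model, loss, step sizes, and iteration count are all placeholders, not RectifID's derived solver.

```python
import torch

def guided_step(x_t, t, v_theta, id_loss, x_anchor, n_iter=3, eta=0.1, lam=0.5):
    for _ in range(n_iter):
        x = x_t.detach().requires_grad_(True)
        x1_hat = x + (1.0 - t) * v_theta(x, t)       # straight-line endpoint estimate
        g, = torch.autograd.grad(id_loss(x1_hat), x) # discriminator gradient w.r.t. state
        x_t = x_t - eta * g                          # move toward lower identity loss
        x_t = lam * x_t + (1.0 - lam) * x_anchor     # anchor to the reference trajectory
    return x_t

# toy stand-ins so the sketch runs end to end
v_theta = lambda x, t: -x
id_loss = lambda x1: (x1 - 1.0).pow(2).mean()
x = torch.randn(1, 4)
x = guided_step(x, t=0.3, v_theta=v_theta, id_loss=id_loss, x_anchor=x.clone())
```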
2405.14633 Report Flatten Anything: Unsupervised Neural Surface Parameterization Qijian Zhang, Junhui Hou, Wenping Wang, Ying He Surface parameterization plays an essential role in numerous computer graphics and geometry processing applications. Traditional parameterization approaches are designed for high-quality meshes laboriously created by specialized 3D modelers, thus unable to meet the processing demand for the current explosion of ordinary 3D data. Moreover, their working mechanisms are typically restricted to certain simple topologies, thus relying on cumbersome manual efforts (e.g., surface cutting, part segmentation) for pre-processing. In this paper, we introduce the Flatten Anything Model (FAM), an unsupervised neural architecture to achieve global free-boundary surface parameterization via learning point-wise mappings between 3D points on the target geometric surface and adaptively-deformed UV coordinates within the 2D parameter domain. To mimic the actual physical procedures, we ingeniously construct geometrically-interpretable sub-networks with specific functionalities of surface cutting, UV deforming, unwrapping, and wrapping, which are assembled into a bi-directional cycle mapping framework. Compared with previous methods, our FAM directly operates on discrete surface points without utilizing connectivity information, thus significantly reducing the strict requirements for mesh quality and even applicable to unstructured point cloud data. More importantly, our FAM is fully-automated without the need for pre-cutting and can deal with highly-complex topologies, since its learning process adaptively finds reasonable cutting seams and UV boundaries. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our proposed neural surface parameterization paradigm. The code will be publicly available. Introduces FAM, an unsupervised neural architecture for global free-boundary surface parameterization, learning point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. Addresses limitations of traditional parameterization methods that require high-quality meshes, manual pre-processing, and struggle with complex topologies. Utilizes a bi-directional cycle mapping framework with sub-networks mimicking surface cutting, UV deforming, unwrapping, and wrapping, trained by minimizing various loss functions and enforcing differential geometric constraints. Outperforms SLIM qualitatively and quantitatively in UV unwrapping and texture mapping on open surface models. Demonstrates universality and robustness in parameterizing surfaces with varying geometric and topological complexities. Shows the effectiveness of the bi-directional cycle mapping framework through ablation studies. Current per-model overfitting limits generalization ability from existing UV unwrapping data. Future work includes incorporating advanced properties like shape symmetry, cutting seam visibility, and seamless parameterization. surface parameterization, uv unwrapping, deep learning, cycle mapping, unsupervised learning
2405.14582 Report PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Poses Yong Zhong, Min Zhao, Zebin You, Xiaofeng Yu, Changwang Zhang, Chongxuan Li In this paper, we introduce PoseCrafter, a one-shot method for personalized video generation following the control of flexible poses. Built upon Stable Diffusion and ControlNet, we carefully design an inference process to produce high-quality videos without the corresponding ground-truth frames. First, we select an appropriate reference frame from the training video and invert it to initialize all latent variables for generation. Then, we insert the corresponding training pose into the target pose sequences to enhance faithfulness through a trained temporal attention module. Furthermore, to alleviate the face and hand degradation resulting from discrepancies between poses of training videos and inference poses, we implement simple latent editing through an affine transformation matrix involving facial and hand landmarks. Extensive experiments on several datasets demonstrate that PoseCrafter achieves superior results to baselines pre-trained on a vast collection of videos under 8 commonly used metrics. Besides, PoseCrafter can follow poses from different individuals or artificial edits and simultaneously retain the human identity in an open-domain training video. PoseCrafter is a one-shot method for generating personalized videos that follow flexible pose control, requiring only fine-tuning on a single video. Existing methods struggle with data requirements, computational costs, and reliance on real video frames corresponding to target poses. This work offers a more efficient and flexible solution for personalized video generation. The method uses Stable Diffusion and ControlNet, enhanced by a novel inference process involving reference-frame selection and insertion for faithfulness, and latent editing for refining face and hand details. PoseCrafter outperforms baselines pre-trained on large video datasets on 8 common metrics, including MagicAnimate and Disco. It can effectively follow flexible pose control, including poses from the same or different individuals, and artificially designed poses. The method exhibits strong performance in preserving identity and details from the training video, even with limited training data. Video quality is limited by the capabilities of the underlying ControlNet and diffusion models, particularly with complex poses. Large differences between training and inference poses can lead to degradation in generated video quality. Further research on constructing pseudo reference videos is needed. personalized video generation, pose guidance, one-shot learning, diffusion models, latent editing
2405.14580 Report LDM: Large Tensorial SDF Model for Textured Mesh Generation Rengan Xie, Wenting Zheng, Kai Huang, Yizheng Chen, Qi Wang, Qi Ye, Wei Chen, Yuchi Huo Previous efforts have managed to generate production-ready 3D assets from text or images. However, these methods primarily employ NeRF or 3D Gaussian representations, which are not adept at producing smooth, high-quality geometries required by modern rendering pipelines. In this paper, we propose LDM, a novel feed-forward framework capable of generating high-fidelity, illumination-decoupled textured mesh from a single image or text prompts. We firstly utilize a multi-view diffusion model to generate sparse multi-view inputs from single images or text prompts, and then a transformer-based model is trained to predict a tensorial SDF field from these sparse multi-view image inputs. Finally, we employ a gradient-based mesh optimization layer to refine this model, enabling it to produce an SDF field from which high-quality textured meshes can be extracted. Extensive experiments demonstrate that our method can generate diverse, high-quality 3D mesh assets with corresponding decomposed RGB textures within seconds. LDM, a novel feed-forward framework that generates high-fidelity, illumination-decoupled textured mesh from a single image or text prompts within seconds. Existing methods for generating 3D assets from text or images often produce low-quality geometries or lack illumination-decoupled textures, limiting their use in applications requiring high-quality assets. A multi-view diffusion model generates sparse multi-view images, a transformer-based model predicts a tensorial SDF field, and a gradient-based mesh optimization layer refines the SDF field for high-quality mesh extraction. LDM generates high-quality 3D mesh assets with decomposed RGB textures in seconds. Tensorial SDF representation enhances object surface quality and convergence speed. Two-stage training strategy (volume rendering followed by gradient-based mesh optimization) improves geometric details and texture clarity. Limited resolution of tensorial SDF tokens constrains final 3D asset resolution. Illumination decoupling module not designed for complex materials like translucent surfaces. 3d generation, text-to-3d, image-to-3d, tensorial sdf, illumination decoupling
2405.14554 Report UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang Large vision-language models (LVLMs), such as the LLaVA series, are ignorant of up-to-date knowledge because they cannot be updated frequently due to the large amount of resources required, and therefore fail in many cases. For example, an LVLM released in January 2024 wouldn't know the detailed plot of the movie Dune 2, which wasn't released until February 2024. To solve the problem, a promising solution is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG), which is already integrated in some closed-source commercial LVLMs such as GPT-4V. However, the specific mechanics underpinning them remain a mystery. In this paper, we propose a plug-and-play framework for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge, dubbed UDKAG. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content from the websites returned by a search engine to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline to automatically generate news-related VQA samples to construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is introduced to label the usefulness of website/content for VQA samples to construct the training set. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4V by about 25% in accuracy. This paper introduces UDKAG, an open-source framework to augment Large Vision-Language Models (LVLMs) with up-to-date knowledge for Visual Question Answering (VQA) tasks. Existing LVLMs are often outdated due to infrequent updates, limiting their ability to answer questions about recent events or information. UDKAG aims to address this limitation by integrating internet search into the VQA process. UDKAG employs a hierarchical filtering model: 1) A website filter scores and filters websites based on titles and snippets, 2) A content filter selects helpful content segments from the filtered websites. This content is then used to prompt LVLMs for more accurate answers. UDKAG significantly improves the accuracy of various LVLMs on the UDK-VQA dataset, specifically designed for evaluating VQA performance on up-to-date knowledge. The hierarchical filtering model effectively identifies and extracts relevant information from websites, outperforming simpler IAG methods. Diversity selection within the framework ensures that LVLMs receive varied content, preventing bias and improving answer accuracy. The hierarchical filtering model is trained separately from the LVLMs, potentially limiting performance. Future work could explore end-to-end training for better integration. The current implementation focuses on VQA tasks. Expanding the framework to other vision-language tasks would further enhance its applicability. vision-language models, visual question answering, internet-augmented generation, up-to-date knowledge, hierarchical filtering
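A rough, heavily simplified sketch of the hierarchical (website-then-content) filtering flow described above. The keyword-overlap scorer, class of placeholder data, and top-k cutoffs below are stand-ins for the trained filter models, not the paper's implementation.

```python
# Hypothetical sketch of two-stage (website -> content) filtering for
# internet-augmented VQA. The trivial keyword-overlap scorer stands in for
# the trained hierarchical filtering model.
from dataclasses import dataclass

@dataclass
class WebPage:
    title: str
    snippet: str
    body: str

def overlap_score(question: str, text: str) -> float:
    q = set(question.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve_context(question: str, pages: list[WebPage],
                     top_sites: int = 3, top_segments: int = 2) -> str:
    # Stage 1: keep the websites whose title + snippet best match the question.
    ranked = sorted(pages,
                    key=lambda p: overlap_score(question, p.title + " " + p.snippet),
                    reverse=True)[:top_sites]
    # Stage 2: split surviving pages into segments and keep the best segments.
    segments = [seg for p in ranked for seg in p.body.split("\n") if seg.strip()]
    segments = sorted(segments, key=lambda s: overlap_score(question, s),
                      reverse=True)[:top_segments]
    return "\n".join(segments)          # prepended to the LVLM prompt

pages = [WebPage("Dune: Part Two review", "Plot details of the 2024 film ...",
                 "Dune: Part Two follows Paul Atreides ...\nBox office figures ...")]
print(retrieve_context("What happens in Dune 2?", pages))
```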
2405.14480 Report Scalable Visual State Space Model with Fractal Scanning Lv Tang, HaoKe Xiao, Peng-Tao Jiang, Hao Zhang, Jinwei Chen, Bo Li Foundational models have significantly advanced in natural language processing (NLP) and computer vision (CV), with the Transformer architecture becoming a standard backbone. However, the Transformer's quadratic complexity poses challenges for handling longer sequences and higher resolution images. To address this challenge, State Space Models (SSMs) like Mamba have emerged as efficient alternatives, initially matching Transformer performance in NLP tasks and later surpassing Vision Transformers (ViTs) in various CV tasks. To improve the performance of SSMs, one crucial aspect is effective serialization of image patches. Existing methods, relying on linear scanning curves, often fail to capture complex spatial relationships and produce repetitive patterns, leading to biases. To address these limitations, we propose using fractal scanning curves for patch serialization. Fractal curves maintain high spatial proximity and adapt to different image resolutions, avoiding redundancy and enhancing SSMs' ability to model complex patterns accurately. We validate our method in image classification, detection, and segmentation tasks, and the superior performance validates its effectiveness. This paper introduces a novel approach to enhance State Space Models (SSMs) for image processing by employing fractal scanning curves for image patch serialization, which surpasses the limitations of traditional linear scanning methods. Effective serialization of image patches is crucial for SSMs in computer vision, as it directly impacts their ability to capture and model intricate spatial relationships within images. Existing linear scanning methods often fail to adequately preserve these relationships. The study leverages the Hilbert curve, a type of fractal curve, for its inherent ability to maintain spatial and structural consistency across varying scales. A novel shifting operation is further implemented to refine the fractal curve, enhancing local adjacency and continuity during pixel serialization. FractalMamba, the proposed model, outperforms several benchmark models, including those based on CNNs and ViTs, in image classification, object detection, and semantic segmentation tasks. FractalMamba exhibits superior scalability and efficiency when processing images of increasing resolutions, maintaining consistent performance with a near-linear increase in computational complexity. The implementation of a shifting operation on the fractal curves further improves the model's performance by mitigating the loss of local proximity information. The generalizability of fractal scanning mechanisms across diverse visual data and tasks requires further investigation, as their performance may vary depending on dataset characteristics. Future research can explore additional fractal scanning methods and their combinations to potentially uncover even more effective serialization strategies for enhancing SSM performance. state space models, fractal scanning curves, image serialization, computer vision, hilbert curve
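As a concrete illustration of fractal (Hilbert-curve) patch serialization, the sketch below reorders row-major patch tokens by their index along a Hilbert curve using the standard xy-to-index conversion; the paper's multi-direction scans and shifting operation are not reproduced.

```python
# Sketch: serialize image patch tokens along a Hilbert curve so that spatially
# close patches stay close in the 1D sequence fed to an SSM (Mamba) block.
import torch

def xy_to_hilbert(n, x, y):
    """(x, y) cell on an n x n grid (n a power of two) -> index along the Hilbert curve."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

n = 8                                     # patch grid side (must be a power of two)
tokens = torch.randn(n * n, 192)          # flattened patch tokens, row-major order
order = sorted(range(n * n), key=lambda i: xy_to_hilbert(n, i % n, i // n))
hilbert_tokens = tokens[torch.tensor(order)]   # patch sequence fed to the SSM
```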
2405.14475 Report MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, Qiang Xu While controllable generative models for images and videos have achieved remarkable success, high-quality models for 3D scenes, particularly in unbounded scenarios like autonomous driving, remain underdeveloped due to high data acquisition costs. In this paper, we introduce MagicDrive3D, a novel pipeline for controllable 3D street scene generation that supports multi-condition control, including BEV maps, 3D objects, and text descriptions. Unlike previous methods that reconstruct before training the generative models, MagicDrive3D first trains a video generation model and then reconstructs from the generated data. This innovative approach enables easily controllable generation and static scene acquisition, resulting in high-quality scene reconstruction. To address the minor errors in generated content, we propose deformable Gaussian splatting with monocular depth initialization and appearance modeling to manage exposure discrepancies across viewpoints. Validated on the nuScenes dataset, MagicDrive3D generates diverse, high-quality 3D driving scenes that support any-view rendering and enhance downstream tasks like BEV segmentation. Our results demonstrate the framework's superior performance, showcasing its transformative potential for autonomous driving simulation and beyond. MagicDrive3D is a novel pipeline for controllable 3D street scene generation that supports multi-condition control, including BEV maps, 3D objects, and text descriptions. High-quality controllable generative models for 3D scenes, particularly in unbounded scenarios like autonomous driving, remain underdeveloped due to high data acquisition costs. MagicDrive3D first trains a video generation model with enhanced inter-frame consistency using relative pose embedding. Then, it reconstructs the 3D scene using an enhanced deformable Gaussian Splatting technique, accounting for local dynamics and exposure discrepancies in the generated views. MagicDrive3D generates realistic 3D street scenes with multi-dimensional controllability, as demonstrated by qualitative results and FID scores. The framework enhances the quality of video and scene generation, surpassing baseline methods like MagicDrive and 3DGS in metrics like FVD and FID. Generated scenes from MagicDrive3D can be used to improve the viewpoint robustness of perception tasks like BEV segmentation. The model may struggle to generate complex objects or scenes with high texture detail due to limitations in the reconstruction method. Future work could focus on addressing these limitations and improving the quality and robustness of generated scenes. 3d scene generation, autonomous driving, controllable generation, gaussian splatting, view synthesis
2405.14458 Report YOLOv10: Real-Time End-to-End Object Detection Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, Guiguang Ding Over the past years, YOLOs have emerged as the predominant paradigm in the field of real-time object detection owing to their effective balance between computational cost and detection performance. Researchers have explored the architectural designs, optimization objectives, data augmentation strategies, and others for YOLOs, achieving notable progress. However, the reliance on the non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs and adversely impacts the inference latency. Besides, the design of various components in YOLOs lacks the comprehensive and thorough inspection, resulting in noticeable computational redundancy and limiting the model's capability. It renders the suboptimal efficiency, along with considerable potential for performance improvements. In this work, we aim to further advance the performance-efficiency boundary of YOLOs from both the post-processing and model architecture. To this end, we first present the consistent dual assignments for NMS-free training of YOLOs, which brings competitive performance and low inference latency simultaneously. Moreover, we introduce the holistic efficiency-accuracy driven model design strategy for YOLOs. We comprehensively optimize various components of YOLOs from both efficiency and accuracy perspectives, which greatly reduces the computational overhead and enhances the capability. The outcome of our effort is a new generation of YOLO series for real-time end-to-end object detection, dubbed YOLOv10. Extensive experiments show that YOLOv10 achieves state-of-the-art performance and efficiency across various model scales. For example, our YOLOv10-S is 1.8x faster than RT-DETR-R18 under the similar AP on COCO, meanwhile enjoying 2.8x smaller number of parameters and FLOPs. Compared with YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters for the same performance. This paper presents YOLOv10, a new generation of YOLO series for real-time end-to-end object detection, achieving state-of-the-art performance and efficiency. Existing YOLO models suffer from suboptimal efficiency and accuracy due to the reliance on non-maximum suppression (NMS) for post-processing and computational redundancy in model architecture. The authors propose consistent dual assignments for NMS-free training and a holistic efficiency-accuracy driven model design strategy. This includes a lightweight classification head, spatial-channel decoupled downsampling, rank-guided block design, large-kernel convolution, and a partial self-attention module. YOLOv10 significantly outperforms previous state-of-the-art models in terms of computation-accuracy trade-offs. YOLOv10-S is 1.8x faster than RT-DETR-R18 under similar AP on COCO. Compared to YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters for the same performance. The paper does not investigate the pretraining of YOLOv10 on large-scale datasets. There is still a performance gap between NMS-free training and the original one-to-many training using NMS, especially in small models. object detection, yolo, real-time, end-to-end, nms-free
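The following is a hedged sketch of a one-to-one, NMS-free label assignment in the spirit of a consistent matching metric that combines classification score and IoU: each ground-truth box is matched to its single best prediction. The exponents are illustrative defaults borrowed from task-aligned assignment, and the auxiliary one-to-many head used during training is omitted.

```python
# Hedged sketch of a one-to-one (NMS-free) label assignment: each ground-truth
# box keeps only the prediction maximising a score**alpha * IoU**beta metric,
# so no post-hoc NMS is needed at inference. Exponents are assumptions.
import torch

def box_iou(a, b):
    """IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def one_to_one_assign(gt_boxes, pred_boxes, pred_scores, alpha=0.5, beta=6.0):
    iou = box_iou(gt_boxes, pred_boxes)                 # (num_gt, num_pred)
    metric = pred_scores[None, :].pow(alpha) * iou.pow(beta)
    return metric.argmax(dim=1)                         # one prediction index per GT

gt = torch.tensor([[10., 10., 50., 50.]])
preds = torch.tensor([[12., 11., 48., 52.], [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8])
print(one_to_one_assign(gt, preds, scores))             # -> tensor([0])
```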
2405.14455 Report TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing Teng Xu, Jiamin Chen, Peng Chen, Youjia Zhang, Junqing Yu, Wei Yang Editing objects within a scene is a critical functionality required across a broad spectrum of applications in computer vision and graphics. As 3D Gaussian Splatting (3DGS) emerges as a frontier in scene representation, the effective modification of 3D Gaussian scenes has become increasingly vital. This process entails accurately retrieving the target objects and subsequently performing modifications based on instructions. Though available in pieces, existing techniques mainly embed sparse semantics into Gaussians for retrieval, and rely on an iterative dataset update paradigm for editing, leading to over-smoothing or inconsistency issues. To this end, this paper proposes a systematic approach, namely TIGER, for coherent text-instructed 3D Gaussian retrieval and editing. In contrast to the top-down language grounding approach for 3D Gaussians, we adopt a bottom-up language aggregation strategy to generate a denser language embedded 3D Gaussians that supports open-vocabulary retrieval. To overcome the over-smoothing and inconsistency issues in editing, we propose a Coherent Score Distillation (CSD) that aggregates a 2D image editing diffusion model and a multi-view diffusion model for score distillation, producing multi-view consistent editing with much finer details. In various experiments, we demonstrate that our TIGER is able to accomplish more consistent and realistic edits than prior work. TIGER, a novel framework for text-instructed retrieval and editing of 3D Gaussian scenes. Editing objects in 3D scenes is crucial for various applications, and 3D Gaussian Splatting (3DGS) is gaining prominence as a scene representation method. Existing editing methods for 3D Gaussian scenes have limitations such as over-smoothing and inconsistency. TIGER uses a bottom-up language aggregation strategy with MaskCLIP and FeatUp for open-vocabulary 3D Gaussian retrieval. For editing, it proposes Coherent Score Distillation (CSD) that integrates SDS losses from an image editing diffusion model (InstructPix2Pix) and a multi-view diffusion model (MVDream). TIGER enables accurate open-vocabulary retrieval of 3D Gaussian objects. CSD facilitates multi-view consistent editing of 3D Gaussian scenes. TIGER produces more realistic and detailed edits compared to prior art. The language understanding is limited by the “bag-of-words” problem inherited from MaskCLIP. The editing process depends on pre-trained 2D diffusion models and can take up to 30 minutes for complex edits. 3d gaussian splatting, 3d scene editing, text-guided editing, score distillation sampling, multi-view consistency
2405.14452 Report JointRF: End-to-End Joint Optimization for Dynamic Neural Radiance Field Representation and Compression Zihan Zheng, Houqiang Zhong, Qiang Hu, Xiaoyun Zhang, Li Song, Ya Zhang, Yanfeng Wang Neural Radiance Field (NeRF) excels in photo-realistically static scenes, inspiring numerous efforts to facilitate volumetric videos. However, rendering dynamic and long-sequence radiance fields remains challenging due to the significant data required to represent volumetric videos. In this paper, we propose a novel end-to-end joint optimization scheme of dynamic NeRF representation and compression, called JointRF, thus achieving significantly improved quality and compression efficiency against the previous methods. Specifically, JointRF employs a compact residual feature grid and a coefficient feature grid to represent the dynamic NeRF. This representation handles large motions without compromising quality while concurrently diminishing temporal redundancy. We also introduce a sequential feature compression subnetwork to further reduce spatial-temporal redundancy. Finally, the representation and compression subnetworks are end-to-end trained combined within the JointRF. Extensive experiments demonstrate that JointRF can achieve superior compression performance across various datasets. Presents JointRF, a novel end-to-end learning scheme for jointly optimizing dynamic NeRF representation and compression, achieving better quality and higher compression efficiency for volumetric videos. Rendering dynamic and long-sequence radiance fields is challenging due to significant data requirements. JointRF addresses this by efficiently representing and compressing dynamic NeRF, improving quality and efficiency. JointRF uses a compact residual feature grid and a coefficient feature grid to represent dynamic NeRF, minimizing temporal redundancy. It introduces a sequential feature compression subnetwork and employs end-to-end training with simulated quantization and entropy model-based bitrate estimation. JointRF achieves superior compression performance across various datasets, as demonstrated by quantitative comparisons. It outperforms state-of-the-art methods in rate-distortion performance, with significant BDBR reductions compared to ReRF. Ablation studies confirm the effectiveness of the dynamic residual representation, compression module, and joint optimization strategy. The current implementation primarily focuses on optimizing storage size and might benefit from exploring techniques like keyframe selection and adaptive group size to further enhance streaming efficiency. Investigating the application of JointRF in related domains like immersive video streaming and 3D telepresence presents promising avenues for future work. volumetric videos, dynamic nerf, compression, end-to-end joint optimization, rate-distortion performance
2405.14430 Report PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models Jiannan Wang, Jiarui Fang, Aoyu Li, PengCheng Yang This paper introduces PipeFusion, a novel approach that harnesses multi-GPU parallelism to address the high computational and latency challenges of generating high-resolution images with diffusion transformers (DiT) models. PipeFusion splits images into patches and distributes the network layers across multiple devices. It employs a pipeline parallel manner to orchestrate communication and computations. By leveraging the high similarity between the input from adjacent diffusion steps, PipeFusion eliminates the waiting time in the pipeline by reusing the one-step stale feature maps to provide context for the current step. Our experiments demonstrate that it can generate higher image resolution where existing DiT parallel approaches meet OOM. PipeFusion significantly reduces the required communication bandwidth, enabling DiT inference to be hosted on GPUs connected via PCIe rather than the more costly NVLink infrastructure, which substantially lowers the overall operational expenses for serving DiT models. Our code is publicly available at https://github.com/PipeFusion/PipeFusion. This paper proposes PipeFusion, a novel pipelined parallel approach for Diffusion Transformers (DiT) inference that reduces communication bandwidth and memory demands by leveraging input similarities across diffusion steps. High-resolution image and long video generation with DiT models face high computational and latency challenges, demanding multi-GPU parallelism. Existing approaches are limited by high communication costs and memory requirements, often necessitating costly NVLink interconnects. PipeFusion splits images into patches, distributes DiT layers across multiple devices, and orchestrates computation and communication in a pipeline. It reuses one-step stale feature maps to provide context for the current step, eliminating pipeline waiting time. PipeFusion significantly reduces communication bandwidth, enabling DiT inference on PCIe-connected GPUs, thus reducing operational costs. Experiments show PipeFusion achieves similar or better latency compared to other parallel approaches, especially on higher resolutions. PipeFusion maintains high image quality, comparable to the original serial implementation. The effectiveness of PipeFusion diminishes on systems with high bandwidth interconnects like NVLink where communication cost is less of a bottleneck. Uneven partitioning of DiT layers across devices can lead to additional overhead, requiring further optimization. diffusion models, parallel computing, image generation, diffusion transformers, pipeline parallelism
2405.14343 Report Efficient Visual State Space Model for Image Deblurring Lingshun Kong, Jiangxin Dong, Ming-Hsuan Yang, Jinshan Pan Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. ViTs typically yield superior results in image restoration compared to CNNs due to their ability to capture long-range dependencies and input-dependent characteristics. However, the computational complexity of Transformer-based models grows quadratically with the image resolution, limiting their practical appeal in high-resolution image restoration tasks. In this paper, we propose a simple yet effective visual state space model (EVSSM) for image deblurring, leveraging the benefits of state space models (SSMs) to visual data. In contrast to existing methods that employ several fixed-direction scanning for feature extraction, which significantly increases the computational cost, we develop an efficient visual scan block that applies various geometric transformations before each SSM-based module, capturing useful non-local information and maintaining high efficiency. Extensive experimental results show that the proposed EVSSM performs favorably against state-of-the-art image deblurring methods on benchmark datasets and real-captured images. This paper introduces EVSSM, an efficient visual state space model for image deblurring, featuring an efficient visual scan block that employs geometric transformations to enhance non-local information exploration with minimal computational overhead. Existing image deblurring methods, especially Transformer-based ones, often struggle to balance computational efficiency with capturing long-range dependencies crucial for high-quality restoration, especially at high resolutions. The EVSSM employs a hierarchical encoder-decoder framework with efficient visual scan blocks. These blocks utilize geometric transformations (flip, transpose) before each scan, allowing for diverse contextual information capture from different directions without the computational burden of multi-directional scanning. EVSSM outperforms state-of-the-art image deblurring methods on benchmark datasets (GoPro, HIDE, RealBlur) achieving higher PSNR and SSIM values. The proposed efficient visual scan block effectively captures non-local information, leading to better restoration of image structures and details compared to methods with limited global context modeling. EVSSM demonstrates lower computational complexity and faster runtime than competing methods while maintaining superior deblurring performance. The current implementation explores limited geometric transformations (flip, transpose). Future work will investigate more sophisticated transformations like polar coordinate transformations to further enhance spatial information characterization using SSMs. image deblurring, state space models, visual scan block, geometric transformations, deep learning
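A minimal sketch of the efficient visual scan idea: apply a cheap geometric transformation (flip or transpose) to the feature map before a single-direction scan and undo it afterwards, so successive blocks see different orientations without multi-directional scanning. The per-block schedule and the placeholder ssm_scan are assumptions.

```python
# Sketch: alternate geometric transformations before each single-direction
# scan so that one scan order still covers several orientations of the image.
# `ssm_scan` is a placeholder for an SSM (Mamba-style) sequence layer.
import torch

def ssm_scan(seq):                 # placeholder: identity "scan" over (B, L, C)
    return seq

def scan_block(x, transform):
    """x: (B, C, H, W). Apply a geometric transform, scan row-major, undo it."""
    if transform == "flip_w":
        x = torch.flip(x, dims=[3])
    elif transform == "transpose":
        x = x.transpose(2, 3)
    b, c, h, w = x.shape
    seq = x.flatten(2).transpose(1, 2)          # (B, H*W, C), row-major order
    seq = ssm_scan(seq)
    x = seq.transpose(1, 2).reshape(b, c, h, w)
    if transform == "flip_w":                   # undo the transform
        x = torch.flip(x, dims=[3])
    elif transform == "transpose":
        x = x.transpose(2, 3)
    return x

feat = torch.randn(1, 32, 64, 64)
for t in ["none", "flip_w", "transpose", "flip_w"]:   # assumed per-block schedule
    feat = scan_block(feat, t)
```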
2405.14338 Report MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models Jiuming Liu, Jinru Han, Lihao Liu, Angelica I. Aviles-Rivero, Chaokang Jiang, Zhe Liu, Hesheng Wang Point cloud videos effectively capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing 3D world we live in. Although static 3D point cloud processing has witnessed significant advancements, designing an effective 4D point cloud video backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. Moreover, recent state-of-the-art 4D backbones predominantly rely on transformer-based architectures, which commonly suffer from large computational costs due to their quadratic complexity, particularly when processing long video sequences. To address these challenges, we propose a novel 4D point cloud video understanding backbone based on the recently advanced State Space Models (SSMs). Specifically, our backbone begins by disentangling space and time in raw 4D sequences, and then establishing spatio-temporal correlations using our newly developed Intra-frame Spatial Mamba and Inter-frame Temporal Mamba blocks. The Intra-frame Spatial Mamba module is designed to encode locally similar or related geometric structures within a certain temporal searching stride, which can effectively capture short-term dynamics. Subsequently, these locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which globally integrates point features across the entire video with linear complexity, further establishing long-range motion dependencies. Experimental results on human action recognition and 4D semantic segmentation tasks demonstrate the superiority of our proposed method. Especially, for long video sequences, our proposed Mamba-based method has an 87.5% GPU memory reduction, 5.36 times speed-up, and much higher accuracy (up to +10.4%) compared with transformer-based counterparts on MSR-Action3D dataset. This paper introduces MAMBA4D, a novel 4D point cloud video understanding backbone based entirely on state space models (SSMs), addressing the limitations of traditional CNN and transformer-based methods in terms of efficiency and long-range dependency modeling. Developing effective learning backbones for dynamic 4D point cloud sequences is crucial for enabling intelligent agents to understand the dynamically changing 3D world, impacting applications like robotics, AR/VR, and SLAM systems. Existing methods struggle with efficiency due to the irregular nature of point clouds and limitations in capturing long-range temporal dependencies. MAMBA4D disentangles space and time in 4D sequences and employs two novel modules: Intra-frame Spatial Mamba and Inter-frame Temporal Mamba. The former captures short-term local structures within temporally grouped point tubes, while the latter integrates features globally across the entire video with linear complexity to establish long-range dependencies. The authors also investigate different spatio-temporal scanning strategies within the Inter-frame Temporal Mamba. MAMBA4D outperforms CNN- and transformer-based methods in 3D action recognition on the MSR-Action3D dataset, achieving higher accuracy and exhibiting superior efficiency (87.5% GPU memory reduction, 5.36x faster) compared to the transformer baseline. MAMBA4D demonstrates competitive performance in 4D semantic segmentation on the Synthia 4D dataset, surpassing existing methods on most sub-sequences and achieving a 0.19 mIoU improvement over the baseline. Ablation studies validate the contribution of individual components in MAMBA4D, including the spatial and temporal modeling modules, the number of blocks, and the choice of spatio-temporal scanning strategies. While excelling in long video processing, MAMBA4D's accuracy for short video inputs is slightly lower compared to transformer-based methods. Future work will explore the application of MAMBA4D to other 4D tasks such as point-based object tracking, 4D point cloud prediction, and multi-frame scene flow. 4d point cloud video understanding, state space models, spatio-temporal modeling, action recognition, semantic segmentation
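For intuition, a sketch of the disentangled spatial-temporal factorization: mix point features within each frame first, then mix each point slot across frames with linear cost in sequence length. Plain linear layers stand in for the Intra-frame Spatial Mamba and Inter-frame Temporal Mamba blocks, and the fixed points-per-frame layout is a simplification.

```python
# Sketch of disentangled spatio-temporal processing for a point cloud video:
# first mix features within each frame (spatial), then mix each point slot
# across all frames (temporal). nn.Linear layers stand in for the actual
# Mamba/SSM blocks; shapes and the tracking-free regrouping are simplifications.
import torch
import torch.nn as nn

T, N, C = 16, 1024, 64                  # frames, points per frame, channels
feats = torch.randn(T, N, C)            # per-point features of a point cloud video

spatial_block = nn.Linear(C, C)         # placeholder for Intra-frame Spatial Mamba
temporal_block = nn.Linear(C, C)        # placeholder for Inter-frame Temporal Mamba

# Intra-frame: each frame's N points form one sequence.
x = spatial_block(feats)                # (T, N, C) -> (T, N, C)

# Inter-frame: regroup so every "point slot" becomes a length-T sequence.
x = x.transpose(0, 1)                   # (N, T, C)
x = temporal_block(x)                   # long-range temporal mixing per slot
x = x.transpose(0, 1)                   # back to (T, N, C)
```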
2405.14294 Report Tuning-free Universally-Supervised Semantic Segmentation Xiaobo Yang, Xiaojin Gong This work presents a tuning-free semantic segmentation framework based on classifying SAM masks by CLIP, which is universally applicable to various types of supervision. Initially, we utilize CLIP's zero-shot classification ability to generate pseudo-labels or perform open-vocabulary segmentation. However, the misalignment between mask and CLIP text embeddings leads to suboptimal results. To address this issue, we propose discrimination-bias aligned CLIP to closely align mask and text embedding, offering an overhead-free performance gain. We then construct a global-local consistent classifier to classify SAM masks, which reveals the intrinsic structure of high-quality embeddings produced by DBA-CLIP and demonstrates robustness against noisy pseudo-labels. Extensive experiments validate the efficiency and effectiveness of our method, and we achieve state-of-the-art (SOTA) or competitive performance across various datasets and supervision types. This paper introduces a novel tuning-free semantic segmentation framework that classifies Segment Anything Model (SAM) masks using Contrastive Language-Image Pretraining (CLIP) for diverse supervision levels (fully, semi, weakly, open-vocabulary). This approach aims to leverage the zero-shot capabilities of CLIP and the efficiency of tuning-free methods to achieve accurate and adaptable semantic segmentation across different levels of supervision. The framework utilizes a discrimination-bias aligned CLIP (DBA-CLIP) to generate text-aligned mask embeddings and a global-local consistent classifier (GLCC) for robust classification, particularly under sparse supervision. The method achieves state-of-the-art or competitive results on PASCAL VOC 2012, COCO-Obj, MS COCO 2014, COCO-Stuff, and Cityscapes datasets. DBA-CLIP significantly improves CLIP's zero-shot classification accuracy by aligning text and mask embeddings. GLCC effectively mitigates noise in pseudo-labels, leading to more accurate segmentation, especially in weakly and semi-supervised settings. The method's performance is limited by the capabilities of the underlying foundational models (SAM and CLIP). The high inference cost of foundational models may hinder deployment in resource-constrained environments. semantic segmentation, tuning-free, clip, sam, weakly supervised learning
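A minimal sketch of the underlying recipe (classifying class-agnostic masks with CLIP's zero-shot text matching), using the Hugging Face transformers CLIP API; the paper's DBA-CLIP alignment and global-local consistent classifier are not reproduced, and the dummy image and mask below are purely illustrative.

```python
# Sketch: zero-shot classification of class-agnostic masks with CLIP. Each
# mask is cropped from the image and scored against class-name prompts. This
# omits the paper's DBA-CLIP alignment and global-local consistent classifier.
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_masks(image: Image.Image, masks: list[np.ndarray], class_names: list[str]):
    prompts = [f"a photo of a {c}" for c in class_names]
    labels = []
    for m in masks:                               # m: boolean (H, W) mask, e.g. from SAM
        ys, xs = np.nonzero(m)
        crop = image.crop((int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1))
        inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
        logits = model(**inputs).logits_per_image  # (1, num_classes)
        labels.append(class_names[int(logits.argmax())])
    return labels

# Usage with a dummy image and a dummy mask covering its centre.
img = Image.new("RGB", (64, 64), color=(120, 180, 90))
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
print(classify_masks(img, [mask], ["grass", "cat", "car"]))
```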
2405.14276 Report D-MiSo: Editing Dynamic 3D Scenes using Multi-Gaussians Soup Joanna Waczyńska, Piotr Borycki, Joanna Kaleta, Sławomir Tadeja, Przemysław Spurek Over the past years, we have observed an abundance of approaches for modeling dynamic 3D scenes using Gaussian Splatting (GS). Such solutions use GS to represent the scene's structure and the neural network to model dynamics. Such approaches allow fast rendering and extracting each element of such a dynamic scene. However, modifying such objects over time is challenging. SC-GS (Sparse Controlled Gaussian Splatting) enhanced with Deformed Control Points partially solves this issue. However, this approach necessitates selecting elements that need to be kept fixed, as well as centroids that should be adjusted throughout editing. Moreover, this task poses additional difficulties regarding the re-productivity of such editing. To address this, we propose Dynamic Multi-Gaussian Soup (D-MiSo), which allows us to model the mesh-inspired representation of dynamic GS. Additionally, we propose a strategy of linking parameterized Gaussian splats, forming a Triangle Soup with the estimated mesh. Consequently, we can separately construct new trajectories for the 3D objects composing the scene. Thus, we can make the scene's dynamic editable over time or while maintaining partial dynamics. This paper introduces D-MiSo, a novel mesh-inspired Gaussian Splatting method for modeling and editing dynamic 3D scenes. Existing methods struggle with efficiently editing dynamic 3D scenes represented by Gaussian Splats. D-MiSo allows for intuitive and scalable object modifications over time. D-MiSo utilizes Multi-Gaussian components: larger Core-Gaussians (for global transformations) encompassing smaller Sub-Gaussians (for rendering). These are parameterized by two Triangle Soups, enabling mesh-like control. Two deformation networks handle object movement and detailed changes over time. D-MiSo achieves comparable or superior reconstruction quality to state-of-the-art methods on D-NeRF, NeRF-DS, and PanopticSports datasets. The method enables three editing approaches: modifying the estimated mesh, directly editing the Sub-Triangle Soup, and transforming the object's space. D-MiSo allows for intuitive object manipulation, including moving, scaling, rotating, duplicating, removing, and applying dynamic effects. Editing areas poorly represented in the training set remains challenging due to limitations of Triangle Soup. Future work could explore more sophisticated meshing strategies or alternative representations for detailed editing. gaussian splatting, dynamic 3d scenes, scene editing, mesh-based representation, multi-gaussian components
2405.14241 Report NeuroGauss4D-PCI: 4D Neural Fields and Gaussian Deformation Fields for Point Cloud Interpolation Chaokang Jiang, Dalong Du, Jiuming Liu, Siting Zhu, Zhenqiang Liu, Zhuang Ma, Zhujin Liang, Jie Zhou Point Cloud Interpolation confronts challenges from point sparsity, complex spatiotemporal dynamics, and the difficulty of deriving complete 3D point clouds from sparse temporal information. This paper presents NeuroGauss4D-PCI, which excels at modeling complex non-rigid deformations across varied dynamic scenes. The method begins with an iterative Gaussian cloud soft clustering module, offering structured temporal point cloud representations. The proposed temporal radial basis function Gaussian residual utilizes Gaussian parameter interpolation over time, enabling smooth parameter transitions and capturing temporal residuals of Gaussian distributions. Additionally, a 4D Gaussian deformation field tracks the evolution of these parameters, creating continuous spatiotemporal deformation fields. A 4D neural field transforms low-dimensional spatiotemporal coordinates ($x,y,z,t$) into a high-dimensional latent space. Finally, we adaptively and efficiently fuse the latent features from neural fields and the geometric features from Gaussian deformation fields. NeuroGauss4D-PCI outperforms existing methods in point cloud frame interpolation, delivering leading performance on both object-level (DHB) and large-scale autonomous driving datasets (NL-Drive), with scalability to auto-labeling and point cloud densification tasks. The source code is released at https://github.com/jiangchaokang/NeuroGauss4D-PCI. This paper presents NeuroGauss4D-PCI, a novel 4D spatio-temporal modeling method for point cloud frame interpolation that excels at modeling complex non-rigid deformations across varied dynamic scenes by adaptively fusing a 4D neural field and a 4D Gaussian deformation field. Point cloud frame interpolation (PCI) is crucial for various applications, including autonomous driving and virtual reality, but faces challenges due to the inherent sparsity of point cloud data, the complexity of modeling spatiotemporal dynamics, and the difficulty of generalizing from sparse temporal samples. NeuroGauss4D-PCI represents point clouds through iterative Gaussian soft clustering and a 4D neural field. A temporal radial basis function Gaussian residual module captures temporal dynamics of Gaussian parameters, while a 4D Gaussian deformation field models their spatiotemporal variations. Finally, a fast latent-geometric fusion module combines features from the 4D neural field and the 4D Gaussian deformation field. NeuroGauss4D-PCI achieves state-of-the-art performance on both object-level (DHB) and large-scale autonomous driving datasets (NL-Drive). The method effectively handles challenges like non-rigid deformations, large-scale motions, occlusions, and non-uniform data distributions. NeuroGauss4D-PCI demonstrates scalability to tasks like LiDAR-camera temporal synchronization, point cloud densification, and 4D automatic annotation. The model's interpretability is limited due to the integration of various features and the use of deep neural networks. The runtime optimization process, similar to NeRF, is computationally demanding, accounting for nearly 90% of the total processing time. point cloud interpolation, 4d spatio-temporal modeling, gaussian deformation field, neural field, autonomous driving
2405.14224 Report DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu Diffusion models have achieved great success in image generation, with the backbone evolving from U-Net to Vision Transformers. However, the computational cost of Transformers is quadratic to the number of tokens, leading to significant challenges when dealing with high-resolution images. In this work, we propose Diffusion Mamba (DiM), which combines the efficiency of Mamba, a sequence model based on State Space Models (SSM), with the expressive power of diffusion models for efficient high-resolution image synthesis. To address the challenge that Mamba cannot generalize to 2D signals, we make several architecture designs including multi-directional scans, learnable padding tokens at the end of each row and column, and lightweight local feature enhancement. Our DiM architecture achieves inference-time efficiency for high-resolution images. In addition, to further improve training efficiency for high-resolution image generation with DiM, we investigate "weak-to-strong" training strategy that pretrains DiM on low-resolution images (256x256) and then finetune it on high-resolution images (512x512). We further explore training-free upsampling strategies to enable the model to generate higher-resolution images (e.g., 1024x1024 and 1536x1536) without further fine-tuning. Experiments demonstrate the effectiveness and efficiency of our DiM. Proposes Diffusion Mamba (DiM), a Mamba-based diffusion model for efficient high-resolution image synthesis, by combining the efficiency of Mamba with the expressive power of diffusion models. Addresses the computational challenges of Transformer-based diffusion models in high-resolution image generation due to their quadratic complexity. Introduces architectural designs like multi-directional scans, learnable padding tokens, and lightweight local feature enhancement to adapt Mamba for 2D image data. Employs a 'weak-to-strong' training strategy, pretraining on low-resolution images and fine-tuning on high-resolution images. DiM achieves inference-time efficiency for high-resolution images, outperforming Transformers at resolutions above 1280x1280. Pretraining on low-resolution images and fine-tuning on high-resolution images significantly reduces training time and computational cost. Training-free upsampling techniques enable DiM to generate even higher resolution images (e.g., 1024x1024, 1536x1536) without further fine-tuning. DiM, while faster at very high resolutions, is slightly less efficient than Transformers at resolutions below 1024x1024. The model still faces challenges in generating images with complex details, particularly for human subjects and in avoiding repeating patterns during upsampling. Future work could focus on optimizing DiM's efficiency at lower resolutions and improving its ability to handle complex details. image generation, diffusion models, state space models, mamba, high-resolution
2405.14206 Report LG-VQ: Language-Guided Codebook Learning Guotao Liang, Baoquan Zhang, Yaowei Wang, Xutao Li, Yunming Ye, Huaibin Wang, Chuyao Luo, Kola Ye, linfeng Luo Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis, which aims to learn a codebook to encode an image with a sequence of discrete codes and then generate an image in an auto-regression manner. Although existing methods have shown superior performance, most methods prefer to learn a single-modal codebook (e.g., image), resulting in suboptimal performance when the codebook is applied to multi-modal downstream tasks (e.g., text-to-image, image captioning) due to the existence of modal gaps. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with the text to improve the performance of multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (i.e., Semantic Alignment Module, and Relationship Alignment Module) to transfer such prior knowledge into codes for achieving codebook text alignment. In particular, our LG-VQ method is model-agnostic, which can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks. This paper proposes LG-VQ, a language-guided codebook learning framework for VQ models that aligns codebooks with text semantics to enhance performance in multi-modal downstream tasks. Existing VQ methods primarily focus on single-modal codebooks, leading to suboptimal performance in multi-modal tasks due to modal gaps and a lack of high-level semantics. LG-VQ leverages pre-trained text semantics from CLIP and introduces two novel alignment modules: a Semantic Alignment Module (global semantic alignment and masked text prediction) and a Relationship Alignment Module (transfers semantic relationships between words to codes). LG-VQ outperforms baseline VQ models in image reconstruction across multiple datasets, as evidenced by FID and PSNR scores. The method demonstrates strong performance in multi-modal downstream tasks like text-to-image synthesis, image captioning, and VQA, highlighting the effectiveness of the text-aligned codebook. Ablation studies confirm the individual contributions of the semantic and relationship alignment modules to the improved performance. The current approach assumes a one-to-one mapping between words and codes, potentially overlooking more complex relationships. While LG-VQ significantly enhances VQ performance in visual text reasoning, there is still a performance gap compared to dedicated image captioning or VQA models. vector quantization, multi-modal learning, codebook learning, vision-language representation learning, image generation
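A hedged sketch of a global code-text alignment objective: pooled codebook embeddings are projected into the text space and pulled toward frozen CLIP caption embeddings with a symmetric contrastive loss. The projection, mean pooling, and InfoNCE form are assumptions; the masked-text and relationship-alignment modules are omitted.

```python
# Sketch of a global code-text alignment loss: pooled codebook embeddings of
# each image are pulled toward that image's (pre-computed) CLIP text embedding
# with a symmetric InfoNCE loss. Dimensions and the pooling rule are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, L, D_code, D_text = 8, 256, 256, 512   # batch, codes per image, code dim, CLIP dim
code_embs = torch.randn(B, L, D_code)     # quantized code embeddings from the VQ encoder
text_embs = torch.randn(B, D_text)        # frozen CLIP text features for the captions

proj = nn.Linear(D_code, D_text)          # project codes into the text space

def alignment_loss(code_embs, text_embs, temperature=0.07):
    z = F.normalize(proj(code_embs.mean(dim=1)), dim=-1)   # pooled code feature
    t = F.normalize(text_embs, dim=-1)
    logits = z @ t.t() / temperature                        # (B, B) similarity matrix
    targets = torch.arange(z.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = alignment_loss(code_embs, text_embs)   # added to the usual VQ training losses
```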
2405.14201 Report FreeTuner: Any Subject in Any Style with Training-free Diffusion Youcan Xu, Zhen Wang, Jun Xiao, Wei Liu, Long Chen With the advance of diffusion models, various personalized image generation methods have been proposed. However, almost all existing work only focuses on either subject-driven or style-driven personalization. Meanwhile, state-of-the-art methods face several challenges in realizing compositional personalization, i.e., composing different subject and style concepts, such as concept disentanglement, unified reconstruction paradigm, and insufficient training data. To address these issues, we introduce FreeTuner, a flexible and training-free method for compositional personalization that can generate any user-provided subject in any user-provided style (see Figure 1). Our approach employs a disentanglement strategy that separates the generation process into two stages to effectively mitigate concept entanglement. FreeTuner leverages the intermediate features within the diffusion model for subject concept representation and introduces style guidance to align the synthesized images with the style concept, ensuring the preservation of both the subject's structure and the style's aesthetic features. Extensive experiments have demonstrated the generation ability of FreeTuner across various personalization settings. FreeTuner, a training-free method for compositional personalization in image generation, enabling the synthesis of user-provided subjects in user-provided styles using diffusion models. Addresses limitations of existing personalization methods that focus on either subject-driven or style-driven generation, failing to effectively compose both aspects. Employs a two-stage disentanglement strategy: 1) Content generation stage leverages intermediate features from diffusion models for subject representation. 2) Style generation stage introduces style guidance based on pre-trained encoders (e.g., VGG-19) to align the output with desired style aesthetics. Achieves high-quality compositional personalization, preserving both subject structure and style aesthetics. Outperforms existing methods like B-LoRA in composing subjects and styles, showing superior visual fidelity. Demonstrates generalizability by integrating seamlessly with other diffusion-based methods like ControlNet and BoxDiff. Reliance on null-text inversion for accurate feature extraction increases computational time compared to standard inversion methods. Limited style transfer capability restricted to the employed visual encoder. image generation, compositional personalization, diffusion models, style transfer, training-free methods
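To make the style-guidance idea concrete, the sketch below computes a Gram-matrix style loss on VGG-19 features between a candidate image and a style reference and takes its gradient, the kind of training-free signal that can steer sampling toward the style concept. The chosen layers, equal weights, and omitted VGG input normalization are assumptions.

```python
# Sketch of a Gram-matrix style loss on VGG-19 features, usable as a
# training-free style guidance signal during sampling. The layer indices and
# equal weighting are assumptions for illustration.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

features = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
for p in features.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = {1, 6, 11, 20}        # a few early/middle conv outputs (assumed choice)

def gram(x):                          # x: (B, C, H, W)
    b, c, h, w = x.shape
    f = x.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(img, style_img):
    loss, x, y = 0.0, img, style_img
    for i, layer in enumerate(features):
        x, y = layer(x), layer(y)
        if i in STYLE_LAYERS:
            loss = loss + F.mse_loss(gram(x), gram(y))
    return loss

img = torch.rand(1, 3, 224, 224, requires_grad=True)    # e.g. a decoded x0 estimate
style = torch.rand(1, 3, 224, 224)
g = torch.autograd.grad(style_loss(img, style), img)[0]  # gradient used to steer sampling
```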
2405.14174 Report Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model Yuheng Shi, Minjing Dong, Chang Xu Despite the significant achievements of Vision Transformers (ViTs) in various vision tasks, they are constrained by the quadratic complexity. Recently, State Space Models (SSMs) have garnered widespread attention due to their global receptive field and linear complexity with respect to the input length, demonstrating substantial potential across fields including natural language processing and computer vision. To improve the performance of SSMs in vision tasks, a multi-scan strategy is widely adopted, which leads to significant redundancy of SSMs. For a better trade-off between efficiency and performance, we analyze the underlying reasons behind the success of the multi-scan strategy, where long-range dependency plays an important role. Based on the analysis, we introduce Multi-Scale Vision Mamba (MSVMamba) to preserve the superiority of SSMs in vision tasks with limited parameters. It employs a multi-scale 2D scanning technique on both original and downsampled feature maps, which not only benefits long-range dependency learning but also reduces computational costs. Additionally, we integrate a Convolutional Feed-Forward Network (ConvFFN) to address the lack of channel mixing. Our experiments demonstrate that MSVMamba is highly competitive, with the MSVMamba-Tiny model achieving 82.8% top-1 accuracy on ImageNet, 46.9% box mAP, and 42.2% instance mAP with the Mask R-CNN framework, 1x training schedule on COCO, and 47.6% mIoU with single-scale testing on ADE20K. Code is available at https://github.com/YuHengsss/MSVMamba. This paper proposes Multi-Scale Vision Mamba (MSVMamba), an efficient and scalable State Space Model (SSM) for vision tasks, addressing the quadratic complexity issue of Vision Transformers (ViTs) and redundancy in multi-scan SSMs. This work is important because it improves the efficiency and performance of SSMs in vision tasks by addressing the long-range dependency limitations of existing methods. The paper introduces a Multi-Scale 2D (MS2D) scanning strategy and incorporates a Convolutional Feed-Forward Network (ConvFFN) to enhance channel mixing within the MSVMamba architecture. MSVMamba achieves competitive results on ImageNet-1K, outperforming similar-sized models while using fewer computational resources. In object detection and instance segmentation tasks on COCO, MSVMamba surpasses Swin Transformer and other SSM-based models in terms of accuracy. For semantic segmentation on ADE20K, MSVMamba demonstrates superior performance compared to competing models like Swin, ConvNeXt, and VMamba. The scalability of the multi-scale design needs further exploration, especially for larger model sizes where the improvement might be marginal. Future work could investigate the application of MSVMamba in other vision tasks beyond classification, detection, and segmentation. state space models, vision transformers, computer vision, multi-scale modeling, long-range dependencies
2405.14129 Report AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their capability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: the pre-training phase and the instruction-tuning phase. Despite their success, there are shortcomings in the modeling of alignment capabilities within these models. Firstly, during the pre-training phase, the model usually assumes that all image-text pairs are uniformly aligned, but in fact the degree of alignment between different image-text pairs is inconsistent. Secondly, the instructions currently used for finetuning incorporate a variety of tasks; different tasks' instructions usually require different levels of alignment capabilities, but previous MLLMs overlook these differentiated alignment needs. To tackle these issues, we propose a new multimodal large language model AlignGPT. In the pre-training stage, instead of treating all image-text pairs equally, we assign different levels of alignment capabilities to different image-text pairs. Then, in the instruction-tuning phase, we adaptively combine these different levels of alignment capabilities to meet the dynamic alignment needs of different instructions. Extensive experimental results show that our model achieves competitive performance on 12 benchmarks. This paper presents AlignGPT, a new multimodal large language model that enhances the alignment capabilities of existing MLLMs by considering varying degrees of alignment in image-text pairs. Existing MLLMs often assume uniform alignment between image-text pairs during pre-training and overlook the different alignment requirements of various tasks during instruction-tuning. AlignGPT introduces controllable alignment levels during pre-training based on CLIP scores and adaptively combines global and local alignment capabilities during instruction-tuning according to the specific needs of each instruction. AlignGPT achieves competitive performance on 12 benchmarks, outperforming several state-of-the-art MLLMs. Higher image resolutions lead to improved model performance in most multimodal tasks. The choice of large language model significantly impacts AlignGPT's performance, with larger models and those fine-tuned on instructional data generally performing better. AlignGPT might not excel in text-centric scenarios that demand a strong focus on text understanding. Future work can explore the impact of different visual backbones and more sophisticated gate network architectures. multimodal large language model, cross-modal alignment, visual question answering, instruction tuning, clip score
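A small sketch of how image-text pairs could be assigned discrete alignment levels from their CLIP similarity scores, here by quantile bucketing over the corpus; the number of levels and the bucketing rule are assumptions rather than the paper's exact procedure.

```python
# Sketch: assign each pre-training image-text pair a discrete alignment level
# from its CLIP image-text similarity, so the model can condition on how well
# the caption actually describes the image. Eight quantile buckets are assumed.
import numpy as np

def alignment_levels(clip_scores: np.ndarray, num_levels: int = 8) -> np.ndarray:
    # Quantile edges over the whole corpus -> roughly balanced buckets.
    edges = np.quantile(clip_scores, np.linspace(0, 1, num_levels + 1)[1:-1])
    return np.digitize(clip_scores, edges)        # values in {0, ..., num_levels-1}

scores = np.random.rand(10000) * 0.4              # stand-in for CLIP similarities
levels = alignment_levels(scores)
print(np.bincount(levels))                        # roughly uniform counts per level
```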
2405.14119 Report PuTR: A Pure Transformer for Decoupled and Online Multi-Object Tracking Chongwei Liu, Haojie Li, Zhihui Wang, Rui Xu Recent advances in Multi-Object Tracking (MOT) have achieved remarkable success in short-term association within the decoupled tracking-by-detection online paradigm. However, long-term tracking still remains a challenging task. Although graph-based approaches can address this issue by modeling trajectories as a graph in the decoupled manner, their non-online nature poses obstacles for real-time applications. In this paper, we demonstrate that the trajectory graph is a directed acyclic graph, which can be represented by an object sequence arranged by frame and a binary adjacency matrix. It is a coincidence that the binary matrix matches the attention mask in the Transformer, and the object sequence serves exactly as a natural input sequence. Intuitively, we propose that a pure Transformer can naturally unify short- and long-term associations in a decoupled and online manner. Our experiments show that a classic Transformer architecture naturally suits the association problem and achieves a strong baseline compared to existing foundational methods across four datasets: DanceTrack, SportsMOT, MOT17, and MOT20, as well as superior generalizability in domain shift. Moreover, the decoupled property also enables efficient training and inference. This work pioneers a promising Transformer-based approach for the MOT task, and provides code to facilitate further research. https://github.com/chongweiliu/PuTR This paper proposes PuTR, a pure Transformer architecture for online multi-object tracking (MOT) association, unifying short- and long-term association in a decoupled manner. Existing online MOT methods struggle with long-term tracking, while offline graph-based methods lack real-time applicability. This work explores the potential of Transformers to address both short- and long-term association in a unified and efficient framework. The authors leverage the natural alignment between the trajectory graph and the Transformer's attention mechanism. Objects are tokenized and fed into a Transformer with a modified attention mask and positional encodings to handle temporal and spatial relationships. A relative affinity matrix is used for association, eliminating the need for a fixed ID dictionary. PuTR achieves strong baseline performance compared to existing foundational methods on DanceTrack, SportsMOT, MOT17, and MOT20 datasets. It exhibits superior generalization ability in domain shift scenarios, maintaining consistent performance across datasets without fine-tuning. The decoupled nature allows for efficient training (under 1 hour on a single GPU) and inference (up to 90 FPS). The context length of the Transformer limits the model's ability to handle long sequences, requiring further exploration of long context modeling. The current model primarily relies on appearance cues, and incorporating motion information could enhance performance, particularly for small objects. multi-object tracking, transformer, association, online tracking, long-term tracking
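A minimal sketch of the core observation: detections from several frames are flattened into one token sequence, and a binary attention mask derived from frame order lets the Transformer respect the trajectory DAG. Whether same-frame tokens may attend to each other is a design choice; here they may not.

```python
# Sketch: flatten detections from several frames into one token sequence and
# build a binary attention mask so each token attends only to tokens from
# earlier frames (and to itself), mirroring the directed acyclic trajectory graph.
import torch

frame_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])        # frame index per detection token
mask = frame_ids[None, :] < frame_ids[:, None]             # (L, L): True = "may attend"
mask |= torch.eye(len(frame_ids), dtype=torch.bool)        # always attend to self

tokens = torch.randn(1, len(frame_ids), 64)
attn = torch.nn.MultiheadAttention(64, 4, batch_first=True)
out, _ = attn(tokens, tokens, tokens, attn_mask=~mask)      # bool attn_mask: True = blocked
```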
2405.14101 Report Enhancing Image Layout Control with Loss-Guided Diffusion Models Zakaria Patel, Kirill Serkh Diffusion models are a powerful class of generative models capable of producing high-quality images from pure noise. In particular, conditional diffusion models allow one to specify the contents of the desired image using a simple text prompt. Conditioning on a text prompt alone, however, does not allow for fine-grained control over the composition and layout of the final image, which instead depends closely on the initial noise distribution. While most methods which introduce spatial constraints (e.g., bounding boxes) require fine-tuning, a smaller and more recent subset of these methods are training-free. They are applicable whenever the prompt influences the model through an attention mechanism, and generally fall into one of two categories. The first entails modifying the cross-attention maps of specific tokens directly to enhance the signal in certain regions of the image. The second works by defining a loss function over the cross-attention maps, and using the gradient of this loss to guide the latent. While previous work explores these as alternative strategies, we provide an interpretation for these methods which highlights their complementary features, and demonstrate that it is possible to obtain superior performance when both methods are used in concert. Presents injection loss guidance (iLGD), a training-free method for layout control in text-to-image generation using diffusion models. Existing methods for controlling image layout often degrade image quality or require expensive fine-tuning. iLGD addresses these limitations by combining the strengths of attention injection and loss guidance. iLGD biases the latent representation of the image towards a desired layout using attention injection and refines it further with a loss function applied to the attention maps. iLGD generates images that adhere more closely to the prescribed layout compared to using injection alone. iLGD maintains better image quality than methods relying solely on loss guidance (e.g., BoxDiff). iLGD achieves higher scores on perceptual quality metrics (CLIP-IQA) while maintaining comparable layout accuracy (YOLOv4) and text-image similarity (T2I-Sim) to other methods. The method's sensitivity to the initial random seed requires further investigation. Exploring alternative loss functions and injection strategies could further enhance layout control and image quality. diffusion models, layout control, text-to-image generation, attention injection, loss guidance
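A hedged sketch of the loss-guidance half of this recipe: a loss over a token's cross-attention map encourages its attention mass to fall inside the prescribed box, and the gradient of that loss nudges the latent at each denoising step. The `unet_with_attn` callable and the exact loss form are placeholders, not the authors' implementation.

```python
import torch

def layout_loss(attn_map: torch.Tensor, box_mask: torch.Tensor) -> torch.Tensor:
    """attn_map, box_mask: (H, W); push the token's attention mass inside the box."""
    inside = (attn_map * box_mask).sum()
    return 1.0 - inside / (attn_map.sum() + 1e-8)

def guided_step(latent, t, unet_with_attn, box_mask, scale=10.0):
    latent = latent.detach().requires_grad_(True)
    noise_pred, attn_map = unet_with_attn(latent, t)   # attn_map for the target token
    loss = layout_loss(attn_map, box_mask)
    grad = torch.autograd.grad(loss, latent)[0]
    # Shift the latent against the gradient before the usual scheduler update;
    # attention injection (the other half of iLGD) would act on the attention maps themselves.
    return latent.detach() - scale * grad, noise_pred.detach()
```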
2405.14024 Report Two Heads are Better Than One: Neural Networks Quantization with 2D Hilbert Curve-based Output Representation Mykhailo Uss, Ruslan Yermolenko, Olena Kolodiazhna, Oleksii Shashko, Ivan Safonov, Volodymyr Savin, Yoonjae Yeo, Seowon Ji, Jaeyun Jeong Quantization is widely used to increase deep neural networks' (DNN) memory, computation, and power efficiency. Various techniques, such as post-training quantization and quantization-aware training, have been proposed to improve quantization quality. We introduce a novel approach for DNN quantization that uses a redundant representation of DNN's output. We represent the target quantity as a point on a 2D parametric curve. The DNN model is modified to predict 2D points that are mapped back to the target quantity at a post-processing stage. We demonstrate that this mapping can reduce quantization error. For the low-order parametric Hilbert curve, Depth-From-Stereo task, and two models represented by U-Net architecture and vision transformer, we achieved a quantization error reduction by about 5 times for the INT8 model at both CPU and DSP delegates. This gain comes with a minimal inference time increase (less than 7%). Our approach can be applied to other tasks, including segmentation, object detection, and key-points prediction. This paper introduces a novel DNN quantization method that reduces quantization error by representing the output as a point on a 2D low-order Hilbert curve, exploiting the redundancy in this representation. Quantization is crucial for deploying DNNs on devices with limited resources, but it often leads to quality degradation. This method offers a way to mitigate this degradation and improve the accuracy of quantized models. The authors modify the DNN architecture to predict points on a Hilbert curve instead of a scalar output. They introduce a new loss function to guide the training process and utilize lookup tables for efficient mapping between 1D and 2D representations. The proposed method reduces quantization error by a factor of ≈5 for INT8 models on both CPU and DSP, achieving near-FP32 accuracy for the Depth-From-Stereo task. The Hilbert curve representation effectively increases the bit-width of the output, improving the representation of spatial details in the quantized model. The method incurs a minimal increase in inference time (<7%) without noticeable impact on power consumption. The approach is currently limited to models predicting bounded quantities and may not correct large quantization errors (outliers). Further research is needed to explore its application to other tasks, quantization techniques, and higher-dimensional representations. quantization-aware training, space-filling curve, hilbert curve, depth-from-stereo, snapdragon neural processing engine
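An illustrative sketch, not the authors' implementation, of the redundant 2D output representation described above: a scalar target in [0, 1) is encoded as a vertex of a low-order Hilbert curve for training, and the prediction is mapped back to 1D via a nearest-vertex lookup after (quantized) inference. The curve order and normalization are assumptions.

```python
import numpy as np

def hilbert_d2xy(order: int, d: int):
    """Map curve index d in [0, 4**order) to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

order = 3                                 # "low-order" curve, as in the paper
n_pts = 4 ** order
curve = np.array([hilbert_d2xy(order, d) for d in range(n_pts)], dtype=np.float32)
curve /= (2 ** order - 1)                 # normalize coordinates to [0, 1]

def encode(target: float) -> np.ndarray:  # scalar -> 2D training label
    return curve[int(np.clip(target, 0, 1 - 1e-6) * n_pts)]

def decode(point_2d: np.ndarray) -> float:  # predicted 2D point -> scalar
    d = np.argmin(((curve - point_2d) ** 2).sum(axis=1))
    return (d + 0.5) / n_pts
```

Because neighbouring curve vertices are far apart in 1D only rarely, small per-coordinate quantization errors in the 2D prediction translate into small errors after decoding, which is the intuition behind the reported error reduction.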
2405.13956 Report Attention as an RNN Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Mohamed Osama Ahmed, Yoshua Bengio, Greg Mori The advent of Transformers marked a significant breakthrough in sequence modelling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., mobile and embedded devices). Addressing this, we (1) begin by showing that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently. We then (2) show that popular attention-based models such as Transformers can be viewed as RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models cannot be updated efficiently with new tokens, an important property in sequence modelling. Tackling this, we (3) introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (4) introduce Aaren, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs). Empirically, we show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting tasks while being more time and memory-efficient. This paper introduces Aaren, an attention-based module for sequence modeling that achieves comparable performance to Transformers while being more time and memory efficient. Transformers, while powerful, are computationally expensive at inference time, limiting their use in low-resource settings like mobile devices. Aaren addresses this limitation. The paper first presents attention as a special type of Recurrent Neural Network (RNN) and then introduces a new method to efficiently compute attention's RNN output based on the parallel prefix scan algorithm. Aaren builds upon this formulation. Aarens achieve comparable performance to Transformers across 38 datasets in four problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting. Aarens demonstrate significant improvements in time and memory efficiency compared to Transformers, using constant memory regardless of the number of tokens processed. Aarens can efficiently update with new tokens at inference time, making them particularly well-suited for streaming data scenarios common in sequence modeling. Aarens use input-independent attention queries, potentially limiting their expressiveness in large sequence models compared to Transformers. Future work could explore applying Aarens to more complex sequence modeling tasks, such as natural language processing, to further investigate their capabilities and limitations. attention mechanism, sequence modeling, recurrent neural networks, parallel prefix scan, efficient inference
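A small sketch of the "attention as an RNN" view (not the authors' code): for a single query q, the attention output over the prefix k_1..k_t can be maintained with a constant-memory recurrence over a running max, numerator, and denominator, so new tokens are absorbed one at a time. The same combine step is associative, which is what allows the parallel prefix scan at training time; note that Aaren additionally uses a learned, input-independent query.

```python
import torch

def attention_as_rnn(q: torch.Tensor, keys: torch.Tensor, values: torch.Tensor):
    """q: (d,), keys/values: (T, d). Yields the many-to-many outputs o_1..o_T."""
    m = torch.tensor(float("-inf"))          # running max of scores (numerical stability)
    num = torch.zeros_like(values[0])        # running sum of exp(score - m) * v
    den = torch.zeros(())                    # running sum of exp(score - m)
    outputs = []
    for k, v in zip(keys, values):
        s = q @ k
        m_new = torch.maximum(m, s)
        scale = torch.exp(m - m_new)         # rescale the old accumulators
        num = num * scale + torch.exp(s - m_new) * v
        den = den * scale + torch.exp(s - m_new)
        m = m_new
        outputs.append(num / den)
    return torch.stack(outputs)

T, d = 6, 4
q, K, V = torch.randn(d), torch.randn(T, d), torch.randn(T, d)
out = attention_as_rnn(q, K, V)
# Sanity check: the final RNN state equals ordinary softmax attention over all tokens.
ref = torch.softmax(K @ q, dim=0) @ V
assert torch.allclose(out[-1], ref, atol=1e-5)
```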
2405.13951 Report Text Prompting for Multi-Concept Video Customization by Autoregressive Generation Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions, sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, action and background) such as a teddy bear running towards a brown teapot, a dog playing violin and a teddy bear swimming in the ocean. We quantitatively evaluate our method using videoCLIP and DINO scores, in addition to human evaluation. Videos for results presented in this paper can be found at https://github.com/divyakraman/MultiConceptVideo2024. This paper introduces a novel approach for multi-concept customization of pretrained text-to-video (T2V) models using autoregressive generation, allowing users to generate videos featuring multiple customized concepts and their interactions. Existing T2V models struggle with generating long videos featuring consistent subjects and their interactions, especially when dealing with multiple customized concepts. This work addresses this limitation by enabling more control and flexibility in video generation. The method involves finetuning a pretrained T2V model with adapter layers on input images/videos representing customized concepts. Then, it leverages the autoregressive nature of the model to sequentially generate video frames, introducing and controlling the appearance and interactions of multiple customized concepts over time. The approach effectively generates customized videos with multiple interacting subjects, demonstrating significant improvements over baseline methods. Quantitative evaluations using videoCLIP and DINO scores, along with human evaluation, showcase the effectiveness in generating customized concepts and their interactions. The method also proves useful for single-concept customization when compositionality is desired, offering a promising direction for future research. Extending the method beyond three concepts and achieving finer control over interactions remain challenging. Improving the quality of generated videos relies heavily on advancements in video foundation models and superresolution techniques. text-to-video generation, multi-concept customization, autoregressive generation, video personalization, generative ai
2405.13943 Report DoGaussian: Distributed-Oriented Gaussian Splatting for Large-Scale 3D Reconstruction Via Gaussian Consensus Yu Chen, Gim Hee Lee The recent advances in 3D Gaussian Splatting (3DGS) show promising results on the novel view synthesis (NVS) task. With its superior rendering performance and high-fidelity rendering quality, 3DGS surpasses its previous NeRF counterparts. Recent 3DGS methods focus either on addressing the instability of rendering efficiency or on reducing the model size. On the other hand, the training efficiency of 3DGS on large-scale scenes has not gained much attention. In this work, we propose DoGaussian, a method that trains 3DGS distributedly. Our method first decomposes a scene into K blocks and then introduces the Alternating Direction Method of Multipliers (ADMM) into the training procedure of 3DGS. During training, our DoGaussian maintains one global 3DGS model on the master node and K local 3DGS models on the slave nodes. The K local 3DGS models are dropped after training and we only query the global 3DGS model during inference. The training time is reduced by scene decomposition, and the training convergence and stability are guaranteed through the consensus on the shared 3D Gaussians. Our method accelerates the training of 3DGS by 6+ times when evaluated on large-scale scenes while concurrently achieving state-of-the-art rendering quality. Our project page is available at https://aibluefisher.github.io/DoGaussian. This paper introduces DoGaussian, a distributed training approach for 3D Gaussian Splatting (3DGS) aimed at improving the efficiency of large-scale scene reconstruction. Training 3DGS on large scenes poses challenges due to high GPU memory requirements and long training times. DoGaussian addresses these issues by enabling efficient distributed training. DoGaussian decomposes the scene into blocks, assigns training data to each block, and uses the Alternating Direction Method of Multipliers (ADMM) to ensure consistency among shared 3D Gaussians during training. DoGaussian accelerates the training of 3DGS by 6+ times compared to the original 3DGS on large-scale scenes. The method maintains high-fidelity rendering quality, achieving state-of-the-art results in novel view synthesis. Ablation studies demonstrate the effectiveness of individual components, such as 3D Gaussian consensus and adaptive penalty parameters. The current implementation relies on an RPC module for communication, which might limit flexibility compared to decentralized approaches. Future work could explore incorporating level-of-detail (LOD) techniques to further reduce GPU memory consumption during training. 3d gaussian splatting, large-scale 3d reconstruction, distributed training, novel view synthesis, admm
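A hedged, generic sketch of the Gaussian-consensus idea: each block keeps a local copy of the 3D Gaussians it shares with its neighbours, the master keeps a global copy, and ADMM ties them together. Variable names, the simple averaging update, and the fixed penalty rho are assumptions; the paper's exact penalty and adaptive-rho schedule differ.

```python
import torch

def admm_consensus_step(local_params, duals, rho=1.0):
    """local_params, duals: lists of (N, D) tensors for the shared Gaussians."""
    # 1) Global (master) update: average of local estimates plus scaled duals.
    z = torch.stack([x + u / rho for x, u in zip(local_params, duals)]).mean(dim=0)
    # 2) Dual update on each block.
    new_duals = [u + rho * (x - z) for x, u in zip(local_params, duals)]
    return z, new_duals

def local_objective(x, z, u, rho, render_loss):
    # 3) Each block minimizes its rendering loss plus the augmented-Lagrangian term,
    #    keeping its shared Gaussians close to the global consensus z.
    return render_loss(x) + (u * (x - z)).sum() + 0.5 * rho * ((x - z) ** 2).sum()
```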
2405.13870 Report FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, Chunhua Shen Benefiting from large-scale pre-trained text-to-image (T2I) generative models, impressive progress has been achieved in customized image generation, which aims to generate user-specified concepts. Existing approaches have extensively focused on single-concept customization and still encounter challenges when it comes to complex scenarios that involve combining multiple concepts. These approaches often require retraining/fine-tuning using a few images, leading to time-consuming training processes and impeding their swift implementation. Furthermore, the reliance on multiple images to represent a singular concept increases the difficulty of customization. To this end, we propose FreeCustom, a novel tuning-free method to generate customized images of multi-concept composition based on reference concepts, using only one image per concept as input. Specifically, we introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy that enables the generated image to access and focus more on the reference concepts. In addition, MRSA leverages our key finding that input concepts are better preserved when providing images with context interactions. Experiments show that our method's produced images are consistent with the given concepts and better aligned with the input text. Our method outperforms or performs on par with other training-based methods in terms of multi-concept composition and single-concept customization, but is simpler. Codes can be found at https://github.com/aim-uofa/FreeCustom. This paper presents FreeCustom, a novel tuning-free method for generating customized images with multi-concept composition, using only one image per concept as input. Existing customization methods struggle with multi-concept scenarios, often requiring time-consuming retraining and exhibiting poor identity preservation. FreeCustom addresses these limitations by enabling fast, high-quality generation without any training. The method employs a dual-path architecture with a multi-reference self-attention (MRSA) mechanism and a weighted mask strategy. This enables the generated image to effectively integrate and focus on features from multiple input reference concepts. FreeCustom achieves comparable results to state-of-the-art methods in single-concept customization and shows significant advantages in multi-concept composition. The method demonstrates high fidelity in preserving reference concept identities and strong alignment with input text prompts. FreeCustom is significantly faster than training-based methods, achieving high-quality results in seconds without any preprocessing. The method currently lacks an explicit module for perceiving the structure of input reference concepts. Future work will explore incorporating techniques like image adapters to address this limitation. image generation, customization, text-to-image, diffusion models, multi-concept composition
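A minimal sketch (names and the log-bias weighting are assumptions) of the multi-reference self-attention (MRSA) idea: the generated image's queries attend over its own keys/values concatenated with those of the reference images, and a weighted mask boosts attention inside each reference's concept region while suppressing the rest.

```python
import torch

def mrsa(q, k_self, v_self, k_refs, v_refs, ref_masks, w=3.0):
    """q, k_self, v_self: (N, d); k_refs, v_refs: lists of (N, d); ref_masks: lists of (N,)."""
    k = torch.cat([k_self] + k_refs, dim=0)                 # (N * (1 + R), d)
    v = torch.cat([v_self] + v_refs, dim=0)
    scores = q @ k.t() / (q.shape[-1] ** 0.5)               # (N, N * (1 + R))
    # Weighted mask: log-scale boost inside concept regions, heavy penalty outside.
    bias = torch.cat([torch.zeros(k_self.shape[0])] +
                     [torch.where(m > 0, torch.log(torch.tensor(w)), torch.tensor(-1e4))
                      for m in ref_masks])
    return torch.softmax(scores + bias, dim=-1) @ v
```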
2405.13865 Report ReVideo: Remake a Video with Motion and Content Control Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, Jian Zhang Despite significant advancements in video generation and editing using diffusion models, achieving accurate and localized video editing remains a substantial challenge. Additionally, most existing video editing methods primarily focus on altering visual content, with limited research dedicated to motion editing. In this paper, we present a novel attempt to Remake a Video (ReVideo) which stands out from existing methods by allowing precise video editing in specific areas through the specification of both content and motion. Content editing is facilitated by modifying the first frame, while the trajectory-based motion control offers an intuitive user interaction experience. ReVideo addresses a new task involving the coupling and training imbalance between content and motion control. To tackle this, we develop a three-stage training strategy that progressively decouples these two aspects from coarse to fine. Furthermore, we propose a spatiotemporal adaptive fusion module to integrate content and motion control across various sampling steps and spatial locations. Extensive experiments demonstrate that our ReVideo has promising performance on several accurate video editing applications, i.e., (1) locally changing video content while keeping the motion constant, (2) keeping content unchanged and customizing new motion trajectories, (3) modifying both content and motion trajectories. Our method can also seamlessly extend these applications to multi-area editing without specific training, demonstrating its flexibility and robustness. This paper introduces ReVideo, a novel method for accurate and localized content and motion editing in videos. Existing video editing techniques struggle with precise local control, especially for motion, limiting their ability for realistic and creative edits. ReVideo utilizes a three-stage training strategy to decouple content and motion control, along with a spatiotemporal adaptive fusion module for harmonious integration within a diffusion model framework. ReVideo enables localized content changes while preserving motion or introducing custom trajectories. It surpasses existing methods in user-specified editing accuracy, as demonstrated by both visual and quantitative comparisons. The method shows robustness to irregular editing regions and multi-area editing tasks without specific training. The quality of regenerated video segments depends on the base model's capabilities, which can lead to artifacts. Future work includes extending ReVideo to handle longer videos and address error accumulation over time. video editing, diffusion models, motion editing, local editing, spatiotemporal fusion
2405.13800 Report Dense Connector for MLLMs Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang Do we fully leverage the potential of the visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. This paper introduces the Dense Connector, a plug-and-play module enhancing visual perception in Multimodal Large Language Models (MLLMs) by densely integrating multi-layer visual features. Current MLLM research focuses heavily on the language side, neglecting the potential of visual encoders. This work aims to leverage the overlooked "free lunch" of offline multi-layer features for enhanced visual representation without significant computational overhead. The Dense Connector leverages pre-trained vision encoders and LLMs, connected by a learnable MLP. It offers three instantiations for multi-layer feature integration: Sparse Token Integration (STI), Sparse Channel Integration (SCI), and Dense Channel Integration (DCI). The paper conducts experiments across various vision encoders, LLM sizes, datasets, and image resolutions. Dense Connector significantly improves MLLM performance across 11 image and 8 video benchmarks, achieving state-of-the-art results on several. The approach demonstrates versatility and scalability across different visual encoders, LLM sizes (2B→70B), and training datasets. Densely integrated multi-layer features prove more effective than solely using the final-layer features. The current Dense Connector instantiations do not introduce additional learnable parameters, potentially limiting their effectiveness. Future work will explore more complex and effective Dense Connector implementations and investigate efficient visual-language model connection methods for better modality alignment. multimodal large language models, vision-language models, dense connector, multi-layer visual features, visual understanding
2405.13748 Report Monocular Gaussian SLAM with Language Extended Loop Closure Tian Lan, Qinwei Lin, Haoqian Wang Recently, 3D Gaussian Splatting has shown great potential in visual Simultaneous Localization And Mapping (SLAM). Existing methods have achieved encouraging results on RGB-D SLAM, but studies of the monocular case are still scarce. Moreover, they also fail to correct drift errors due to the lack of loop closure and global optimization. In this paper, we present MG-SLAM, a monocular Gaussian SLAM with a language-extended loop closure module capable of performing drift-corrected tracking and high-fidelity reconstruction while achieving a high-level understanding of the environment. Our key idea is to represent the global map as 3D Gaussians and use it to guide the estimation of the scene geometry, thus mitigating the effects of missing depth information. Further, an additional language-extended loop closure module, which is based on CLIP features, is designed to continually perform global optimization to correct drift errors accumulated as the system runs. Our system shows promising results on multiple challenging datasets in both tracking and mapping and even surpasses some existing RGB-D methods. This paper presents MG-SLAM, a novel monocular Gaussian SLAM system that leverages 3D Gaussian representations for high-fidelity scene reconstruction and incorporates a language-extended loop closure module for drift-corrected tracking. This work addresses the limitations of existing monocular SLAM systems in achieving both accurate tracking and photo-realistic reconstruction, particularly over long sequences where drift errors accumulate. The integration of language understanding further expands the system's potential applications. The system builds upon DPVO, a deep-learning-based visual odometry. It initializes and optimizes 3D Gaussians using predicted patch depths and employs a sliding window strategy for training. A CLIP feature-based loop closure module detects loops and enables global optimization on a back-end graph, correcting drift errors. MG-SLAM achieves competitive tracking accuracy on Replica, ScanNet, TUM RGB-D, and EuRoC datasets, outperforming some existing RGB-D methods. The system demonstrates high-fidelity rendering quality, surpassing previous NeRF-based SLAM approaches. The language-extended loop closure module enables text-to-trajectory querying, highlighting its potential for high-level scene understanding. The performance of the loop closure module might degrade in highly cluttered indoor environments. Future work could explore the integration of semantic information into the mapping process for enhanced scene understanding and navigation. slam, 3d gaussian splatting, scene reconstruction, loop closure, clip
2405.13729 Report ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models Rui Xu, Jiepeng Wang, Hao Pan, Yang Liu, Xin Tong, Shiqing Xin, Changhe Tu, Taku Komura, Wenping Wang In this paper, we study an under-explored but important factor of diffusion generative models, i.e., the combinatorial complexity. Data samples are generally high-dimensional, and for various structured generation tasks, there are additional attributes which are combined to associate with data samples. We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by existing training schemes of diffusion generative models, causing degraded test time performance. We present a simple fix to this problem by constructing stochastic processes that fully exploit the combinatorial structures, hence the name ComboStoc. Using this simple strategy, we show that network training is significantly accelerated across diverse data modalities, including images and 3D structured shapes. Moreover, ComboStoc enables a new way of test time generation which uses unsynchronized time steps for different dimensions and attributes, thus allowing for varying degrees of control over them. This paper presents ComboStoc (Combinatorial Stochasticity), a novel approach to enhance diffusion generative models by explicitly considering the combinatorial complexity arising from dimensions and attributes of data samples. Existing diffusion models lack sufficient training in regions of the path space where dimensions/attributes have asynchronous schedules, leading to poor performance when sampling these regions during inference. The authors introduce asynchronous time steps for different dimensions and attributes during training, enabling the network to explore a wider range of data representations and learn correlations more effectively. ComboStoc consistently improves FID scores compared to baseline SiT models for image generation on ImageNet. ComboStoc proves crucial for generating structured 3D shapes, significantly outperforming baseline methods and producing meaningful results. Asynchronous time steps enable novel applications such as controlled image generation with varying degrees of preservation and structured 3D shape completion/assembly. Quantifying the severity of the undersampling problem in standard diffusion models is left for future work. Exploring better batch time step scheduling for image generation training is an area for future improvement. diffusion generative models, combinatorial complexity, asynchronous time steps, image generation, structured 3d shape generation
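An illustrative sketch of the core idea above: instead of one scalar timestep per sample, each dimension (or attribute group) gets its own timestep, so training covers asynchronous corruption schedules. The linear interpolant and grouping are assumptions for clarity; the paper plugs this into SiT-style training.

```python
import torch

def asynchronous_corrupt(x0: torch.Tensor, group_ids: torch.Tensor):
    """x0: (B, D) clean samples; group_ids: (D,) assigns each dimension to an attribute group."""
    B, D = x0.shape
    G = int(group_ids.max()) + 1
    t_group = torch.rand(B, G)                 # one timestep per (sample, group)
    t = t_group[:, group_ids]                  # broadcast to per-dimension timesteps (B, D)
    noise = torch.randn_like(x0)
    xt = (1.0 - t) * x0 + t * noise            # each dimension sits at its own point on the path
    return xt, t, noise

x0 = torch.randn(8, 12)
group_ids = torch.tensor([0] * 4 + [1] * 4 + [2] * 4)   # e.g., position / geometry / appearance
xt, t, noise = asynchronous_corrupt(x0, group_ids)
# The network then receives the full vector of per-dimension timesteps rather than a
# single scalar, which also enables the paper's asynchronous test-time control.
```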
2405.13722 Report InstaDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos Yujun Shi, Jun Hao Liew, Hanshu Yan, Vincent Y. F. Tan, Jiashi Feng Accuracy and speed are critical in image editing tasks. Pan et al. introduced a drag-based image editing framework that achieves pixel-level control using Generative Adversarial Networks (GANs). A flurry of subsequent studies enhanced this framework's generality by leveraging large-scale diffusion models. However, these methods often suffer from inordinately long processing times (exceeding 1 minute per edit) and low success rates. Addressing these issues head on, we present InstaDrag, a rapid approach enabling high quality drag-based image editing in ~1 second. Unlike most previous methods, we redefine drag-based editing as a conditional generation task, eliminating the need for time-consuming latent optimization or gradient-based guidance during inference. In addition, the design of our pipeline allows us to train our model on large-scale paired video frames, which contain rich motion information such as object translations, changing poses and orientations, zooming in and out, etc. By learning from videos, our approach can significantly outperform previous methods in terms of accuracy and consistency. Despite being trained solely on videos, our model generalizes well to perform local shape deformations not presented in the training data (e.g., lengthening of hair, twisting rainbows, etc.). Extensive qualitative and quantitative evaluations on benchmark datasets corroborate the superiority of our approach. The code and model will be released at https://github.com/magic-research/InstaDrag. InstaDrag, a fast and high-quality drag-based image editing approach that achieves results in under one second. Existing drag-based image editing methods suffer from slow processing times and low success rates, limiting practical use. Reframes drag-based editing as a conditional generation task using a reference-only architecture and point embedding network trained on large-scale video data. Achieves state-of-the-art accuracy in point following and appearance preservation as measured by Mean Distance and Image Fidelity metrics. Significantly faster than previous methods, achieving editing speeds of under one second, even faster with acceleration techniques. Generalizes well to out-of-domain editing instructions not explicitly present in the training data, such as local deformations. Inherits limitations of Stable Diffusion V1.5, such as difficulties with details in complex features. Future work could explore using larger diffusion models like SDXL for improved detail. image editing, drag-based editing, diffusion models, video data, conditional generation
2405.13685 Report Prompt Mixing in Diffusion Models using the Black Scholes Algorithm Divya Kothandaraman, Ming Lin, Dinesh Manocha We introduce a novel approach for prompt mixing, aiming to generate images at the intersection of multiple text prompts using pre-trained text-to-image diffusion models. At each time step during diffusion denoising, our algorithm forecasts predictions w.r.t. the generated image and makes informed text conditioning decisions. To do so, we leverage the connection between diffusion models (rooted in non-equilibrium thermodynamics) and the Black-Scholes model for pricing options in Finance, and draw analogies between the variables in both contexts to derive an appropriate algorithm for prompt mixing using the Black Scholes model. Specifically, the parallels between diffusion models and the Black-Scholes model enable us to leverage properties related to the dynamics of the Markovian model derived in the Black-Scholes algorithm. Our prompt-mixing algorithm is data-efficient, meaning it does not need additional training. Furthermore, it operates without human intervention or hyperparameter tuning. We highlight the benefits of our approach by comparing it qualitatively and quantitatively to other prompt mixing techniques, including linear interpolation, alternating prompts, step-wise prompt switching, and CLIP-guided prompt selection across various scenarios such as single object per text prompt, multiple objects per text prompt and objects against backgrounds. Code is available at https://github.com/divyakraman/BlackScholesDiffusion2024. This paper introduces a novel prompt mixing technique for text-to-image diffusion models, inspired by the Black-Scholes model from finance, which dynamically conditions on the most relevant text prompt during each denoising step to generate images reflecting multiple input concepts. Prompt mixing is important for generating images that blend different textual concepts, going beyond simple combinations. Existing techniques often require manual effort, lack dynamic prompt prioritization, or overlook diffusion dynamics. The method draws an analogy between diffusion models and the Black-Scholes model, treating image generation as "asset acquisition." It uses the CLIP score as a measure of "stock price" and leverages diffusion dynamics to calculate Black-Scholes variables. At each denoising step, it conditions the model on the prompt with the lowest Black-Scholes score, indicating the concept requiring most attention. The proposed Black-Scholes-based method outperforms baselines like linear interpolation, alternating prompts, and CLIP-guided selection in generating realistic and concept-blending images. It effectively preserves individual characteristics from multiple prompts while minimizing unrealistic artifacts. Quantitative evaluation using CLIP scores demonstrates superior performance compared to other techniques. The reliance on CLIP scores for evaluation has limitations as it might not capture subtle quality differences and is prone to biases. The study focuses on two-prompt mixing, limiting its generalizability to a larger number of prompts. prompt mixing, text-to-image diffusion, black-scholes model, clip score, generative ai
2405.13637 Report Curriculum Direct Preference Optimization for Diffusion and Consistency Models Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). In this paper, we propose a novel and enhanced version of DPO based on curriculum learning for text-to-image generation. Our method is divided into two training stages. First, a ranking of the examples generated for each prompt is obtained by employing a reward model. Then, increasingly difficult pairs of examples are sampled and provided to a text-to-image generative (diffusion or consistency) model. Generated samples that are far apart in the ranking are considered to form easy pairs, while those that are close in the ranking form hard pairs. In other words, we use the rank difference between samples as a measure of difficulty. The sampled pairs are split into batches according to their difficulty levels, which are gradually used to train the generative model. Our approach, Curriculum DPO, is compared against state-of-the-art fine-tuning approaches on three benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://anonymous.4open.science/r/Curriculum-DPO-EE14. Proposes Curriculum DPO, a novel training regime for diffusion and consistency models that enhances Direct Preference Optimization (DPO) with curriculum learning for improved text-to-image generation. Aims to address the limitations of existing DPO methods that randomly sample image pairs during training, leading to suboptimal performance in text alignment, aesthetics, and human preference. Implements a two-stage training process: 1) uses a reward model to rank generated images by preference, 2) creates easy-to-hard image pairs based on ranking difference and trains the generative model progressively with these pairs. Curriculum DPO outperforms state-of-the-art fine-tuning methods (DPO, DDPO) in text alignment, aesthetics, and human preference scores across three benchmarks. Subjective human evaluation confirms Curriculum DPO generates significantly more preferred samples compared to baselines. Ablation studies demonstrate the effectiveness of curriculum learning and the impact of hyperparameter choices. Introduces additional hyperparameters (e.g., number of batches, iterations per batch) requiring tuning. Doesn't address the inherent limitation of text-to-image models in disambiguating words with multiple meanings, which can lead to poor generation results in certain cases. text-to-image generation, curriculum learning, direct preference optimization, diffusion models, consistency models
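A hedged sketch of the curriculum pair construction described above: images generated for a prompt are ranked by a reward model, all (winner, loser) pairs are formed, and pairs are split into difficulty batches by rank difference, with large gaps (easy) used first and small gaps (hard) last. The batch boundaries are an assumption.

```python
from itertools import combinations

def curriculum_pairs(rewards, num_batches=3):
    """rewards: reward-model scores for the K images of one prompt.
    Returns num_batches lists of (winner_idx, loser_idx), ordered easy -> hard."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    rank = {idx: r for r, idx in enumerate(order)}          # 0 = best image
    pairs = []
    for i, j in combinations(range(len(rewards)), 2):
        win, lose = (i, j) if rewards[i] >= rewards[j] else (j, i)
        pairs.append((abs(rank[win] - rank[lose]), win, lose))
    pairs.sort(key=lambda p: -p[0])                          # biggest rank gap first (easiest)
    per_batch = max(1, len(pairs) // num_batches)
    return [[(w, l) for _, w, l in pairs[b * per_batch:(b + 1) * per_batch]]
            for b in range(num_batches)]

batches = curriculum_pairs([0.9, 0.2, 0.7, 0.4])             # 4 sampled images -> 6 pairs
# DPO training then proceeds on batches[0] first (easy) and moves to harder batches.
```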
2405.13540 Report Directly Denoising Diffusion Model Dan Zhang, Jingjing Wang, Feng Luo In this paper, we present the Directly Denoising Diffusion Model (DDDM): a simple and generic approach for generating realistic images with few-step sampling, while multistep sampling is still preserved for better performance. DDDMs require no delicately designed samplers nor distillation on pre-trained distillation models. DDDMs train the diffusion model conditioned on an estimated target that was generated from previous training iterations of its own. To generate images, samples generated from the previous time step are also taken into consideration, guiding the generation process iteratively. We further propose Pseudo-LPIPS, a novel metric loss that is more robust to various values of hyperparameter. Despite its simplicity, the proposed approach can achieve strong performance in benchmark datasets. Our model achieves FID scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling respectively, surpassing those obtained from GANs and distillation-based models. By extending the sampling to 1000 steps, we further reduce FID score to 1.79, aligning with state-of-the-art methods in the literature. For ImageNet 64x64, our approach stands as a competitive contender against leading models. This paper presents Directly Denoising Diffusion Models (DDDM), a novel approach for generating high-quality images with both single-step and multi-step sampling, without needing specially designed samplers or distillation. Diffusion models typically require many steps for high-quality generation, making them slow. DDDM enables both efficient single-step generation comparable to GANs and improved quality with iterative sampling. DDDM iteratively refines an estimate of the original data by training a neural network to approximate the solution of the probability flow ODE. It uses a novel Pseudo-LPIPS loss function for robustness. DDDM achieves FID scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling respectively, surpassing GANs and distillation methods. On ImageNet 64x64, DDDM is competitive with leading models, showing strong FID scores and improved precision/recall compared to iCT. Increasing sampling steps in DDDM consistently improves FID, demonstrating the benefit of its iterative approach. DDDM's training incurs additional memory overhead due to storing data estimates. Evaluation might be biased by using ImageNet features in both LPIPS and FID. diffusion models, image generation, fast sampling, pseudo-lpips, iterative solution
2405.13473 Report Class-Conditional self-reward mechanism for improved Text-to-Image models Safouane El Ghazouali, Arnaud Gucciardi, Umberto Michelucci Self-rewarding has recently emerged as a powerful tool in the field of Natural Language Processing (NLP), allowing language models to generate high-quality, relevant responses by providing their own rewards during training. This innovative technique addresses the limitations of other methods that rely on human preferences. In this paper, we build upon the concept of self-rewarding models and introduce its vision equivalent for Text-to-Image generative AI models. This approach works by fine-tuning a diffusion model on a self-generated, self-judged dataset, making the fine-tuning more automated and improving data quality. The proposed mechanism makes use of other pre-trained models, such as vocabulary-based object detection and image captioning, and is conditioned on a set of objects for which the user might need to improve generated data quality. The approach has been implemented, fine-tuned and evaluated on Stable Diffusion, and its performance has been evaluated to be at least 60% better than existing commercial and research Text-to-Image models. Additionally, the self-rewarding mechanism allowed fully automated generation of images, while increasing the visual quality of the generated images and improving adherence to prompt instructions. The code used in this work is freely available on https://github.com/safouaneelg/SRT2I. This paper introduces a novel 'class-conditional self-rewarding' (CCSR) mechanism for automating the optimization of Text-to-Image (T2I) models, enhancing their ability to generate images that accurately reflect specific object classes and prompts. Current T2I model fine-tuning often relies on human feedback and reinforcement learning, which can be resource-intensive and prone to biases. This paper aims to automate this process and improve the quality of generated images. The CCSR mechanism utilizes a multi-step process: 1) LLM-based prompt generation, 2) Multi-image generation from prompts, 3) Image-to-Text (I2T) based self-judging of generated images, 4) Open-vocabulary object detection for filtering, 5) Selection of optimal image-prompt pairs, 6) Fine-tuning of the T2I model (Stable Diffusion) using the selected pairs. The CCSR mechanism leads to a significant improvement in the quality of generated images, particularly in terms of realism, prompt adherence, and accurate depiction of the targeted object class. Fine-tuning Stable Diffusion with the CCSR-generated dataset resulted in a higher CLIP score compared to the baseline Stable Diffusion and a fine-tuned SDXS model. The proposed method allows for complete automation of T2I model improvement without requiring human intervention. The 'class-conditional' nature of the mechanism might limit its generalizability to broader image-text relationships beyond the specifically trained object classes. Continuous application of the CCSR loop with diverse classes is suggested for enhancing the model's overall generalizability. text-to-image synthesis, self-rewarding models, diffusion models, image captioning, open-vocabulary object detection
2405.13360 Report How to Trace Latent Generative Model Generated Images without Artificial Watermark? Zhenting Wang, Vikash Sehwag, Chen Chen, Lingjuan Lyu, Dimitris N. Metaxas, Shiqing Ma Latent generative models (e.g., Stable Diffusion) have become more and more popular, but concerns have arisen regarding potential misuse related to images generated by these models. It is, therefore, necessary to analyze the origin of images by inferring if a particular image was generated by a specific latent generative model. Most existing methods (e.g., image watermark and model fingerprinting) require extra steps during training or generation. These requirements restrict their usage on the generated images without such extra operations, and the extra required operations might compromise the quality of the generated images. In this work, we ask whether it is possible to effectively and efficiently trace the images generated by a specific latent generative model without the aforementioned requirements. To study this problem, we design a latent inversion based method called LatentTracer to trace the generated images of the inspected model by checking if the examined images can be well-reconstructed with an inverted latent input. We leverage gradient based latent inversion and identify an encoder-based initialization critical to the success of our approach. Our experiments on the state-of-the-art latent generative models, such as Stable Diffusion, show that our method can distinguish the images generated by the inspected model and other images with a high accuracy and efficiency. Our findings suggest the intriguing possibility that today's latent generative model generated images are naturally watermarked by the decoder used in the source models. Code: https://github.com/ZhentingWang/LatentTracer. This paper introduces LatentTracer, an alteration-free method for tracing images generated by a specific latent generative model. It leverages latent inversion, focusing on the inherent watermarking properties of the model's decoder. Tracing the origin of images generated by latent generative models is crucial to address potential misuse, such as the spread of harmful content or intellectual property infringement. LatentTracer utilizes a gradient-based optimization approach to reconstruct the examined image by inverting the latent input of the inspected model's decoder. The key innovation lies in using the encoder to initialize the optimization process, significantly enhancing effectiveness and efficiency. LatentTracer achieves high accuracy (over 93%) in distinguishing between images generated by the inspected model and those from other models, even with similar architectures. The method proves effective in differentiating between model-generated images and real images. LatentTracer exhibits efficiency, outperforming existing alteration-free methods in terms of runtime. The method's performance in scenarios where models share the same autoencoder requires further investigation. Future work could explore the robustness against strong post-processing techniques that significantly alter the image while preserving visual quality. image origin attribution, latent generative models, latent inversion, alteration-free watermarking, responsible ai
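A minimal sketch of the inversion loop described above (threshold, step count, and optimizer settings are assumptions): start from the inspected model's encoder output, then optimize the latent so the decoder reproduces the examined image; a low final reconstruction error suggests the image came from this decoder, a high one suggests it did not.

```python
import torch

def latent_trace(image, encoder, decoder, steps=200, lr=0.05, threshold=1e-3):
    with torch.no_grad():
        z = encoder(image)                         # encoder-based initialization (key to the method)
    z = z.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(decoder(z), image)
        loss.backward()
        opt.step()
    return loss.item() < threshold, loss.item()    # (attributed_to_model, reconstruction error)
```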
2405.13337 Report Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He The Vision Transformer (ViT) has gained prominence for its superior relational modeling prowess. However, its global attention mechanism's quadratic complexity poses substantial computational burdens. A common remedy spatially groups tokens for self-attention, reducing computational requirements. Nonetheless, this strategy neglects semantic information in tokens, possibly scattering semantically-linked tokens across distinct groups, thus compromising the efficacy of self-attention intended for modeling inter-token dependencies. Motivated by these insights, we introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. In contrast to traditional clustering methods requiring multiple iterations, our method achieves token clustering in a single pass. Additionally, SEC regulates the number of tokens per cluster, ensuring a balanced distribution for effective parallel processing on current computational platforms without necessitating further optimization. Capitalizing on SEC, we propose a versatile vision backbone, SecViT. Comprehensive experiments in image classification, object detection, instance segmentation, and semantic segmentation validate the effectiveness of SecViT. Remarkably, SecViT attains an impressive 84.2% image classification accuracy with only 27M parameters and 4.4G FLOPs, without the need for additional supervision or data. Code will be available at https://github.com/qhfan/SecViT. This paper introduces Semantic Equitable Clustering (SEC), a novel, efficient single-pass clustering method that groups tokens based on global semantic relevance for Vision Transformers (ViT), leading to enhanced computational efficiency and performance in various vision tasks. The quadratic complexity of global attention in ViTs poses significant computational challenges. While token grouping methods address this, they often overlook semantic relationships, hindering effective modeling of inter-token dependencies. SEC offers a solution by efficiently clustering tokens based on semantic information, optimizing both computational efficiency and performance. SEC employs global pooling to derive a global semantic token. It then calculates cosine similarity between this token and all others, sorting them based on similarity scores. Tokens with similar scores are grouped into clusters, ensuring an equal distribution for efficient parallel processing. SecViT, built upon SEC, consistently surpasses previous state-of-the-art models in image classification across different model scales. Directly replacing attention mechanisms in Swin-Transformer and FasterViT with SEC leads to significant performance gains in image classification. SecViT exhibits impressive performance on downstream tasks such as object detection, instance segmentation, and semantic segmentation. Computational constraints limit experimentation with larger models and datasets like ImageNet-21k. Future work involves exploring the scalability of SEC on larger datasets and models. vision transformer, token clustering, semantic equitable clustering, computational efficiency, computer vision
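A direct sketch of the clustering step as described above: a single pass that scores every token against a global pooled token, sorts by similarity, and chops the sorted order into equally sized clusters for grouped attention. Tensor shapes are illustrative; this is not the authors' code.

```python
import torch
import torch.nn.functional as F

def semantic_equitable_clustering(tokens: torch.Tensor, num_clusters: int):
    """tokens: (N, C). Returns a (num_clusters, N // num_clusters) index tensor."""
    N, C = tokens.shape
    assert N % num_clusters == 0, "clusters are balanced by construction"
    global_token = tokens.mean(dim=0, keepdim=True)                  # global pooling
    sim = F.cosine_similarity(tokens, global_token, dim=-1)          # (N,)
    order = sim.argsort(descending=True)                             # single pass, no iterations
    return order.view(num_clusters, N // num_clusters)               # equal-sized groups

tokens = torch.randn(196, 64)                                        # e.g., 14x14 patch tokens
clusters = semantic_equitable_clustering(tokens, num_clusters=4)
# Self-attention is then computed independently within each cluster of 49 tokens,
# and the equal cluster sizes keep the groups batchable on GPU without padding.
```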
2405.13335 Report Vision Transformer with Sparse Scan Prior Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism (S³A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on S³A, we introduce the Sparse Scan Vision Transformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of 84.4%/85.7% with 4.4G/18.2G FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets. Code will be available at https://github.com/qhfan/SSViT. This paper proposes Sparse Scan Self-Attention (S³A), a novel self-attention mechanism inspired by the sparse scanning mechanism of the human eye, and builds Sparse Scan Vision Transformer (SSViT) based on it. Existing Vision Transformer models often suffer from high computational costs associated with their self-attention mechanisms. While several strategies have been proposed to improve efficiency, they often deviate significantly from the efficient visual information processing employed by the human eye. The S³A mechanism defines Anchors of Interest (AoI) for each token and uses local attention to model spatial information around these anchors. This approach, mimicking the human eye, reduces computational load while effectively capturing both local and global information. Extensive experiments are conducted on ImageNet classification, object detection, instance segmentation, and semantic segmentation to demonstrate SSViT's effectiveness and efficiency. SSViT achieves state-of-the-art accuracy on ImageNet classification with significantly reduced computational cost compared to previous models. The model excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation, outperforming counterparts across different benchmark datasets. SSViT demonstrates strong robustness against out-of-distribution data, showcasing its ability to generalize well beyond the training dataset. Computational constraints limited the exploration of SSViT on larger models and datasets, such as ImageNet-21k. Future work will focus on validating the performance of SSViT on such large-scale datasets. vision transformer, self-attention, sparse scan, computer vision, deep learning
2405.13218 Report Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction Maciej Kilian, Varun Japan, Luke Zettlemoyer Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following. On image quality, while next-token prediction initially performs better, scaling trends suggest it is eventually matched by diffusion. We compare the inference compute efficiency of each approach and find that next token prediction is by far the most efficient. Based on our findings we recommend diffusion for applications targeting image quality and low latency; and next-token prediction when prompt following or throughput is more important. This paper presents a compute-controlled comparison of transformer-based diffusion, masked-token prediction, and next-token prediction for latent image synthesis. Despite the common Transformer backbone in recent image synthesis approaches, there lacks a direct comparison of their performance and efficiency under controlled compute budgets. The authors train a grid of models using these approaches, varying model sizes, dataset sizes, and autoencoder configurations. They evaluate the models based on training compute (FLOPs), final loss, CLIP score, and FID, analyzing scalability and trade-offs. Token-based methods, especially next-token prediction, outperform diffusion on prompt following (CLIP score), indicating better controllability. While next-token prediction achieves better image quality (FID) at lower compute budgets, scaling trends suggest diffusion might eventually match it. Next-token prediction exhibits superior inference compute efficiency but can suffer from high latency in low-volume settings due to autoregressive sampling. The study primarily focuses on pretraining and does not cover finetuning or distillation stages. The analysis is limited to loss and perceptual metrics, excluding potential downstream task evaluation or comparisons with other emerging approaches. image synthesis, diffusion models, token prediction, transformers, compute efficiency
2405.13195 Report CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers Andrew Marmon, Grant Schindler, José Lezama, Dan Kondratyuk, Bryan Seybold, Irfan Essa We extend multimodal transformers to include 3D camera motion as a conditioning signal for the task of video generation. Generative video models are becoming increasingly powerful, thus focusing research efforts on methods of controlling the output of such models. We propose to add virtual 3D camera controls to generative video methods by conditioning generated video on an encoding of three-dimensional camera movement over the course of the generated video. Results demonstrate that we are (1) able to successfully control the camera during video generation, starting from a single frame and a camera signal, and (2) we demonstrate the accuracy of the generated 3D camera paths using traditional computer vision methods. This paper introduces a method for controlling 3D camera motion in video generation models by conditioning the output on an encoding of 3D camera movement. This work addresses the limitations of existing video generation models, which often entangle scene dynamics and camera movement. Explicit control over camera motion enables more realistic and controllable video generation. The authors extend a token-based video transformer model by incorporating 3D camera path information as a new modality. They generate training data using NeRF scenes to provide ground truth camera paths and fine-tune the model to follow these paths during video generation. The method successfully controls the 3D camera movement during video generation, starting from a single frame and a camera signal. Generated videos exhibit parallax and maintain the in-painting and out-painting abilities of the pre-trained video generation model. There is a trade-off between controlling camera motion and preserving scene motion from the pre-trained model. The model exhibits reduced scene motion when fine-tuned for camera control, likely due to the lack of scene dynamics in the NeRF training data. Future work could explore methods to better balance camera control and scene motion preservation, potentially by incorporating scene dynamics into the NeRF training data. video generation, camera control, 3d motion, nerf, video transformer
2405.13194 Report KPConvX: Modernizing Kernel Point Convolution with Kernel Attention Hugues Thomas, Yao-Hung Hubert Tsai, Timothy D. Barfoot, Jian Zhang In the field of deep point cloud understanding, KPConv is a unique architecture that uses kernel points to locate convolutional weights in space, instead of relying on Multi-Layer Perceptron (MLP) encodings. While it initially achieved success, it has since been surpassed by recent MLP networks that employ updated designs and training strategies. Building upon the kernel point principle, we present two novel designs: KPConvD (depthwise KPConv), a lighter design that enables the use of deeper architectures, and KPConvX, an innovative design that scales the depthwise convolutional weights of KPConvD with kernel attention values. Using KPConvX with a modern architecture and training strategy, we are able to outperform current state-of-the-art approaches on the ScanObjectNN, Scannetv2, and S3DIS datasets. We validate our design choices through ablation studies and release our code and models. This paper presents KPConvX, an efficient point cloud feature extractor combining depthwise convolution and kernel attention, achieving state-of-the-art performance in semantic segmentation and shape classification. Existing methods for deep point cloud understanding, including those based on MLPs or transformers, often struggle to efficiently capture geometric patterns. This work addresses this limitation. The authors introduce KPConvD, a depthwise variant of KPConv, and further enhance it with kernel attention to create KPConvX. They design a modern deep architecture, KPConvX-L, using these novel operators. KPConvX-L outperforms state-of-the-art methods on ScanObjectNN, Scannetv2, and S3DIS datasets. Ablation studies demonstrate the individual contributions of depthwise convolution, kernel attention, and architectural choices. The proposed approach achieves a good balance between high performance and computational efficiency. Further research is needed to understand the interplay between topological and geometric feature extractors in deep learning. Exploring the combination of topological and geometric features in a single architecture is a promising avenue. point cloud, deep learning, convolutional neural networks, attention mechanism, semantic segmentation
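As a rough illustration of the kernel-point idea summarized above (not the authors' implementation): each neighbor's feature is weighted by its proximity to a set of kernel points, aggregated per channel (depthwise), and then scaled by per-kernel attention values. The linear influence function, shapes, and random inputs below are assumptions of this sketch.

```python
import numpy as np

def kpconvx_toy(center_feat, neigh_xyz, neigh_feat, kernel_pts, w_dw, attn, sigma=0.5):
    """Toy depthwise kernel-point aggregation with kernel attention.

    center_feat: (C,)  feature of the center point (defines the output size here)
    neigh_xyz:   (N,3) neighbor coordinates relative to the center
    neigh_feat:  (N,C) neighbor features
    kernel_pts:  (K,3) kernel point positions around the center
    w_dw:        (K,C) depthwise weights, one vector per kernel point
    attn:        (K,)  attention values, e.g. predicted from center_feat
    """
    # Linear influence of each kernel point on each neighbor: (N, K).
    d = np.linalg.norm(neigh_xyz[:, None, :] - kernel_pts[None, :, :], axis=-1)
    infl = np.maximum(0.0, 1.0 - d / sigma)
    out = np.zeros_like(center_feat)
    for k in range(kernel_pts.shape[0]):
        agg = (infl[:, k:k + 1] * neigh_feat).sum(axis=0)  # (C,)
        out += attn[k] * w_dw[k] * agg                     # depthwise weights, scaled by attention
    return out

rng = np.random.default_rng(0)
out = kpconvx_toy(rng.normal(size=16), 0.3 * rng.normal(size=(8, 3)), rng.normal(size=(8, 16)),
                  0.3 * rng.normal(size=(5, 3)), rng.normal(size=(5, 16)), rng.random(5))
print(out.shape)  # (16,)
```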
2405.12978 Report Personalized Residuals for Concept-Driven Text-to-Image Generation Cusuh Ham, Matthew Fisher, James Hays, Nicholas Kolkin, Yuchen Liu, Richard Zhang, Tobias Hinz We present personalized residuals and localized attention-guided sampling for efficient concept-driven generation using text-to-image diffusion models. Our method first represents concepts by freezing the weights of a pretrained text-conditioned diffusion model and learning low-rank residuals for a small subset of the model's layers. The residual-based approach then directly enables application of our proposed sampling technique, which applies the learned residuals only in areas where the concept is localized via cross-attention and applies the original diffusion weights in all other regions. Localized sampling therefore combines the learned identity of the concept with the existing generative prior of the underlying diffusion model. We show that personalized residuals effectively capture the identity of a concept in ~3 minutes on a single GPU without the use of regularization images and with fewer parameters than previous models, and localized sampling allows using the original model as strong prior for large parts of the image. This paper proposes 'personalized residuals' and 'localized attention-guided sampling' for efficient concept-driven generation using text-to-image diffusion models. Existing text-to-image models struggle to consistently generate specific user-defined concepts in novel contexts. This work aims to improve the efficiency and controllability of concept-driven generation. The method learns low-rank residuals for a subset of diffusion model layers to represent a specific concept. During sampling, these residuals can be applied locally based on cross-attention maps, allowing for better integration of the concept with the base diffusion model's generative capabilities. Personalized residuals effectively capture concept identity using minimal parameters and training time, without needing regularization images. Localized attention-guided sampling enables better recontextualization of learned concepts by selectively applying personalized residuals based on attention maps. User studies and quantitative evaluations show the approach achieves comparable or better performance than existing baselines in terms of text alignment, image alignment, and user preference. Localized sampling's effectiveness depends on the quality of cross-attention maps and may not be optimal for all types of prompts. The method can be sensitive to the choice of macro class used to represent the concept. text-to-image generation, diffusion models, concept-driven synthesis, personalized residuals, attention-guided sampling
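A hedged sketch of the two mechanisms described in this entry, with invented shapes and names: a frozen projection W receives a learned low-rank residual B @ A, and the residual path is applied only where a cross-attention-derived concept mask is active, leaving the frozen prior in charge everywhere else. This illustrates the idea, not the authors' code.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank, n_tokens = 64, 64, 4, 10

W = torch.randn(d_out, d_in)             # frozen pretrained projection
A = torch.randn(rank, d_in) * 0.01       # learned low-rank residual factors
B = torch.randn(d_out, rank) * 0.01

x = torch.randn(n_tokens, d_in)                        # token features
concept_mask = (torch.rand(n_tokens) > 0.5).float()    # stand-in for a cross-attention-derived mask

base = x @ W.T                         # frozen path (original generative prior)
personalized = x @ (W + B @ A).T       # residual path (learned concept identity)
out = concept_mask[:, None] * personalized + (1 - concept_mask[:, None]) * base
print(out.shape)  # torch.Size([10, 64])
```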
2405.12970 Report Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, Yong Liu Current face reenactment and swapping methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the results have not yet achieved satisfactory performance levels. To address this issue, we introduce Face-Adapter, an efficient and effective adapter designed for high-precision and high-fidelity face editing for pre-trained diffusion models. We observe that both face reenactment/swapping tasks essentially involve combinations of target structure, ID and attribute. We aim to sufficiently decouple the control of these factors to achieve both tasks in one model. Specifically, our method contains: 1) A Spatial Condition Generator that provides precise landmarks and background; 2) A Plug-and-play Identity Encoder that transfers face embeddings to the text space by a transformer decoder. 3) An Attribute Controller that integrates spatial conditions and detailed attributes. Face-Adapter achieves comparable or even superior performance in terms of motion control precision, ID retention capability, and generation quality compared to fully fine-tuned face reenactment/swapping models. Additionally, Face-Adapter seamlessly integrates with various StableDiffusion models. This paper presents Face-Adapter, a plug-and-play adapter for pre-trained diffusion models that enables fine-grained control over identity and attributes for face reenactment and swapping tasks. Existing GAN-based face editing methods have limitations in generative capabilities, while diffusion-based methods are resource-intensive to train. Face-Adapter leverages the power of pre-trained diffusion models while remaining efficient and achieving high-quality results. Face-Adapter consists of three components: 1) Spatial Condition Generator predicts landmarks and adapts the foreground mask. 2) Identity Encoder transfers face embeddings to the text space. 3) Attribute Controller combines spatial and attribute information for conditional inpainting. Face-Adapter achieves comparable or superior results in image quality and motion control accuracy for face reenactment compared to SOTA methods. For face swapping, Face-Adapter effectively handles large facial shape changes and large poses, outperforming existing methods in identity preservation and attribute consistency. The method is efficient and plug-and-play, only requiring fine-tuning of the adapter while freezing the pre-trained diffusion model. The unified model lacks temporal stability for video face editing, which will be addressed in future work. Potential misuse of the technology for malicious purposes is a concern. face reenactment, face swapping, diffusion model, conditional inpainting, face editing
2405.12914 Report An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve the language understanding in text-to-image generation. Unfortunately, training text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates the existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual but also longer input context with superior image generation quality. This paper proposes an efficient and effective three-stage training pipeline to integrate Large Language Models (LLMs) into text-to-image diffusion models for enhanced language understanding and multilingual generation. Existing text-to-image models often rely on CLIP's text encoder, limiting them to English input, short prompts, and potentially hindering generation quality due to CLIP's smaller capacity compared to LLMs. The pipeline consists of: (1) aligning LLM text features with CLIP's visual-textual space using a lightweight adapter, (2) end-to-end text-image training to optimize the adapter and the diffusion model, and (3) fine-tuning on a high-quality dataset for improved aesthetics. The model achieves competitive FID/CLIP scores on various benchmarks, demonstrating high synthesis quality and text alignment. Supports multilingual text-to-image generation, including Chinese, Japanese, and Korean. Successfully generates images from longer prompts, exceeding CLIP's limitation of 77 tokens. Human evaluation, while showing preference for the model's outputs, was limited in scale and inherently subjective. The model's performance depends on the quality and diversity of the training data, potentially struggling with objects or concepts not well-represented in the data. text-to-image generation, large language models, diffusion models, multilingual generation, long-prompt generation
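A minimal sketch of the stage-one "lightweight adapter" idea, under assumed sizes: LLM token features are projected to the width the diffusion model's cross-attention expects (originally CLIP's text space), which also lifts the 77-token limit since the LLM can encode longer prompts. The TextFeatureAdapter module and all dimensions here are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TextFeatureAdapter(nn.Module):
    """Maps LLM hidden states to the width the diffusion model's cross-attention
    expects (hypothetical sizes; the real adapter may differ)."""
    def __init__(self, llm_dim=4096, clip_dim=768, hidden=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, hidden), nn.GELU(), nn.Linear(hidden, clip_dim),
        )
        self.norm = nn.LayerNorm(clip_dim)

    def forward(self, llm_hidden_states):               # (B, T, llm_dim)
        return self.norm(self.proj(llm_hidden_states))  # (B, T, clip_dim)

adapter = TextFeatureAdapter()
llm_feats = torch.randn(2, 128, 4096)   # stand-in for frozen-LLM features; T=128 exceeds CLIP's 77 tokens
print(adapter(llm_feats).shape)         # torch.Size([2, 128, 768])
```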
2405.12806 Report MOSS: Motion-based 3D Clothed Human Synthesis from Monocular Video Hongsheng Wang, Xiang Cai, Xi Sun, Jinhong Yue, Shengyu Zhang, Feng Lin, Fei Wu Single-view clothed human reconstruction holds a central position in virtual reality applications, especially in contexts involving intricate human motions. It presents notable challenges in achieving realistic clothing deformation. Current methodologies often overlook the influence of motion on surface deformation, resulting in surfaces lacking the constraints imposed by global motion. To overcome these limitations, we introduce an innovative framework, Motion-Based 3D Clothed Humans Synthesis (MOSS), which employs kinematic information to achieve motion-aware Gaussian split on the human surface. Our framework consists of two modules: Kinematic Gaussian Locating Splatting (KGAS) and Surface Deformation Detector (UID). KGAS incorporates matrix-Fisher distribution to propagate global motion across the body surface. The density and rotation factors of this distribution explicitly control the Gaussians, thereby enhancing the realism of the reconstructed surface. Additionally, to address local occlusions in single-view, based on KGAS, UID identifies significant surfaces, and geometric reconstruction is performed to compensate for these deformations. Experimental results demonstrate that MOSS achieves state-of-the-art visual quality in 3D clothed human synthesis from monocular videos. Notably, we improve the Human NeRF and the Gaussian Splatting by 33.94% and 16.75% in LPIPS* respectively. Codes are available at https://wanghongsheng01.github.io/MOSS/. This paper presents MOSS, a novel framework for high-quality, motion-aware 3D clothed human reconstruction from monocular videos using Gaussian Splatting. Existing methods struggle to realistically reconstruct fine details like clothing folds and joint deformations, especially under large-scale motions. MOSS uses two key modules: KGAS to control Gaussian density and orientation based on motion information extracted from the SMPL kinematic tree, and UID to detect and refine significant surface deformations. MOSS achieves state-of-the-art visual quality on ZJU-MoCap and MonoCap datasets, outperforming previous methods in LPIPS* and PSNR. KGAS effectively incorporates global motion constraints, leading to realistic joint details and clothing folds. UID enhances the reconstruction of complex surface deformations by identifying and densifying Gaussians in those areas. MOSS currently relies on pre-computed SMPL parameters and camera information. Future work includes incorporating graph-based topological guidance for improved reconstruction. 3d gaussian splatting, human reconstruction, surface deformation, motion-aware, single-view
2405.12796 Report DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for single subject, suffering from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle the problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee the subject occurrence and preserve their visual attributes, and the third strategy helps the model maintain the temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications. Proposes DisenStudio, a novel framework for generating customized videos with multiple user-provided subjects and their desired actions, addressing limitations of existing single-subject methods. Existing methods struggle to generate videos with multiple customized subjects due to subject-missing, attribute-binding, and action-binding problems, hindering the creation of diverse and personalized video content. Enhances a pretrained diffusion-based text-to-video model with spatial-disentangled cross-attention (SDCA) to independently control subjects and their actions. Introduces motion-preserved disentangled finetuning with multi-subject co-occurrence, masked single-subject, and motion-preserved tuning strategies to ensure subject presence, preserve visual attributes, and maintain motion generation ability. Significantly outperforms baselines in subject fidelity (DINO), textual alignment (CLIP-T), and temporal consistency. Enables precise control over subject actions and positions within the video. Demonstrates potential for various applications, including storytelling with customized characters. Limited to the base model's video length and resolution, hindering the generation of longer videos with more complex scenarios and higher subject fidelity. Relies on the pretrained model's motion repertoire, limiting customization of specific subject motions. text-to-video generation, subject customization, diffusion models, disentanglement, spatial control
2405.12663 Report LAGA: Layered 3D Avatar Generation and Customization via Gaussian Splatting Jia Gong, Shenyu Ji, Lin Geng Foo, Kang Chen, Hossein Rahmani, Jun Liu Creating and customizing a 3D clothed avatar from textual descriptions is a critical and challenging task. Traditional methods often treat the human body and clothing as inseparable, limiting users' ability to freely mix and match garments. In response to this limitation, we present LAyered Gaussian Avatar (LAGA), a carefully designed framework enabling the creation of high-fidelity decomposable avatars with diverse garments. By decoupling garments from the avatar, our framework empowers users to conveniently edit avatars at the garment level. Our approach begins by modeling the avatar using a set of Gaussian points organized in a layered structure, where each layer corresponds to a specific garment or the human body itself. To generate high-quality garments for each layer, we introduce a coarse-to-fine strategy for diverse garment generation and a novel dual-SDS loss function to maintain coherence between the generated garments and avatar components, including the human body and other garments. Moreover, we introduce three regularization losses to guide the movement of Gaussians for garment transfer, allowing garments to be freely transferred to various avatars. Extensive experimentation demonstrates that our approach surpasses existing methods in the generation of 3D clothed humans. This paper introduces LAGA, a novel framework for generating layered 3D avatars with diverse, interchangeable garments based on Gaussian Splatting. Existing 3D avatar generation methods often lack the ability to decompose garments from the avatar itself, limiting customization options. LAGA employs a layered structure, representing the body and each garment as separate layers of Gaussian points. It utilizes a coarse-to-fine strategy for diverse garment generation and a dual-SDS loss function for maintaining coherence between different layers. Furthermore, it introduces three regularization losses to enable garment transfer between avatars with different body shapes. LAGA generates high-quality 3D avatars with realistic textures and detailed features. The layered structure enables convenient decomposition and customization of garments. LAGA outperforms existing methods in qualitative and quantitative comparisons, demonstrating superior visual fidelity and text alignment. LAGA currently relies on a pre-trained 2D human skeleton conditioned diffusion model, limiting its generalization ability to unseen poses. The garment transfer method could be further improved to handle extreme body shape variations. 3d avatar generation, gaussian splatting, decomposable avatars, garment transfer, text-to-3d
2405.12661 Report EmoEdit: Evoking Emotions through Image Manipulation Jingyuan Yang, Jiawei Feng, Weibin Luo, Dani Lischinski, Daniel Cohen-Or, Hui Huang Affective Image Manipulation (AIM) seeks to modify user-provided images to evoke specific emotional responses. This task is inherently complex due to its twofold objective: significantly evoking the intended emotion, while preserving the original image composition. Existing AIM methods primarily adjust color and style, often failing to elicit precise and profound emotional shifts. Drawing on psychological insights, we extend AIM by incorporating content modifications to enhance emotional impact. We introduce EmoEdit, a novel two-stage framework comprising emotion attribution and image editing. In the emotion attribution stage, we leverage a Vision-Language Model (VLM) to create hierarchies of semantic factors that represent abstract emotions. In the image editing stage, the VLM identifies the most relevant factors for the provided image, and guides a generative editing model to perform affective modifications. A ranking technique that we developed selects the best edit, balancing between emotion fidelity and structure integrity. To validate EmoEdit, we assembled a dataset of 416 images, categorized into positive, negative, and neutral classes. Our method is evaluated both qualitatively and quantitatively, demonstrating superior performance compared to existing state-of-the-art techniques. Additionally, we showcase EmoEdit's potential in various manipulation tasks, including emotion-oriented and semantics-oriented editing. EmoEdit, a novel two-stage framework for Affective Image Manipulation (AIM), modifies image content and color to evoke specific emotions while preserving original structure. Existing AIM methods, limited to color and style adjustments, struggle to evoke precise emotions. EmoEdit addresses this by incorporating content modification based on psychological insights. EmoEdit utilizes emotion factor trees built from EmoSet to map emotions to visual elements. It employs GPT-4V for factor selection and instruction generation, IP2P for editing, and a ranking technique to select the optimal result. EmoEdit outperforms state-of-the-art methods in emotion fidelity, structure preservation, and user preference. Content and color modification, along with ranking, are crucial for EmoEdit's effectiveness. EmoEdit enables diverse editing across eight emotion categories and various manipulation levels. The emotion factor tree's reliance on EmoSet might introduce bias and limitations. Fixed filtering and ranking in EmoEdit could be enhanced by incorporating user interaction. affective image manipulation, emotion elicitation, image editing, content modification, vision-language model
2405.12540 Report Context-Enhanced Video Moment Retrieval with Large Language Models Weijia Liu, Bo Miao, Jiuxin Cao, Xuelin Zhu, Bo Liu, Mehwish Nasim, Ajmal Mian Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives. To tackle this issue, we propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation as well as cross-modal alignment, facilitating accurate localization of target moments. Specifically, LMR introduces a context enhancement technique with LLMs to generate crucial target-related context semantics. These semantics are integrated with visual features for producing discriminative video representations. Finally, a language-conditioned transformer is designed to decode free-form language queries, on the fly, using aligned video representations for moment retrieval. Extensive experiments demonstrate that LMR achieves state-of-the-art results, outperforming the nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights and Charades-STA benchmarks, respectively. More importantly, the performance gains are significantly higher for localization of complex queries. This paper presents LMR, a novel Video Moment Retrieval (VMR) approach that leverages the power of Large Language Models (LLMs) to enhance video context modeling and improve the accuracy of retrieving specific moments from videos based on complex textual queries. Existing VMR methods struggle to accurately localize target moments described by intricate queries involving environmental details, character descriptions, and action narratives. LMR addresses this challenge by integrating LLM-derived contextual information, enabling more precise alignment between videos and complex language queries. LMR employs an LLM to generate target-related textual descriptions for video clips, enriching their contextual representation. These descriptions, along with visual features, are processed by a language-conditioned transformer to decode free-form language queries and localize the target moment. LMR achieves state-of-the-art results on the QVHighlights and Charades-STA benchmarks, outperforming existing methods. The approach demonstrates significant performance gains on complex queries, highlighting its ability to handle intricate contextual requirements. Ablation studies validate the contributions of individual components, demonstrating the importance of LLM-based context enhancement and language-conditioned decoding. The current implementation relies on offline LLM processing for generating video descriptions, which could be integrated into an end-to-end trainable framework in future work. Exploring alternative LLM architectures and prompting strategies for generating even richer and more informative video descriptions could further improve performance. video moment retrieval, large language models, multimodal alignment, context enhancement, language-conditioned transformer
2405.12531 Report CustomText: Customized Textual Image Generation using Diffusion Models Shubham Paliwal, Arushi Jain, Monika Sharma, Vikram Jamwal, Lovekesh Vig Textual image generation spans diverse fields like advertising, education, product packaging, social media, information visualization, and branding. Despite recent strides in language-guided image synthesis using diffusion models, current models excel in image generation but struggle with accurate text rendering and offer limited control over font attributes. In this paper, we aim to enhance the synthesis of high-quality images with precise text customization, thereby contributing to the advancement of image generation models. We call our proposed method CustomText. Our implementation leverages a pre-trained TextDiffuser model to enable control over font color, background, and types. Additionally, to address the challenge of accurately rendering small-sized fonts, we train the ControlNet model for a consistency decoder, significantly enhancing text-generation performance. We assess the performance of CustomText in comparison to previous methods of textual image generation on the publicly available CTW-1500 dataset and a self-curated dataset for small-text generation, showcasing superior results. This paper introduces CustomText, a novel method leveraging diffusion models to generate images with customized text, offering control over font attributes like type, color, size, and background for seamless integration into diverse layouts. Existing text-to-image synthesis methods struggle with accurate text rendering and lack control over font attributes, limiting their use in applications like advertising, education, and product packaging where customized text is crucial. CustomText utilizes a two-stage pipeline: first generating character and conditional masks to define text position and attributes, then using a modified TextDiffuser model with a ControlNet-based consistency decoder for enhanced small-font generation. CustomText demonstrates superior control over text attributes, enabling customization of font type, color, size, and background. The ControlNet-based consistency decoder significantly improves the generation of small-sized fonts compared to previous methods. Quantitative evaluations using MSE, PSNR, SSIM, and OCR performance on CTW-1500 and a custom SmallFontSize dataset confirm the effectiveness of CustomText. The current system only supports Latin alphabets, limiting its applicability to other languages. The training dataset size for the decoder enhance model is limited, potentially hindering performance. Future work involves using a larger dataset for further improvement. text-to-image synthesis, diffusion models, font customization, text rendering, controlnet
2405.12523 Report Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models Jiaqi Li, Qianshan Wei, Chuanyi Zhang, Guilin Qi, Miaozeng Du, Yongrui Chen, Sheng Bi Machine unlearning (MU) empowers individuals with the 'right to be forgotten' by removing their private or sensitive information encoded in machine learning models. However, it remains uncertain whether MU can be effectively applied to Multimodal Large Language Models (MLLMs), particularly in scenarios of forgetting the leaked visual data of concepts. To overcome the challenge, we propose an efficient method, Single Image Unlearning (SIU), to unlearn the visual recognition of a concept by fine-tuning a single associated image for a few steps. SIU consists of two key aspects: (i) Constructing Multifaceted fine-tuning data. We introduce four targets, based on which we construct fine-tuning data for the concepts to be forgotten; (ii) Jointly training loss. To synchronously forget the visual recognition of concepts and preserve the utility of MLLMs, we fine-tune MLLMs through a novel Dual Masked KL-divergence Loss combined with Cross Entropy loss. Alongside our method, we establish MMUBench, a new benchmark for MU in MLLMs, and introduce a collection of metrics for its evaluation. Experimental results on MMUBench show that SIU completely surpasses the performance of existing methods. Furthermore, we surprisingly find that SIU can avoid invasive membership inference attacks and jailbreak attacks. To the best of our knowledge, we are the first to explore MU in MLLMs. We will release the code and benchmark in the near future. This paper explores machine unlearning (MU) in Multimodal Large Language Models (MLLMs), focusing on forgetting the visual recognition of concepts, and introduces a new method called Single Image Unlearning (SIU). This is important because existing MU methods for LLMs may not be transferable to MLLMs, especially when dealing with limited training data and potential model degradation when forgetting visual concepts. SIU uses a single image of a target concept to unlearn its visual recognition. It employs Multifaceted Fine-tuning Data based on four targets (aligning with unseen concepts, assigning new visual descriptions, decoupling factual knowledge, and preserving non-targeted knowledge) and a Dual Masked KL-divergence (DMK) Loss jointly trained with cross-entropy loss to refine the unlearning process and preserve model utility. SIU outperforms existing methods (PO, GA, GA+KL) on the proposed MMUBench benchmark in terms of efficacy, generality, specificity, fluency, and diversity. SIU demonstrates robustness against membership inference attacks and jailbreak attacks. The research reveals a 'positive butterfly effect' where unlearning a concept can lead to the selective retention of related knowledge, suggesting a nuanced restructuring of knowledge within the model. The study primarily focuses on the LLAVA model, potentially limiting the generalizability of the findings to other MLLMs. Future work will explore new MU methods in MLLMs and evaluate unlearning for specific data points rather than concept-wise knowledge. machine unlearning, multimodal large language models, visual recognition, benchmarking, privacy
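One plausible, simplified reading of the "jointly training loss" described above (the actual dual-masking scheme is more involved): a cross-entropy term pushes the model toward the new forgetting targets, while a KL term, restricted by a preservation mask, keeps the fine-tuned model close to the frozen original on tokens whose behavior should be retained. All shapes and the weighting below are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_unlearning_loss(logits_new, logits_orig, target_ids, preserve_mask, kl_weight=1.0):
    """Toy combination of cross-entropy (push toward new 'forgetting' targets)
    and masked KL (stay close to the frozen original model on preserved tokens).
    logits_*: (B, T, V); target_ids: (B, T); preserve_mask: (B, T) in {0, 1}."""
    ce = F.cross_entropy(logits_new.flatten(0, 1), target_ids.flatten())
    log_p_new = F.log_softmax(logits_new, dim=-1)
    p_orig = F.softmax(logits_orig, dim=-1)
    kl_tok = (p_orig * (p_orig.clamp_min(1e-8).log() - log_p_new)).sum(-1)   # (B, T)
    kl = (kl_tok * preserve_mask).sum() / preserve_mask.sum().clamp_min(1.0)
    return ce + kl_weight * kl

B, T, V = 2, 6, 100
loss = joint_unlearning_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                             torch.randint(0, V, (B, T)), torch.ones(B, T))
print(float(loss))
```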
2405.12490 Report Customize Your Own Paired Data via Few-shot Way Jinshu Chen, Bingchuan Li, Miao Hua, Panpan Xu, Qian He Existing solutions to image editing tasks suffer from several issues. Though achieving remarkably satisfying generated results, some supervised methods require huge amounts of paired training data, which greatly limits their usage. Other unsupervised methods take full advantage of large-scale pre-trained priors and are thus strictly restricted to the domains the priors were trained on, performing poorly in out-of-distribution cases. The task we focus on is how to enable users to customize their desired effects through only a few image pairs. In our proposed framework, a novel few-shot learning mechanism based on the directional transformations among samples is introduced and expands the learnable space exponentially. Adopting a diffusion model pipeline, we redesign the condition calculating modules in our model and apply several technical improvements. Experimental results demonstrate the capabilities of our method in various cases. This paper proposes a novel few-shot image editing framework allowing users to customize image editing effects with only a few image pairs. Existing methods either require large paired datasets or rely heavily on pre-trained models, limiting their flexibility and applicability to new editing tasks. The method utilizes a novel "n-source-to-n-target" learning mechanism, expanding the dataset by training on directional transformations within sample pairs. It adopts a diffusion model pipeline with redesigned condition injection modules, incorporating pixel-level transformations as conditions, and employs technical improvements like adaptive noise and skip connections for enhanced generation quality. The framework achieves comparable performance to existing paired-data methods with only 1% of the training data. It avoids disentanglement issues present in latent space editing methods, preserving areas outside the editing target. The framework is not limited by pre-trained priors, enabling the creation of new editing effects beyond existing datasets. The paper acknowledges limitations in handling high-resolution images due to computational constraints. Future work includes exploring the application of the framework to other image editing tasks beyond those presented. image editing, few-shot learning, diffusion models, customization, paired data
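A speculative toy of the "n-source-to-n-target" expansion mentioned in this entry, read as enumerating all directed source-to-target combinations so that n user pairs yield n*n training directions; the real framework conditions on pixel-level transformations rather than raw file pairs, so treat this purely as an illustration.

```python
from itertools import product

def expand_pairs(pairs):
    """Given n (source, target) examples, enumerate all n*n directed
    source->target combinations, so a handful of user-provided pairs yields
    many more training directions (a simplified reading of the mechanism)."""
    sources = [s for s, _ in pairs]
    targets = [t for _, t in pairs]
    return list(product(sources, targets))

few_shot = [("src_0.png", "tgt_0.png"), ("src_1.png", "tgt_1.png"), ("src_2.png", "tgt_2.png")]
expanded = expand_pairs(few_shot)
print(len(few_shot), "->", len(expanded))  # 3 -> 9 training directions
```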
2405.12399 Report Diffusion for World Modeling: Visual Details Matter in Atari Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret World models constitute a promising approach for training reinforcement learning agents in a safe and sample-efficient manner. Recent world models predominantly operate on sequences of discrete latent variables to model environment dynamics. However, this compression into a compact discrete representation may ignore visual details that are important for reinforcement learning. Concurrently, diffusion models have become a dominant approach for image generation, challenging well-established methods modeling discrete latents. Motivated by this paradigm shift, we introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model. We analyze the key design choices that are required to make diffusion suitable for world modeling, and demonstrate how improved visual details can lead to improved agent performance. DIAMOND achieves a mean human normalized score of 1.46 on the competitive Atari 100k benchmark; a new best for agents trained entirely within a world model. To foster future research on diffusion for world modeling, we release our code, agents and playable world models at https://github.com/eloialonso/diamond. Introduces DIAMOND, a reinforcement learning agent trained within a diffusion world model for improved sample efficiency and visual fidelity. Addresses limitations of discrete latent-based world models, which can lose visual details crucial for complex tasks, by leveraging the strengths of diffusion models in high-fidelity image generation. Implements a diffusion model conditioned on past observations and actions to predict future observations, employing EDM over DDPM for stability with fewer denoising steps. Trains an actor-critic RL agent within this imagined environment. Achieves state-of-the-art mean human-normalized score (1.46) on the Atari 100k benchmark among world model agents. Demonstrates greater stability over longer time horizons compared to DDPM-based world models. Generates visually consistent and higher-quality imagined trajectories compared to discrete latent-based models like IRIS. Evaluation primarily focuses on discrete control environments (Atari), with limited exploration of continuous control tasks. Relies on simple frame stacking for observation history, potentially limiting long-term memory and scalability compared to transformer-based architectures. world models, diffusion models, reinforcement learning, atari, generative vision models
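A toy of the conditioning interface of a diffusion world model as described above: the denoiser sees the noisy next frame, a stack of past frames, and the last action. The layers, sizes, and action vocabulary below are placeholders, not the paper's EDM-based network (noise-level conditioning is omitted for brevity).

```python
import torch
import torch.nn as nn

class ToyFrameDenoiser(nn.Module):
    """Denoise the next frame given stacked past frames and the last action.
    Toy sizes and layers; not the paper's architecture."""
    def __init__(self, frame_ch=3, history=4, n_actions=18, hidden=64):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, hidden)
        self.inp = nn.Conv2d(frame_ch * (1 + history), hidden, 3, padding=1)
        self.body = nn.Sequential(nn.SiLU(),
                                  nn.Conv2d(hidden, hidden, 3, padding=1),
                                  nn.SiLU())
        self.out = nn.Conv2d(hidden, frame_ch, 3, padding=1)

    def forward(self, noisy_next, past_frames, action):
        # noisy_next: (B, 3, H, W); past_frames: (B, history*3, H, W); action: (B,)
        h = self.inp(torch.cat([noisy_next, past_frames], dim=1))
        h = h + self.action_emb(action)[:, :, None, None]   # broadcast action conditioning
        return self.out(self.body(h))

model = ToyFrameDenoiser()
pred = model(torch.randn(1, 3, 64, 64), torch.randn(1, 12, 64, 64), torch.tensor([3]))
print(pred.shape)  # torch.Size([1, 3, 64, 64])
```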
2405.12369 Report AtomGS: Atomizing Gaussian Splatting for High-Fidelity Radiance Field Rong Liu, Rui Xu, Yue Hu, Meida Chen, Andrew Feng 3D Gaussian Splatting (3DGS) has recently advanced radiance field reconstruction by offering superior capabilities for novel view synthesis and real-time rendering speed. However, its strategy of blending optimization and adaptive density control might lead to sub-optimal results; it can sometimes yield noisy geometry and blurry artifacts due to prioritizing optimizing large Gaussians at the cost of adequately densifying smaller ones. To address this, we introduce AtomGS, consisting of Atomized Proliferation and Geometry-Guided Optimization. The Atomized Proliferation constrains ellipsoid Gaussians of various sizes into more uniform-sized Atom Gaussians. The strategy enhances the representation of areas with fine features by placing greater emphasis on densification in accordance with scene details. In addition, we proposed a Geometry-Guided Optimization approach that incorporates an Edge-Aware Normal Loss. This optimization method effectively smooths flat surfaces while preserving intricate details. Our evaluation shows that AtomGS outperforms existing state-of-the-art methods in rendering quality. Additionally, it achieves competitive accuracy in geometry reconstruction and offers a significant improvement in training speed over other SDF-based methods. More interactive demos can be found in our website (https://rongliu-leo.github.io/AtomGS/). AtomGS, a novel approach for radiance field reconstruction, enhances 3D Gaussian Splatting by emphasizing uniform densification through Atomized Proliferation and refining surface details via Geometry-Guided Optimization. Existing 3DGS methods often prioritize optimizing large Gaussians over densifying smaller ones, leading to noisy geometry and blurry artifacts, especially in areas with fine details. This work addresses these limitations by improving the alignment of Gaussians with the underlying scene geometry. AtomGS introduces two key components: (1) Atomized Proliferation, which constrains smaller Gaussians into uniformly-sized Atom Gaussians to prioritize densification in detail-rich areas, and (2) Geometry-Guided Optimization, incorporating an Edge-Aware Normal Loss to smooth flat surfaces while preserving intricate details. AtomGS outperforms state-of-the-art methods in rendering quality on Mip-NeRF360 and Tanks & Temples datasets. It achieves competitive accuracy in geometry reconstruction on the DTU dataset, surpassing other explicit methods and rivaling implicit SDF-based methods. AtomGS demonstrates significant improvement in training speed compared to SDF-based methods. AtomGS might struggle with highly specular or semi-transparent materials. The current pruning strategy could be further improved to achieve a more compact representation, especially in highly complex environments. radiance field reconstruction, 3d gaussian splatting, novel view synthesis, geometry-guided optimization, atomized proliferation
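One common way to write an edge-aware smoothness term like the Edge-Aware Normal Loss mentioned above: penalize variation in the rendered normal map, down-weighted where the RGB image has strong gradients so real edges keep their detail. The exponential image-gradient weighting is an assumption of this sketch, not necessarily the paper's exact form.

```python
import torch

def edge_aware_normal_loss(normals, image):
    """normals: (3, H, W) rendered normal map; image: (3, H, W) RGB.
    Penalize normal variation, attenuated at image edges (assumed
    exp(-|image gradient|) weighting)."""
    dn_x = (normals[:, :, 1:] - normals[:, :, :-1]).abs().mean(0)   # (H, W-1)
    dn_y = (normals[:, 1:, :] - normals[:, :-1, :]).abs().mean(0)   # (H-1, W)
    di_x = (image[:, :, 1:] - image[:, :, :-1]).abs().mean(0)
    di_y = (image[:, 1:, :] - image[:, :-1, :]).abs().mean(0)
    return (dn_x * torch.exp(-di_x)).mean() + (dn_y * torch.exp(-di_y)).mean()

print(float(edge_aware_normal_loss(torch.rand(3, 32, 32), torch.rand(3, 32, 32))))
```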
2405.12218 Report Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes. Specifically, 1) we leverage MVS to encode geometry-aware Gaussian representations and decode them into Gaussian parameters. 2) To further enhance performance, we propose a hybrid Gaussian rendering that integrates an efficient volume rendering design for novel view synthesis. 3) To support fast fine-tuning for specific scenes, we introduce a multi-view geometric consistent aggregation strategy to effectively aggregate the point clouds generated by the generalizable model, serving as the initialization for per-scene optimization. Compared with previous generalizable NeRF-based methods, which typically require minutes of fine-tuning and seconds of rendering per image, MVSGaussian achieves real-time rendering with better synthesis quality for each scene. Compared with the vanilla 3D-GS, MVSGaussian achieves better view synthesis with less training computational cost. Extensive experiments on DTU, Real Forward-facing, NeRF Synthetic, and Tanks and Temples datasets validate that MVSGaussian attains state-of-the-art performance with convincing generalizability, real-time rendering speed, and fast per-scene optimization. MVSGaussian, a novel generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS), enables efficient reconstruction of unseen scenes. Existing generalizable Gaussian Splatting methods are inefficient, limited to object-centric reconstruction, and restricted in input types. This work addresses these limitations by proposing an efficient framework for novel view synthesis in unseen general scenes. The method leverages MVS for geometry reasoning and feature encoding, establishing a pixel-aligned Gaussian representation. It then employs a hybrid Gaussian rendering approach, integrating depth-aware volume rendering for enhanced generalization. For per-scene optimization, a multi-view geometric consistent aggregation strategy provides high-quality initialization. MVSGaussian outperforms other generalizable methods in terms of rendering quality and speed. It achieves comparable or even superior performance to state-of-the-art methods after a short per-scene optimization. The method enables real-time rendering with faster optimization compared to existing generalizable NeRFs and vanilla 3D-GS. The reliance on MVS for depth estimation can lead to decreased accuracy in areas with weak textures or specular reflections. Future work may explore improving depth estimation accuracy in challenging regions. generalizable gaussian splatting, multi-view stereo, neural radiance field, novel view synthesis, real-time rendering
2405.12200 Report Multi-View Attentive Contextualization for Multi-View 3D Object Detection Xianpeng Liu, Ce Zheng, Ming Qian, Nan Xue, Chen Chen, Zhebin Zhang, Chen Li, Tianfu Wu We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress witnessed in the field of query-based MV3D object detection, prior art often suffers either from failing to exploit high-resolution 2D features in dense attention-based lifting, due to high computational costs, or from insufficiently dense grounding of 3D queries to multi-scale 2D features in sparse attention-based lifting. Our proposed MvACon addresses both issues with a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to specific 2D-to-3D feature lifting approaches. In experiments, the proposed MvACon is thoroughly tested on the nuScenes benchmark, using both the BEVFormer and its recent 3D deformable attention (DFA3D) variant, as well as the PETR, showing consistent detection performance improvement, especially in enhancing performance in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer with similar improvement. We qualitatively and quantitatively show that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of our proposed MvACon reinforce the adage in computer vision that "(contextualized) feature matters". This paper introduces Multi-View Attentive Contextualization (MvACon), a plug-and-play module designed to enhance 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Existing MV3D detectors suffer from limitations in effectively capturing 3D information during feature lifting. MvACon addresses these limitations by incorporating global and semantically meaningful 3D awareness. MvACon utilizes a cluster-attention mechanism, adapted from PaCa (Patch-to-Cluster attention), to contextualize 2D features. It expands the traditional three-component MV3D detection pipeline to a four-component setup by adding an attentive contextualization stage. MvACon consistently improves the performance of various query-based MV3D detectors, including PETR and BEVFormer, on both NuScenes and Waymo datasets. It significantly enhances localization, orientation, and velocity prediction in these detectors. Qualitative analysis shows that MvACon learns stable and semantically meaningful representations of the scene, contributing to its improved performance. The computational cost of MvACon in the full model might be high. Future work includes exploring alternative clustering techniques for improved efficiency. 3d object detection, multi-view vision, attentive contextualization, feature lifting, autonomous driving
2405.12155 Report Embracing Radiance Field Rendering in 6G: Over-the-Air Training and Inference with 3D Contents Guanlin Wu, Zhonghao Lyu, Juyong Zhang, Jie Xu The efficient representation, transmission, and reconstruction of three-dimensional (3D) contents are becoming increasingly important for sixth-generation (6G) networks that aim to merge virtual and physical worlds for offering immersive communication experiences. Neural radiance field (NeRF) and 3D Gaussian splatting (3D-GS) have recently emerged as two promising 3D representation techniques based on radiance field rendering, which are able to provide photorealistic rendering results for complex scenes. Therefore, embracing NeRF and 3D-GS in 6G networks is envisioned to be a prominent solution to support emerging 3D applications with enhanced quality of experience. This paper provides a comprehensive overview on the integration of NeRF and 3D-GS in 6G. First, we review the basics of the radiance field rendering techniques, and highlight their applications and implementation challenges over wireless networks. Next, we consider the over-the-air training of NeRF and 3D-GS models over wireless networks by presenting various learning techniques. We particularly focus on the federated learning design over a hierarchical device-edge-cloud architecture. Then, we discuss three practical rendering architectures of NeRF and 3D-GS models at wireless network edge. We provide model compression approaches to facilitate the transmission of radiance field models, and present rendering acceleration approaches and joint computation and communication designs to enhance the rendering efficiency. In particular, we propose a new semantic communication enabled 3D content transmission design, in which the radiance field models are exploited as the semantic knowledge base to reduce the communication overhead for distributed inference. Furthermore, we present the utilization of radiance field rendering in wireless applications like radio mapping and radio imaging. This paper provides a comprehensive overview of integrating Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3D-GS) rendering techniques into 6G networks for immersive communication experiences. NeRF and 3D-GS are revolutionary for representing and transmitting 3D content, crucial for immersive 6G applications like XR and telepresence. The paper explores various aspects, including centralized/distributed learning for NeRF/3D-GS, a hierarchical device-edge-cloud architecture for federated learning, model compression/acceleration, joint computation/communication design, and semantic communication for efficient rendering. Federated learning over a hierarchical architecture enables efficient training of large-scale scene radiance fields. Model compression and algorithmic acceleration techniques enhance the transmission and rendering efficiency. The proposed semantic communication framework for 3D content transmission, using NeRF as a semantic knowledge base, significantly reduces communication overhead. The paper mainly focuses on the technical feasibility of integrating NeRF/3D-GS in 6G, without delving into specific protocol design or standardization aspects. Future work could investigate asynchronous federated learning, generalizable models, and the use of over-the-air computation for efficient model aggregation. 6g, immersive communications, neural radiance field (nerf), 3d gaussian splatting (3d-gs), federated learning
2405.12110 Report CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization Jiawei Zhang, Jiahe Li, Xiaohan Yu, Lei Huang, Lin Gu, Jin Zheng, Xiao Bai 3D Gaussian Splatting (3DGS) creates a radiance field consisting of 3D Gaussians to represent a scene. With sparse training views, 3DGS easily suffers from overfitting, negatively impacting the reconstruction quality. This paper introduces a new co-regularization perspective for improving sparse-view 3DGS. When training two 3D Gaussian radiance fields with the same sparse views of a scene, we observe that the two radiance fields exhibit point disagreement and rendering disagreement that can unsupervisedly predict reconstruction quality, stemming from the sampling implementation in densification. We further quantify the point disagreement and rendering disagreement by evaluating the registration between Gaussians' point representations and calculating differences in their rendered pixels. The empirical study demonstrates the negative correlation between the two disagreements and accurate reconstruction, which allows us to identify inaccurate reconstruction without accessing ground-truth information. Based on the study, we propose CoR-GS, which identifies and suppresses inaccurate reconstruction based on the two disagreements: (i) Co-pruning considers Gaussians that exhibit high point disagreement to be in inaccurate positions and prunes them. (ii) Pseudo-view co-regularization considers pixels that exhibit high rendering disagreement to be inaccurately rendered and suppresses the disagreement. Results on LLFF, Mip-NeRF360, DTU, and Blender demonstrate that CoR-GS effectively regularizes the scene geometry, reconstructs compact representations, and achieves state-of-the-art novel view synthesis quality under sparse training views. This paper investigates the behavior disagreement between two 3D Gaussian Radiance Fields (3DGRFs) trained on the same scene with sparse views, and proposes a novel co-regularization method, CoR-GS, to improve sparse-view 3D Gaussian Splatting. 3D Gaussian Splatting (3DGS) suffers from overfitting with sparse training views, leading to degraded novel view synthesis quality. This work provides a new perspective on regularizing sparse-view 3DGS by leveraging the disagreement between different 3DGRFs. The authors simultaneously train two 3DGRFs with the same sparse views. They introduce "point disagreement" and "rendering disagreement" to quantify the differences between Gaussian positions and rendered results of the two fields. They then propose co-pruning to suppress point disagreement and pseudo-view co-regularization to suppress rendering disagreement. Two 3DGRFs trained with the same sparse views exhibit significant point and rendering disagreements, particularly during densification. The disagreements are negatively correlated with accurate scene reconstruction, providing an unsupervised way to identify inaccurate reconstruction. CoR-GS effectively suppresses the disagreements, reconstructing more compact geometry representations and achieving state-of-the-art novel view synthesis quality on multiple benchmarks. Color co-regularization implicitly handles depth information, making explicit depth co-regularization less effective. More advanced co-regularization strategies could further improve the performance, particularly in handling complex scenes. 3d gaussian splatting, radiance fields, novel view synthesis, sparse view reconstruction, co-regularization
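The co-regularization idea above reduced to its simplest form: render the same pseudo view from two independently trained fields, measure the per-pixel disagreement, and add it to both training losses so the fields are pulled toward agreement. The renders below are random stand-ins for the outputs of two 3D Gaussian radiance fields.

```python
import torch

def rendering_disagreement(render_a, render_b):
    """Mean absolute per-pixel difference between the two fields' renders of
    the same pseudo view (a simple stand-in for the paper's measure)."""
    return (render_a - render_b).abs().mean()

# Stand-ins for two 3D Gaussian radiance fields rendering the same pseudo view.
render_a = torch.rand(3, 64, 64, requires_grad=True)
render_b = torch.rand(3, 64, 64, requires_grad=True)

loss = rendering_disagreement(render_a, render_b)   # added to each field's training loss
loss.backward()
print(float(loss))
```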
2405.12107 Report Imp: Highly Capable Large Multimodal Models for Mobile Devices Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, Jiajun Ding By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp -- a family of highly capable LMMs at the 2B-4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s. This paper introduces Imp, a family of lightweight Large Multimodal Models (LMMs) at the 2B/3B/4B parameter scales, demonstrating that carefully designed lightweight LMMs can achieve competitive performance compared to larger counterparts. Building lightweight LMMs is crucial for enabling wider access to this technology for researchers with limited resources and for deployment on resource-constrained devices like PCs and mobile phones. The authors systematically explore the design space of lightweight LMMs, investigating the impact of model architecture (LLM and visual encoder choices), training strategy (fine-tuning mechanism and the number of training epochs), and augmented training data (OCR, chart-oriented, and GPT4V-annotated) on model performance. Imp-3B significantly outperforms existing open-source lightweight LMMs of similar size and achieves comparable performance to state-of-the-art 13B LMMs on various benchmarks. The study highlights the importance of high-quality training data for lightweight LMMs, showing that quality often outweighs quantity in this context. Imp models can be effectively deployed on mobile devices, particularly Imp-3B@196 with 4-bit quantization, which balances a small model size with low latency and strong capabilities. The model currently only supports English inputs and requires further development for multilingual capabilities. Future work will focus on improving performance in specific tasks like OCR and object counting, incorporating more efficient training and compression techniques, and expanding to other modalities such as audio and 3D. large multimodal models, lightweight models, vision-language models, model efficiency, mobile deployment
2405.12069 Report Gaussian Head & Shoulders: High Fidelity Neural Upper Body Avatars with Anchor Gaussian Guided Texture Warping Tianhao Wu, Jing Yang, Zhilin Guo, Jingyi Wan, Fangcheng Zhong, Cengiz Oztireli By equipping the most recent 3D Gaussian Splatting representation with head 3D morphable models (3DMM), existing methods manage to create head avatars with high fidelity. However, most existing methods only reconstruct a head without the body, substantially limiting their application scenarios. We found that naively applying Gaussians to model the clothed chest and shoulders tends to result in blurry reconstruction and noisy floaters under novel poses. This is because of the fundamental limitation of Gaussians and point clouds: each Gaussian or point can only have a single directional radiance without spatial variance, so an unnecessarily large number of them is required to represent complicated spatially varying texture, even for simple geometry. In contrast, we propose to model the body part with a neural texture that consists of coarse and pose-dependent fine colors. To properly render the body texture for each view and pose without accurate geometry or UV mapping, we optimize another sparse set of Gaussians as anchors that constrain the neural warping field that maps image plane coordinates to the texture space. We demonstrate that Gaussian Head & Shoulders can fit the high-frequency details on the clothed upper body with high fidelity and potentially improve the accuracy and fidelity of the head region. We evaluate our method on casual phone-captured and internet videos and show that our method achieves superior reconstruction quality and robustness in both self- and cross-reenactment tasks. To fully utilize the efficient rendering speed of Gaussian splatting, we additionally propose an accelerated inference method for our trained model without Multi-Layer Perceptron (MLP) queries, reaching a stable rendering speed of around 130 FPS for any subject. This paper introduces "Gaussian Head & Shoulders", a method for reconstructing high-fidelity, animatable upper body avatars from monocular videos using Gaussian Splatting for the head and a learned texture map guided by anchor Gaussians for the body. Existing methods struggle to realistically capture the complex textures and deformations of clothed upper bodies, limiting their use in immersive applications. The method combines 3D Gaussian Splatting with a neural texture map. Sparse anchor Gaussians, driven by a head 3DMM, constrain a neural warping field that maps image pixels to the texture space, enabling high-frequency detail rendering. An accelerated inference method bypasses MLP queries for real-time performance. Outperforms baselines in self-reenactment tasks, achieving higher fidelity and robustness, especially for subjects with intricate clothing. Demonstrates improved expression control compared to pure Gaussian Splatting methods due to the focused modeling of the head region. Achieves a rendering speed of around 130 FPS with the accelerated inference method, surpassing pure Gaussian Splatting for subjects with complex clothing. The method cannot model avatars with extreme body rotations that lead to self-occlusion. The accelerated inference relies on rigid transformations and may not capture non-rigid body deformations accurately. neural avatars, gaussian splatting, texture mapping, 3d reconstruction, monocular video
2405.11921 Report MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections Jiayue Liu, Xiao Tang, Freeman Cheng, Roy Yang, Zhihao Li, Jianzhuang Liu, Yi Huang, Jiaqi Lin, Shiyong Liu, Xiaofei Wu, Songcen Xu, Chun Yuan 3D Gaussian Splatting showcases notable advancements in photo-realistic and real-time novel view synthesis. However, it faces challenges in modeling mirror reflections, which exhibit substantial appearance variations from different viewpoints. To tackle this problem, we present MirrorGaussian, the first method for mirror scene reconstruction with real-time rendering based on 3D Gaussian Splatting. The key insight is grounded on the mirror symmetry between the real-world space and the virtual mirror space. We introduce an intuitive dual-rendering strategy that enables differentiable rasterization of both the real-world 3D Gaussians and the mirrored counterpart obtained by reflecting the former about the mirror plane. All 3D Gaussians are jointly optimized with the mirror plane in an end-to-end framework. MirrorGaussian achieves high-quality and real-time rendering in scenes with mirrors, empowering scene editing like adding new mirrors and objects. Comprehensive experiments on multiple datasets demonstrate that our approach significantly outperforms existing methods, achieving state-of-the-art results. Project page: https://mirror-gaussian.github.io/. MirrorGaussian is the first method to achieve high-fidelity reconstruction and real-time rendering of scenes containing mirrors using 3D Gaussian Splatting. Existing NVS methods struggle with reconstructing mirror reflections due to their high specularity and viewpoint variation, which are difficult to model with MLPs or SH functions. NeRF-based solutions are computationally expensive, hindering interactive applications. MirrorGaussian leverages the mirror symmetry between the real world and virtual mirror space. It uses a dual-rendering strategy: 1) rendering the real-world scene from 3D Gaussians, 2) rendering the mirror image by reflecting the 3D Gaussians across an estimated and optimized mirror plane. A mirror label is introduced to enable differentiable mirror mask generation from arbitrary viewpoints. MirrorGaussian significantly outperforms existing NeRF-based methods in terms of both rendering quality and speed, achieving state-of-the-art results. It enables real-time novel view synthesis at high resolution, thanks to efficient point-based rasterization. The explicit point cloud representation allows for scene editing, such as adding new objects and mirrors. MirrorGaussian requires mirror segmentation on input images for mirror plane and mask estimation. The current dual-rendering strategy slightly decreases rendering speed, which can be further optimized. novel view synthesis, mirror reflections, 3d gaussian splatting, real-time rendering, scene editing
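The dual-rendering idea above hinges on reflecting every real-world Gaussian about the estimated mirror plane. Below is a minimal sketch of that reflection step, assuming plane parameters (n, d) and per-Gaussian rotation matrices as hypothetical inputs; it is illustrative only, not the authors' code.

```python
import numpy as np

def reflect_gaussians(means, rot_mats, plane_n, plane_d):
    """Reflect 3D Gaussian centers and orientations about the mirror plane
    {x : n^T x + d = 0} using a Householder matrix. Scales are unchanged;
    covariances Sigma = R S S^T R^T transform as H Sigma H^T."""
    n = plane_n / np.linalg.norm(plane_n)
    H = np.eye(3) - 2.0 * np.outer(n, n)             # Householder reflection
    means_ref = means @ H.T - 2.0 * plane_d * n       # x' = H x - 2 d n
    rots_ref = np.einsum('ij,njk->nik', H, rot_mats)  # reflect each orientation
    return means_ref, rots_ref

# toy usage: mirror plane z = 0 (n = [0, 0, 1], d = 0)
means = np.array([[0.5, 0.2, 1.0], [-0.3, 0.1, 2.0]])
rots = np.repeat(np.eye(3)[None], 2, axis=0)
m_ref, _ = reflect_gaussians(means, rots, np.array([0.0, 0.0, 1.0]), 0.0)
print(m_ref)   # z-coordinates flipped: mirrored copies of the two Gaussians
```

Both the real and reflected Gaussian sets can then be rasterized jointly, which is the essence of the dual-rendering strategy described above.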
2405.11914 Report PT43D: A Probabilistic Transformer for Generating 3D Shapes from Single Highly-Ambiguous RGB Images Yiheng Xiong, Angela Dai Generating 3D shapes from single RGB images is essential in various applications such as robotics. Current approaches typically target images containing clear and complete visual descriptions of the object, without considering common realistic cases where the observed object is largely occluded or truncated. We thus propose a transformer-based autoregressive model to generate the probabilistic distribution of 3D shapes conditioned on an RGB image containing potentially highly ambiguous observations of the object. To handle realistic scenarios such as occlusion or field-of-view truncation, we create simulated image-to-shape training pairs that enable improved fine-tuning for real-world scenarios. We then adopt cross-attention to effectively identify the most relevant region of interest from the input image for shape generation. This enables inference of sampled shapes with reasonable diversity and strong alignment with the input image. We train and test our model on our synthetic data, then fine-tune and test it on real-world data. Experiments demonstrate that our model outperforms the state of the art in both scenarios. This paper proposes a transformer-based autoregressive model for generating a probabilistic distribution of 3D shapes from a single RGB image, especially those with occlusion or truncation. Generating 3D shapes from single RGB images is crucial for robotics and computer vision, but existing methods struggle with images containing ambiguous observations like occlusion or truncation. This work addresses this challenge by generating multiple plausible 3D shapes. The approach compresses 3D shapes into a low-dimensional latent representation using P-VQ-VAE. Then, a transformer model with cross-attention learns the distribution of these representations conditioned on an input image. The model is trained on a synthetic dataset with multiple ground-truth shapes per image to handle ambiguity and then fine-tuned on real-world data. The proposed method outperforms state-of-the-art methods in terms of shape generation quality on both synthetic and real-world datasets. The model generates multiple plausible 3D shape hypotheses that align well with the input image, demonstrating its ability to handle ambiguity. Pretraining on the synthetic dataset with multiple ground-truth shapes per image is shown to be effective, significantly improving performance on real-world data. The generation scale is currently limited to the object level, and expanding it to the scene level is left for future work. The diversity of generated shapes, while reasonable, is not as high as some existing methods, indicating a potential trade-off between diversity and alignment with the input image. 3d shape generation, single-view reconstruction, probabilistic modeling, transformers, cross-attention
2405.11852 Report Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion Models Xiyu Wang, Yufei Wang, Satoshi Tsutsui, Weisi Lin, Bihan Wen, Alex C. Kot Diffusion-based models for story visualization have shown promise in generating content-coherent images for storytelling tasks. However, how to effectively integrate new characters into existing narratives while maintaining character consistency remains an open problem, particularly with limited data. Two major limitations hinder the progress: (1) the absence of a suitable benchmark due to potential character leakage and inconsistent text labeling, and (2) the challenge of distinguishing between new and old characters, leading to ambiguous results. To address these challenges, we introduce the NewEpisode benchmark, comprising refined datasets designed to evaluate generative models' adaptability in generating new stories with fresh characters using just a single example story. The refined dataset involves refined text prompts and eliminates character leakage. Additionally, to mitigate the character confusion of generated results, we propose EpicEvo, a method that customizes a diffusion-based visual story generation model with a single story featuring the new characters, seamlessly integrating them into established character dynamics. EpicEvo introduces a novel adversarial character alignment module to align the generated images progressively in the diffusive process, with exemplar images of new characters, while applying knowledge distillation to prevent forgetting of characters and background details. Our evaluation quantitatively demonstrates that EpicEvo outperforms existing baselines on the NewEpisode benchmark, and qualitative studies confirm its superior customization of visual story generation in diffusion models. In summary, EpicEvo provides an effective way to incorporate new characters using only one example story, unlocking new possibilities for applications such as serialized cartoons. This paper introduces the NewEpisode benchmark for evaluating the ability of generative models to incorporate new characters into existing narratives, and proposes EpicEvo, a method for customizing diffusion-based visual story generation models to include new characters using just a single example story. The ability to seamlessly integrate new characters into established stories is crucial for applications like creating new episodes of comic books or cartoons, but existing models struggle with this due to limited data and the risk of disrupting established character dynamics. The NewEpisode benchmark is created by refining existing datasets to include unseen characters in the test set. EpicEvo uses adversarial character alignment to encourage distinct generation of new characters and knowledge distillation to preserve the model's priors and prevent overfitting. EpicEvo outperforms existing baselines on the NewEpisode benchmark in terms of FID score, indicating better new character consistency. Qualitative analysis confirms EpicEvo's superior ability to generate stories featuring new characters, both alone and interacting with existing characters. Ablation studies demonstrate the effectiveness of both the adversarial character alignment and knowledge distillation components of EpicEvo. The paper primarily focuses on visual similarity metrics like FID, CLIP-I, and CLIP-T, acknowledging the need for further investigation into human perception of story coherence and character integration. Future work could explore expanding the NewEpisode benchmark with more diverse datasets and evaluating the generalization ability of EpicEvo to characters with even fewer example images. generative diffusion model, story visualization, generative model customization, character consistency, few-shot learning
2405.11794 Report ViViD: Video Virtual Try-on using Diffusion Models Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha Video virtual try-on aims to transfer a clothing item onto the video of a target person. Directly applying the technique of image-based try-on to the video domain in a frame-wise manner will cause temporally inconsistent outcomes, while previous video-based try-on solutions can only generate low-quality, blurry results. In this work, we present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture, and insert hierarchical Temporal Modules into the text-to-image stable diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset, which is the largest, with the most diverse types of garments and the highest resolution for the task of video virtual try-on to date. Extensive experiments demonstrate that our approach is able to yield satisfactory video try-on results. The dataset, codes, and weights will be publicly available. Project page: https://becauseimbatman0.github.io/ViViD. This paper presents ViViD, a novel framework leveraging diffusion models for video virtual try-on, and introduces a new large-scale, diverse dataset for this task. Current video virtual try-on methods suffer from limitations such as temporal inconsistency, low visual quality, and lack of diverse training data, hindering their real-world application. ViViD utilizes a Garment Encoder with attention feature fusion to capture fine-grained clothing details, a Pose Encoder for spatial-temporal consistency, and temporal modules for coherent video synthesis. It is trained with an image-video joint strategy on a newly collected dataset. ViViD outperforms existing methods in generating high-quality try-on videos with better temporal consistency and detail preservation. The proposed Garment Encoder and attention feature fusion mechanism effectively capture and integrate fine-grained clothing details into the generated videos. The image-video joint training strategy proves beneficial in learning both detailed clothing representation and temporal dynamics. The current model does not generalize well to videos with extreme poses or rapid movements. Future work can explore incorporating user-specific features and preferences for personalized try-on experiences. video virtual try-on, diffusion models, temporal consistency, garment encoder, dataset
2405.11685 Report ColorFoil: Investigating Color Blindness in Large Vision and Language Models Ahnaf Mozib Samin, M. Firoz Ahmed, Md. Mushtaq Shahriyar Rafee With the utilization of Transformer architecture, large Vision and Language (V&L) models have shown promising performance in even zero-shot settings. Several studies, however, indicate a lack of robustness of the models when dealing with complex linguistics and visual attributes. In this work, we introduce a novel V&L benchmark - ColorFoil, by creating color-related foils to assess the models' perception ability to detect colors like red, white, green, etc. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower, in a zero-shot setting and present intriguing findings from the V&L models. The experimental evaluation indicates that ViLT and BridgeTower demonstrate much better color perception capabilities compared to CLIP and its variants and GroupViT. Moreover, CLIP-based models and GroupViT struggle to distinguish colors that are visually distinct to humans with normal color perception ability. This paper introduces ColorFoil, a novel Vision and Language (V&L) benchmark, to assess the ability of V&L models to perceive and identify color attributes. This work investigates the robustness and generalizability of V&L models in perceiving colors, a crucial aspect of human-like visual understanding, essential for real-world applications. ColorFoil is constructed by creating color-related foils from MS COCO and Flickr30k datasets. The models' ability to distinguish between original captions and color-foiled versions is evaluated using accuracy and F1-score. BridgeTower and ViLT models demonstrate superior color perception compared to CLIP and its variants, as well as GroupViT. CLIP-based models and GroupViT struggle to differentiate colors easily distinguishable by humans. Model performance degrades with an increase in the number of foils, highlighting a challenge in handling complex scenarios. The selection of 10 common colors for foils is subjective and might not represent the full spectrum of frequently used colors. Future work includes expanding the benchmark to assess robustness in other areas like gender, size, emotions, and negation. vision and language, v&l models, color perception, benchmarking, robustness
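The foil-construction step can be illustrated with a small sketch that swaps one color word in a caption for a different color from an assumed common-color list; the helper name and the exact color set are hypothetical, not taken from the paper.

```python
import random

COLORS = ["red", "white", "green", "blue", "black", "yellow",
          "brown", "pink", "orange", "purple"]   # assumed common-color set

def make_color_foil(caption, rng=random.Random(0)):
    """Return a (caption, foil) pair in which the first color word is
    replaced by a different color, or None if no color is mentioned."""
    tokens = caption.split()
    for i, tok in enumerate(tokens):
        stripped = tok.lower().strip(".,")
        if stripped in COLORS:
            foil_color = rng.choice([c for c in COLORS if c != stripped])
            punct = tok[len(stripped):]          # keep trailing punctuation
            foiled = tokens.copy()
            foiled[i] = foil_color + punct
            return caption, " ".join(foiled)
    return None

print(make_color_foil("A man in a red shirt rides a bicycle."))
# ('A man in a red shirt rides a bicycle.', 'A man in a <other color> shirt rides a bicycle.')
```

A model with good color perception should score the original caption higher than the foiled one for the matching image.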
2405.11616 Report Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wenhan Luo, Ping Tan, Wenping Wang, Qifeng Liu, Yike Guo In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficacy, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images should comply with a predefined camera type, e.g. a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails. Moreover, the full-image or dense multiview attention they employ leads to an exponential explosion of computational complexity as image resolution increases, resulting in prohibitively expensive training costs. To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module to estimate the focal length and elevation of the input image, which allows our method to generate images without shape distortions. Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion. Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images with up to a 512*512 resolution while reducing computation complexity by 12x. Comprehensive experiments demonstrate that Era3D can reconstruct high-quality and detailed 3D meshes from diverse single-view input images, significantly outperforming baseline multiview diffusion methods. Era3D, a novel multiview diffusion method that generates high-resolution multiview images from single-view images by addressing camera prior mismatch, inefficacy, and low resolution in existing methods. Existing multiview generation methods suffer from limitations like camera prior mismatch, inefficacy, and low resolution, leading to poor-quality multiview images and hindering high-quality 3D reconstruction. Era3D uses different camera models for input (arbitrary) and generated images (orthogonal with fixed viewpoints) and employs a camera prediction module to estimate focal length and elevation. It introduces row-wise attention for efficient cross-view information fusion. Generates high-quality, consistent multiview images and normal maps at resolutions up to 512x512. Successfully mitigates distortion artifacts caused by inconsistent camera intrinsics. Achieves state-of-the-art performance for single-view 3D generation. Struggles to generate intricate geometries and open meshes due to sparse multiview generation. Reliance on Neural SDF limits reconstruction of meshes with open surfaces. multiview diffusion, 3d reconstruction, row-wise attention, camera canonicalization, single-view 3d generation
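Row-wise attention can be sketched as restricting cross-view attention to tokens that share the same image row, so each attention sequence has length V*W instead of V*H*W. The PyTorch sketch below is a hedged illustration under that assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RowWiseAttention(nn.Module):
    """Attend across views along matching image rows.
    Input x: (B, V, H, W, C). Each of the B*H rows becomes one attention
    sequence of length V*W, instead of full attention over V*H*W tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, V, H, W, C = x.shape
        rows = x.permute(0, 2, 1, 3, 4).reshape(B * H, V * W, C)  # group by row
        out, _ = self.attn(rows, rows, rows)
        return out.reshape(B, H, V, W, C).permute(0, 2, 1, 3, 4)

x = torch.randn(2, 6, 32, 32, 64)        # 6 views, 32x32 latents, dim 64
print(RowWiseAttention(64)(x).shape)      # torch.Size([2, 6, 32, 32, 64])
```

This row grouping is what makes the epipolar prior cheap to enforce once all generated views share canonicalized cameras at the same elevation.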
2405.11523 Report Diffusion-Based Hierarchical Image Steganography Youmin Xu, Xuanyu Zhang, Jiwen Yu, Chong Mou, Xiandong Meng, Jian Zhang This paper introduces Hierarchical Image Steganography, a novel method that enhances the security and capacity of embedding multiple images into a single container using diffusion models. HIS assigns varying levels of robustness to images based on their importance, ensuring enhanced protection against manipulation. It adaptively exploits the robustness of the Diffusion Model alongside the reversibility of the Flow Model. The integration of Embed-Flow and Enhance-Flow improves embedding efficiency and image recovery quality, respectively, setting HIS apart from conventional multi-image steganography techniques. This innovative structure can autonomously generate a container image, thereby securely and efficiently concealing multiple images and text. Rigorous subjective and objective evaluations underscore our advantage in analytical resistance, robustness, and capacity, illustrating its expansive applicability in content safeguarding and privacy fortification. This paper proposes Hierarchical Image Steganography (HIS), a novel method using diffusion models to embed multiple images into a single container image with varying levels of robustness based on image importance. Existing multi-image steganography methods lack robustness and don't differentiate between the importance of embedded images, making them vulnerable to degradation. HIS employs a tiered embedding strategy using diffusion models for robust embedding of important images (Tier-1) and flow models for high-capacity embedding of less important images (Tier-2). It further integrates Embed-Flow and Enhance-Flow to improve embedding efficiency and image recovery quality. HIS demonstrates superior robustness against various distortions, ensuring integrity of important images. The tiered embedding strategy allows for high-capacity embedding while maintaining significant robustness. HIS exhibits outstanding statistical security, effectively confusing steganalysis tools. The recovery quality of Tier-2 images degrades with an increasing number of embedded images. Local tampering on the container image can lead to information loss in Tier-2 images. steganography, diffusion models, image hiding, robustness, security
2405.11473 Report FIFO-Diffusion: Generating Infinite Videos from Text without Training Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword as the frames near the tail can take advantage of cleaner ones by forward reference but such a strategy induces the discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines. FIFO-Diffusion, a novel inference technique based on pretrained diffusion models for generating infinitely long videos without additional training. Long video generation remains challenging for diffusion-based models due to computational costs and limitations in capturing long-term temporal context. FIFO-Diffusion utilizes diagonal denoising, processing consecutive frames with increasing noise levels in a queue. It incorporates latent partitioning to reduce training-inference gap and lookahead denoising to enhance noise prediction accuracy. FIFO-Diffusion can generate extremely long videos (over 10,000 frames) without quality degradation, relying solely on models trained with short clips. It produces videos with natural and consistent motion by propagating temporal context throughout the generation process. Qualitative comparisons and user study show that FIFO-Diffusion significantly outperforms other training-free long video generation methods. Training-inference gap remains due to the change in input distribution induced by diagonal denoising. Future work includes integrating diagonal denoising into the training process to further improve the performance. text-to-video generation, diffusion models, long video generation, diagonal denoising, latent partitioning
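The diagonal-denoising queue can be sketched as follows; `denoise_step` is a stand-in for one call to a pretrained video diffusion model, and the noise schedule here is deliberately simplified to one level per queue slot. This is an illustrative sketch of the queue mechanic only, not the paper's full method (latent partitioning and lookahead denoising are omitted).

```python
from collections import deque
import torch

def fifo_diffusion(denoise_step, num_out_frames, queue_len, latent_shape):
    """Minimal sketch of FIFO-style diagonal denoising.
    The queue holds `queue_len` frame latents at strictly increasing noise
    levels (head = almost clean, tail = pure noise). Each iteration applies
    one joint denoising step, pops the now-clean head frame, and pushes fresh
    noise at the tail, so generation can continue indefinitely."""
    timesteps = torch.arange(1, queue_len + 1)                 # head -> tail
    queue = deque(torch.randn(latent_shape) for _ in range(queue_len))
    outputs = []
    while len(outputs) < num_out_frames:
        latents = torch.stack(list(queue))                     # (queue_len, *latent_shape)
        latents = denoise_step(latents, timesteps)             # one diagonal step
        queue = deque(latents.unbind(0))
        outputs.append(queue.popleft())                        # fully denoised frame
        queue.append(torch.randn(latent_shape))                # new noise at the tail
    return torch.stack(outputs)

# toy "denoiser" that just shrinks every latent a little per call
frames = fifo_diffusion(lambda z, t: 0.9 * z, num_out_frames=16,
                        queue_len=8, latent_shape=(4, 32, 32))
print(frames.shape)   # torch.Size([16, 4, 32, 32])
```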
2405.11467 Report AdaAugment: A Tuning-Free and Adaptive Approach to Enhance Data Augmentation Suorong Yang, Peijia Li, Xin Xiong, Furao Shen, Jian Zhao Data augmentation (DA) is widely employed to improve the generalization performance of deep models. However, most existing DA methods use augmentation operations with random magnitudes throughout training. While this fosters diversity, it can also inevitably introduce uncontrolled variability in augmented data, which may cause misalignment with the evolving training status of the target models. Both theoretical and empirical findings suggest that this misalignment increases the risks of underfitting and overfitting. To address these limitations, we propose AdaAugment, an innovative and tuning-free Adaptive Augmentation method that utilizes reinforcement learning to dynamically adjust augmentation magnitudes for individual training samples based on real-time feedback from the target network. Specifically, AdaAugment features a dual-model architecture consisting of a policy network and a target network, which are jointly optimized to effectively adapt augmentation magnitudes. The policy network optimizes the variability within the augmented data, while the target network utilizes the adaptively augmented samples for training. Extensive experiments across benchmark datasets and deep architectures demonstrate that AdaAugment consistently outperforms other state-of-the-art DA methods in effectiveness while maintaining remarkable efficiency. This paper proposes AdaAugment, a novel adaptive data augmentation method that uses reinforcement learning to dynamically adjust augmentation magnitudes for individual training samples based on real-time feedback from the target network. Existing data augmentation methods often employ random or predefined augmentation magnitudes, leading to potential misalignment with the evolving training status of deep models and increasing the risks of underfitting and overfitting. AdaAugment utilizes a dual-model architecture with a policy network and a target network. The policy network learns to determine optimal augmentation magnitudes based on real-time feedback from the target network, which is simultaneously trained using the adaptively augmented data. AdaAugment consistently outperforms state-of-the-art data augmentation methods across benchmark datasets (CIFAR-10/100, Tiny-ImageNet) and deep architectures. AdaAugment demonstrates improved model transferability in transfer learning settings. Complexity analysis reveals that AdaAugment incurs minimal parameter and computational overhead, highlighting its efficiency. The current study focuses on image classification tasks; future work can explore AdaAugment's applicability to other domains. Future research can investigate the generalization of AdaAugment to a broader range of tasks beyond image classification. data augmentation, reinforcement learning, deep learning, image classification, adaptive methods
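A minimal sketch of the dual-model idea follows, assuming the per-sample feedback is just (current loss, training progress) and the action space is a single rotation magnitude; both choices are illustrative stand-ins, and the real method trains the policy with reinforcement learning rather than backpropagating through the augmentation.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

class MagnitudePolicy(nn.Module):
    """Tiny policy head: maps per-sample feedback features (here the sample's
    current loss and the training progress) to a magnitude in [0, 1]."""
    def __init__(self, in_dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feedback):
        return self.net(feedback).squeeze(-1)          # (B,)

def apply_adaptive_rotation(images, magnitudes, max_deg=30.0):
    """Scale a rotation augmentation per sample by its predicted magnitude."""
    return torch.stack([TF.rotate(img, float(m) * max_deg)
                        for img, m in zip(images, magnitudes)])

policy = MagnitudePolicy()
images = torch.rand(4, 3, 32, 32)
feedback = torch.rand(4, 2)                # e.g. [per-sample loss, epoch fraction]
mags = policy(feedback)                    # magnitudes in [0, 1]
aug = apply_adaptive_rotation(images, mags.detach())
print(aug.shape, mags.shape)               # torch.Size([4, 3, 32, 32]) torch.Size([4])
```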
2405.11442 Report Unifying 3D Vision-Language Understanding via Promptable Queries Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by 1.8% (AP), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input. This paper introduces PQ3D, a unified model using Promptable Queries to manage various 3D scene representations, prompts, and outputs for numerous 3D vision-language (3D-VL) tasks. A unified model for 3D scene understanding is crucial for embodied agents to understand and execute human instructions in real-world scenarios, bridging the gap between low-level instance segmentation and high-level reasoning. PQ3D unifies point cloud, voxel, and multi-view image features into a shared 3D space, employs an attention-based query decoder for task-specific information retrieval guided by prompts, and utilizes universal output heads for predicting instance masks, task-relevance scores, and textual responses. PQ3D achieves state-of-the-art results on ten diverse 3D-VL datasets, setting new records on most benchmarks, including ScanNet200, ScanRefer, Multi3DRefer, and Scan2Cap. The model demonstrates strong zero-shot capability with novel prompt types, such as using image sketches for object localization. PQ3D shows promising results in embodied navigation and task planning, highlighting its potential as a fundamental 3D encoding module for embodied agents. The model's performance on tail classes in instance segmentation is less robust due to biases in the CLIP text encoder. PQ3D's ability to handle complex spatial relations and long sentences in visual grounding and question answering can be further improved. 3d vision-language understanding, promptable queries, unified model, embodied ai, multi-task learning
2405.11286 Report Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion Zeyu Zhang, Yiran Wang, Biao Wu, Shuo Chen, Zhiyuan Zhang, Shiya Huang, Wenbo Zhang, Meng Fang, Ling Chen, Yang Zhao In recent years, there has been significant interest in creating 3D avatars and motions, driven by their diverse applications in areas like film-making, video games, AR/VR, and human-robot interaction. However, current efforts primarily concentrate on either generating the 3D avatar mesh alone or producing motion sequences, with integrating these two aspects proving to be a persistent challenge. Additionally, while avatar and motion generation predominantly target humans, extending these techniques to animals remains a significant challenge due to inadequate training data and methods. To bridge these gaps, our paper presents three key contributions. Firstly, we proposed a novel agent-based approach named Motion Avatar, which allows for the automatic generation of high-quality customizable human and animal avatars with motions through text queries. The method significantly advanced the progress in dynamic 3D character generation. Secondly, we introduced a LLM planner that coordinates both motion and avatar generation, which transforms a discriminative planning into a customizable Q&A fashion. Lastly, we presented an animal motion dataset named Zoo-300K, comprising approximately 300,000 text-motion pairs across 65 animal categories and its building pipeline ZooGen, which serves as a valuable resource for the community. See project website https://steve-zeyu-zhang.github.io/MotionAvatar/ This paper introduces Motion Avatar, an LLM agent-based method for generating customizable human and animal avatars with motions based on text input. Current methods struggle to integrate 3D avatar mesh generation and motion generation, especially for animals due to data scarcity. This work bridges this gap and enables customizable avatar creation with realistic motions. The approach leverages an LLM planner to process user queries and generate prompts for motion (using MoMask) and 3D mesh generation (using Stable Diffusion XL and TripoSR). It also introduces Zoo-300K, a new animal motion dataset with 300,000 text-motion pairs across 65 animal categories, created using the ZooGen pipeline. The LLM planner effectively extracts motion and avatar categories from user input and generates appropriate prompts for downstream generation. Motion Avatar generates high-quality and customizable human and animal avatars with realistic motions from text descriptions. The Zoo-300K dataset and ZooGen pipeline provide valuable resources for future research on animal motion generation. Quantitative evaluation of animal motion generation is still in progress and will be included in the next revision. Future work will focus on enhancing the LLM planner's generalization ability to encompass broader dynamic avatar generation tasks. text-to-motion generation, 3d avatar generation, llm agent, animal motion dataset, customizable avatar
2405.11273 Report Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to efficiently scale large language and image-text models, these efforts typically involve fewer experts and limited modalities. To address this, our work presents the pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance the multi-expert collaboration and generalization, we present a progressive training strategy: 1) Cross-modality alignment using various connectors with different cross-modality data, 2) Training modality-specific experts with cross-modality instruction data to activate experts' preferences, and 3) Tuning the Uni-MoE framework utilizing Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. The extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization. Our findings highlight the substantial potential of MoE frameworks in advancing MLLMs and the code is available at https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs. This paper introduces Uni-MoE, a novel unified Multimodal Large Language Model (MLLM) that leverages the Mixture of Experts (MoE) architecture for efficient scaling and handling of various modalities such as video, image, text, audio, and speech. Scaling up MLLMs incurs high computational costs. Uni-MoE addresses this by activating only a subset of expert parameters per input, improving efficiency in training and inference. Uni-MoE uses modality-specific encoders and connectors to map inputs into a unified language representation. A sparse MoE layer within the LLM allows for selective expert activation. The model is trained in three stages: cross-modality alignment, modality-specific expert training, and unified MoE training with mixed multimodal data. Uni-MoE outperforms dense MLLMs on various benchmarks, demonstrating advantages in handling complex out-of-domain tasks, particularly long speech understanding and reasoning. The model exhibits less performance bias across different modalities compared to dense models, even when trained on unbalanced mixed-modality data. Pre-training experts on individual modalities enhances multi-expert collaboration and generalization compared to standard MoE tuning with identical initial expert parameters. Fully converting all layers to MoE does not necessarily yield the best performance and requires longer training. Further exploration of more robust and efficient MoE architectures is needed for larger MLLMs. mixture of experts, multimodal large language model, unified framework, multimodal learning, cross-modal reasoning
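The sparse MoE layer at the core of this design can be sketched as a top-k routed feed-forward block; the expert count, hidden sizes, and k below are illustrative, not Uni-MoE's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal sparse MoE FFN: a router picks the top-k experts per token and
    combines their outputs with renormalized gate weights."""
    def __init__(self, dim, hidden, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)])

    def forward(self, x):                                     # x: (tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)             # (T, E)
        topv, topi = gates.topk(self.k, dim=-1)                # (T, k)
        topv = topv / topv.sum(dim=-1, keepdim=True)            # renormalize
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topi == e)                                   # (T, k)
            if mask.any():
                rows = mask.any(dim=-1)                          # tokens routed to e
                w = (topv * mask).sum(dim=-1, keepdim=True)      # gate weight for e
                out[rows] += w[rows] * expert(x[rows])           # only selected tokens
        return out

x = torch.randn(10, 64)
print(SparseMoE(64, 256)(x).shape)   # torch.Size([10, 64])
```

Only the routed experts run per token, which is what keeps the activated parameter count roughly constant as more experts (and modalities) are added.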
2405.11252 Report Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching Xingyu Miao, Haoran Duan, Varun Ojha, Jun Song, Tejal Shah, Yang Long, Rajiv Ranjan In this work, we propose a novel Trajectory Score Matching (TSM) method that aims to solve the pseudo ground truth inconsistency problem caused by the accumulated error in Interval Score Matching (ISM) when using the Denoising Diffusion Implicit Models (DDIM) inversion process. Unlike ISM which adopts the inversion process of DDIM to calculate on a single path, our TSM method leverages the inversion process of DDIM to generate two paths from the same starting point for calculation. Since both paths start from the same starting point, TSM can reduce the accumulated error compared to ISM, thus alleviating the problem of pseudo ground truth inconsistency. TSM enhances the stability and consistency of the model's generated paths during the distillation process. We demonstrate this experimentally and further show that ISM is a special case of TSM. Furthermore, to optimize the current multi-stage optimization process from high-resolution text to 3D generation, we adopt Stable Diffusion XL for guidance. In response to the issues of abnormal replication and splitting caused by unstable gradients during the 3D Gaussian splatting process when using Stable Diffusion XL, we propose a pixel-by-pixel gradient clipping method. Extensive experiments show that our model significantly surpasses the state-of-the-art models in terms of visual quality and performance. Code: \url{https://github.com/xingy038/Dreamer-XL}. This paper introduces Dreamer XL, a novel text-to-3D generation method that leverages Trajectory Score Matching (TSM) and Stable Diffusion XL for high-quality and consistent 3D content creation. Existing text-to-3D methods suffer from limitations such as over-smoothing, low resolution, and inconsistencies in generated results. This work aims to address these issues and enhance the realism and detail of generated 3D content. The proposed TSM method utilizes dual paths during the DDIM inversion process to minimize accumulated errors and improve consistency. Additionally, the work incorporates Stable Diffusion XL for high-resolution guidance and introduces a pixel-by-pixel gradient clipping method to address gradient instability issues. Dreamer XL generates high-quality 3D content with realistic appearances and avoids over-smoothing and oversaturation. Compared to state-of-the-art methods, Dreamer XL demonstrates superior visual quality and consistency, as evidenced by qualitative comparisons and quantitative metrics such as CLIP-Score and A-LPIPS. Ablation studies confirm the effectiveness of the proposed TSM and gradient clipping techniques in enhancing the quality and consistency of the generated 3D models. The method exhibits limitations in handling light, particularly with anomalous blue reflections observed in generated scenes, potentially attributed to SDXL. The advancements in 3D model generation might be misused for malicious purposes like deepfakes. text-to-3d generation, trajectory score matching, stable diffusion xl, 3d gaussian splatting, deep learning
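Pixel-by-pixel gradient clipping can be sketched as rescaling each pixel's gradient vector on the rendered image before it flows back into the Gaussian parameters; the threshold value and the hook-based wiring below are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def clip_pixel_gradients(grad, max_norm=0.1):
    """Clip the gradient of a rendered image per pixel: each pixel's
    C-dimensional gradient vector is rescaled to at most `max_norm`.
    grad: (B, C, H, W) gradient w.r.t. the rendered image."""
    norms = grad.norm(dim=1, keepdim=True)                      # (B, 1, H, W)
    scale = (max_norm / norms.clamp(min=1e-12)).clamp(max=1.0)
    return grad * scale

# typical use: register as a backward hook on the rendered image so the
# clipped gradient is what flows back into the 3D Gaussian parameters
rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
rendered.register_hook(lambda g: clip_pixel_gradients(g, max_norm=0.05))
loss = (rendered ** 2).sum()
loss.backward()
print(rendered.grad.norm(dim=1).max())   # <= 0.05 for every pixel
```

Capping outlier per-pixel gradients in this way is one concrete way to suppress the abnormal replication and splitting of Gaussians attributed above to unstable gradients.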
2405.11236 Report TriLoRA: Integrating SVD for Advanced Style Personalization in Text-to-Image Generation Chengcheng Feng, Mu He, Qiuyu Tian, Haojie Yin, Xiaofang Zhao, Hongwei Tang, Xingqiang Wei As deep learning technology continues to advance, image generation models, especially models like Stable Diffusion, are finding increasingly widespread application in visual arts creation. However, these models often face challenges such as overfitting, lack of stability in generated results, and difficulties in accurately capturing the features desired by creators during the fine-tuning process. In response to these challenges, we propose an innovative method that integrates Singular Value Decomposition (SVD) into the Low-Rank Adaptation (LoRA) parameter update strategy, aimed at enhancing the fine-tuning efficiency and output quality of image generation models. By incorporating SVD within the LoRA framework, our method not only effectively reduces the risk of overfitting but also enhances the stability of model outputs, and captures subtle, creator-desired feature adjustments more accurately. We evaluated our method on multiple datasets, and the results show that, compared to traditional fine-tuning methods, our approach significantly improves the model's generalization ability and creative flexibility while maintaining the quality of generation. Moreover, this method maintains LoRA's excellent performance under resource-constrained conditions, allowing for significant improvements in image generation quality without sacrificing the original efficiency and resource advantages. Introduces TriLoRA, an innovative method integrating Singular Value Decomposition (SVD) into the Low-Rank Adaptation (LoRA) framework for enhanced fine-tuning of text-to-image generation models. Addresses challenges in existing models like overfitting, output instability, and difficulty capturing nuanced style features during fine-tuning. Incorporates SVD within LoRA to create a triple low-rank matrix representation, enabling more precise control over feature integration during model training. Demonstrates superior visual quality and stability in generated images compared to traditional LoRA. Shows greater resistance to overfitting, particularly during extended training periods. Exhibits improved performance in user studies, achieving higher scores in textual-visual consistency and visual appeal. Increased model complexity leading to longer convergence times, requiring more training epochs. Performance improvement is limited by the quality of the pre-trained model used as a foundation. text-to-image generation, stable diffusion, fine-tuning, low-rank adaptation (lora), singular value decomposition (svd)
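The triple low-rank update can be sketched as replacing LoRA's delta W = B A with an SVD-style delta W = B diag(s) A; the class and parameter names below are hypothetical, and this is a sketch of the factorization rather than the paper's training setup.

```python
import torch
import torch.nn as nn

class TriLoRALinear(nn.Module):
    """Frozen linear layer with an SVD-style triple low-rank update:
        W_eff = W + alpha * B @ diag(s) @ A
    Only A, s, B are trained, mirroring U Sigma V^T."""
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # ~ V^T
        self.s = nn.Parameter(torch.ones(rank))                  # ~ singular values
        self.B = nn.Parameter(torch.zeros(d_out, rank))          # ~ U (zero-init so the
        self.alpha = alpha                                        #   update starts at 0)

    def forward(self, x):
        delta = ((x @ self.A.t()) * self.s) @ self.B.t()          # low-rank path
        return self.base(x) + self.alpha * delta

layer = TriLoRALinear(nn.Linear(128, 64), rank=8)
print(layer(torch.randn(2, 128)).shape)   # torch.Size([2, 64])
```

The learnable middle factor s gives explicit, per-rank control over how strongly each direction of the adaptation is expressed, which is the intuition behind the claimed gain in stability over plain LoRA.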
2405.11190 Report ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing Ying Jin, Pengyang Ling, Xiaoyi Dong, Pan Zhang, Jiaqi Wang, Dahua Lin Instruction-based image editing focuses on equipping a generative model with the capacity to adhere to human-written instructions for editing images. Current approaches typically comprehend explicit and specific instructions. However, they often exhibit a deficiency in executing active reasoning capacities required to comprehend instructions that are implicit or insufficiently defined. To enhance active reasoning capabilities and impart intelligence to the editing model, we introduce ReasonPix2Pix, a comprehensive reasoning-attentive instruction editing dataset. The dataset is characterized by 1) reasoning instruction, 2) more realistic images from fine-grained categories, and 3) increased variances between input and edited images. When fine-tuned with our dataset under supervised conditions, the model demonstrates superior performance in instructional editing tasks, independent of whether the tasks require reasoning or not. The code, model, and dataset will be publicly available. This paper introduces ReasonPix2Pix, a dataset for instruction-based image editing focusing on reasoning abilities, and proposes a simple framework incorporating a multi-modal large language model (MLLM) with a diffusion model to improve image editing with reasoning instructions. Existing instruction-based image editing models often lack active reasoning capabilities, failing to understand implicit or insufficiently defined instructions. This paper addresses this by enabling models to understand the intent behind instructions rather than just recognizing keywords. The authors create ReasonPix2Pix dataset by generating reasoning instructions for image pairs from existing datasets and generating new image pairs with reasoning instructions. They then fine-tune a framework with an MLLM and a diffusion model on this dataset. The proposed method demonstrates superior performance in instruction editing tasks, both with and without reasoning requirements. The model successfully handles complex instructions and generates high-quality edited images. Analysis confirms the importance of the proposed dataset and the effectiveness of integrating MLLM for improving image editing with reasoning. The dataset size is limited due to API costs, although researchers can expand it using the provided pipeline. Future work could explore more complex reasoning scenarios and further enhance the model's ability to handle abstract instructions. image editing, instruction-based editing, reasoning, multi-modal large language model, diffusion model
2405.11135 Report AquaLoRA: Toward White-box Protection for Customized Stable Diffusion Models via Watermark LoRA Weitao Feng, Wenbo Zhou, Jiyan He, Jie Zhang, Tianyi Wei, Guanlin Li, Tianwei Zhang, Weiming Zhang, Nenghai Yu Diffusion models have achieved remarkable success in generating high-quality images. Recently, the open-source models represented by Stable Diffusion (SD) are thriving and are accessible for customization, giving rise to a vibrant community of creators and enthusiasts. However, the widespread availability of customized SD models has led to copyright concerns, like unauthorized model distribution and unconsented commercial use. To address it, recent works aim to let SD models output watermarked content for post-hoc forensics. Unfortunately, none of them can achieve the challenging white-box protection, wherein the malicious user can easily remove or replace the watermarking module to fail the subsequent verification. For this, we propose AquaLoRA as the first implementation under this scenario. Briefly, we merge watermark information into the U-Net of Stable Diffusion Models via a watermark Low-Rank Adaptation (LoRA) module in a two-stage manner. For the watermark LoRA module, we devise a scaling matrix to achieve flexible message updates without retraining. To guarantee fidelity, we design Prior Preserving Fine-Tuning (PPFT) to ensure watermark learning with minimal impacts on model distribution, validated by proofs. Finally, we conduct extensive experiments and ablation studies to verify our design. This paper introduces AquaLoRA, a novel technique to watermark customized Stable Diffusion models for white-box protection, ensuring copyright in open-source environments. The open-source nature of Stable Diffusion models raises copyright concerns as customized models are easily redistributed without consent, necessitating robust watermarking solutions. AquaLoRA operates in two stages: (1) It pre-trains a latent watermark, optimizing robustness and fidelity with a novel Peak Regional Variation Loss. (2) It uses a scaling matrix within a Low-Rank Adaptation (LoRA) module for flexible watermark embedding and a prior preserving fine-tuning method to minimize visual impact on generated images. AquaLoRA achieves high fidelity, with negligible impact on image quality compared to original models. The method exhibits robustness against various image distortions, sampling configurations, and the use of add-ons like ControlNet and LoRA. AquaLoRA provides flexibility, allowing for easy modification of the embedded watermark without retraining. The current method faces limitations in handling heavy cropping and rotation distortions. Future work will focus on extending AquaLoRA's protection to editing, inpainting, and outpainting functionalities. The performance degradation with larger output image sizes requires further investigation and improvement. stable diffusion, watermarking, copyright protection, white-box protection, generative ai
2405.11129 Report MotionGS : Compact Gaussian Splatting SLAM by Motion Filter Xinli Guo, Peng Han, Weidong Zhang, Hongtian Chen With their high-fidelity scene representation capability, Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have deeply attracted the attention of the SLAM field. Recently, there has been a surge in NeRF-based SLAM, while 3DGS-based SLAM remains sparse. A novel 3DGS-based SLAM approach with a fusion of deep visual features, dual keyframe selection, and 3DGS is presented in this paper. Compared with existing methods, the proposed selective tracking is achieved by feature extraction and a motion filter on each frame. The joint optimization of pose and 3D Gaussians runs through the entire mapping process. Additionally, coarse-to-fine pose estimation and a compact Gaussian scene representation are implemented by dual keyframe selection and novel loss functions. Experimental results demonstrate that the proposed algorithm not only outperforms existing methods in tracking and mapping, but also has lower memory usage. MotionGS, a novel dense 3D Gaussian Splatting (3DGS)-based SLAM approach that combines deep visual features, a dual keyframe selection strategy, and 3DGS for accurate real-time tracking and high-fidelity scene reconstruction. Existing dense visual SLAM methods, including those based on NeRF, face limitations in achieving high-fidelity representation, real-time performance, and efficient memory usage. 3DGS offers a promising alternative with faster optimization and rendering compared to NeRF. The approach employs a dual keyframe strategy with motion and information filters to select keyframes for tracking and mapping. A novel loss function and direct pose optimization tailored for 3DGS are introduced to refine camera poses and compactly represent the scene. MotionGS achieves state-of-the-art tracking accuracy on both TUM RGB-D and Replica datasets, outperforming existing NeRF-based and 3DGS-based SLAM methods. It demonstrates superior rendering quality compared to baselines, capturing finer details and textures. The approach significantly reduces memory usage for map representation compared to previous 3DGS-based methods. The lack of loop closure detection and global bundle adjustment in the monocular setting limits tracking accuracy in challenging scenarios. Future work will focus on extending the approach to multi-sensor fusion and large-scale outdoor environments. slam, 3d gaussian splatting, dense visual slam, keyframe selection, scene representation
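The motion filter used for selective tracking can be sketched as a relative-pose test against the last keyframe: a frame is promoted only when translation or rotation exceeds a threshold. The thresholds and pose convention below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def motion_filter(T_last_kf, T_curr, trans_thresh=0.05, rot_thresh_deg=5.0):
    """Decide whether the current frame should become a new keyframe based on
    relative motion to the last keyframe. Poses are 4x4 camera-to-world
    matrices; thresholds (5 cm, 5 degrees) are illustrative."""
    T_rel = np.linalg.inv(T_last_kf) @ T_curr
    trans = np.linalg.norm(T_rel[:3, 3])
    # rotation angle from the trace of the relative rotation matrix
    cos_angle = np.clip((np.trace(T_rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    return trans > trans_thresh or angle_deg > rot_thresh_deg

T0 = np.eye(4)
T1 = np.eye(4); T1[:3, 3] = [0.1, 0.0, 0.0]     # 10 cm translation
print(motion_filter(T0, T1))                     # True -> promote to keyframe
```

Skipping near-static frames in this way is what keeps both the tracked frame set and the resulting Gaussian map compact.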
2405.10988 Report Flow Score Distillation for Diverse Text-to-3D Generation Runjie Yan, Kailu Wu, Kaisheng Ma Recent advancements in Text-to-3D generation have yielded remarkable progress, particularly through methods that rely on Score Distillation Sampling (SDS). While SDS exhibits the capability to create impressive 3D assets, it is hindered by its inherent maximum-likelihood-seeking essence, resulting in limited diversity in generation outcomes. In this paper, we discover that the Denoise Diffusion Implicit Models (DDIM) generation process (i.e., the PF-ODE) can be succinctly expressed using an analogue of SDS loss. Going one step further, one can see SDS as a generalized DDIM generation process. Following this insight, we show that the noise sampling strategy in the noise addition stage significantly restricts the diversity of generation results. To address this limitation, we present an innovative noise sampling approach and introduce a novel text-to-3D method called Flow Score Distillation (FSD). Our validation experiments across various text-to-image Diffusion Models demonstrate that FSD substantially enhances generation diversity without compromising quality. This paper introduces Flow Score Distillation (FSD), a novel text-to-3D generation method that leverages pre-trained 2D text-to-image Diffusion Models. FSD enhances generation diversity by introducing a new noise sampling approach within the Score Distillation Sampling (SDS) framework. Existing SDS-based methods, while effective in generating high-quality 3D assets, suffer from limited diversity due to their inherent maximum-likelihood-seeking nature. This limitation restricts the range of generated outputs. The paper first establishes a theoretical connection between SDS and the DDIM generation process, revealing SDS as a generalized DDIM process for 3D representations. Building upon this insight, it identifies the noise sampling strategy in SDS as the primary factor limiting diversity. FSD addresses this by employing a deterministic world-map noise function to generate coarsely aligned noise, promoting consistent optimization trajectories and enhancing diversity. FSD significantly enhances generation diversity compared to traditional SDS-based methods, producing a wider range of 3D models from the same text prompt. The method maintains the generation quality of SDS, ensuring that the generated 3D models remain realistic and detailed. FSD achieves diversity improvement without introducing additional training costs compared to SDS. While improving diversity, FSD still faces challenges in achieving the same level of diversity observed in 2D image generation using DDIM. The deterministic noise function in FSD, while effective, relies on manual design; exploring learned or more sophisticated noise functions could further enhance diversity. 3d generation, noise prior, diffusion models, score distillation sampling, text-to-3d
2405.10864 Report Improving face generation quality and prompt following with synthetic captions Michail Tarasiou, Stylianos Moschoglou, Jiankang Deng, Stefanos Zafeiriou Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the ability to depict a wide range of objects. However, ensuring that these models adhere closely to the text prompts remains a considerable challenge. This issue is particularly pronounced when trying to generate photorealistic images of humans. Without significant prompt engineering efforts, models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the nature of captions accompanying the images used in training large-scale diffusion models, which typically prioritize contextual information over details related to the person's appearance. In this paper we address this issue by introducing a training-free pipeline designed to generate accurate appearance descriptions from images of people. We apply this method to create approximately 250,000 captions for publicly available face datasets. We then use these synthetic captions to fine-tune a text-to-image diffusion model. Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces and enhances adherence to the given prompts, compared to the baseline model. We share our synthetic captions, pretrained checkpoints and training code. This paper introduces a training-free pipeline to generate detailed appearance descriptions from human face images, using these descriptions to fine-tune a text-to-image diffusion model for improved realism and prompt adherence in generating human faces. Existing text-to-image models struggle to generate realistic and accurate human faces due to the lack of detailed appearance information in typical image captions used for training. The pipeline extracts facial features (age, gender, ethnicity, emotion, hair, etc.) from images using pre-trained models. These features are converted into natural language descriptions using an LLM (Vicuna 13B). These descriptions are then used to fine-tune a Stable Diffusion 2.1 model. The fine-tuned model generates more realistic human faces compared to the base Stable Diffusion model. The model demonstrates better adherence to detailed prompts in generating specific facial features. The model exhibits some degree of identity preservation across different age, ethnicity, and emotion attributes. The pipeline inherits potential biases from the pre-trained face analysis models used for feature extraction. The fine-tuned model might still exhibit biases present in the original Stable Diffusion model and the selected finetuning datasets. text-to-image generation, diffusion models, facial image description, synthetic captions, realistic face generation
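The caption-construction step can be sketched as a template that turns predicted face attributes into an appearance description; the paper uses an LLM (Vicuna 13B) for this step, so the template and attribute names below are a hypothetical simplification.

```python
def build_appearance_caption(attrs):
    """Turn predicted face attributes into a natural-language appearance
    description (a simple template stand-in for the paper's LLM step).
    Attribute names here are illustrative."""
    parts = [f"a {attrs['age']}-year-old {attrs['ethnicity']} {attrs['gender']}"]
    parts.append(f"with {attrs['hair_color']} {attrs['hair_style']} hair")
    if attrs.get("facial_hair"):
        parts.append(f"and a {attrs['facial_hair']}")
    parts.append(f"showing a {attrs['emotion']} expression")
    return "A photo of " + " ".join(parts) + "."

attrs = {"age": 34, "gender": "man", "ethnicity": "East Asian",
         "hair_color": "black", "hair_style": "short", "facial_hair": "beard",
         "emotion": "neutral"}
print(build_appearance_caption(attrs))
# A photo of a 34-year-old East Asian man with black short hair and a beard showing a neutral expression.
```

Pairing such appearance-focused captions with face images is what shifts the fine-tuned model's conditioning from scene context toward the person's actual appearance.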
2405.10832 Report Open-Vocabulary Spatio-Temporal Action Detection Tao Wu, Shuqiu Ge, Jie Qin, Gangshan Wu, Limin Wang Spatio-temporal action detection (STAD) is an important fine-grained video understanding task. Current methods require box and label supervision for all action classes in advance. However, in real-world applications, it is very likely to come across new action classes not seen in training because the action category space is large and hard to enumerate. Also, the cost of data annotation and model training for new classes is extremely high for traditional methods, as we need to perform detailed box annotations and re-train the whole network from scratch. In this paper, we propose a new challenging setting by performing open-vocabulary STAD to better mimic the situation of action detection in an open world. Open-vocabulary spatio-temporal action detection (OV-STAD) requires training a model on a limited set of base classes with box and label supervision, which is expected to yield good generalization performance on novel action classes. For OV-STAD, we build two benchmarks based on the existing STAD datasets and propose a simple but effective method based on pretrained video-language models (VLM). To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs. This customized fine-tuning endows the VLM with better motion understanding, thus contributing to a more accurate alignment between video regions and texts. Local region feature and global video feature fusion before alignment is adopted to further improve the action detection performance by providing global context. Our method achieves a promising performance on novel classes. This paper proposes a new setting for open-vocabulary spatio-temporal action detection (OV-STAD) and introduces a simple yet effective method using pretrained video-language models (VLMs). OV-STAD addresses the limitations of traditional STAD methods that require extensive box annotations and retraining for new action classes, making it more practical for real-world applications with a vast and dynamic action space. The method leverages a pretrained VLM fine-tuned on video region-text pairs to enhance local feature representation for action recognition. It also incorporates global video features for improved alignment and overfitting mitigation. The proposed method achieves promising results on novel classes for OV-STAD. Video region-text alignment pretraining significantly enhances the model's capability for recognizing unseen action classes. Fusing global and local video features effectively improves the alignment between visual features and action prompts, benefiting action recognition. The performance on the AVA dataset is limited, potentially due to the atomic nature of actions and reliance on object/scene cues in pretraining. The method relies on an external human detector, which might introduce errors and limit the overall performance. spatio-temporal action detection, open vocabulary learning, video-language models, region-text alignment, zero-shot learning
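The region-text alignment step can be sketched as CLIP-style open-vocabulary scoring: fuse each detected actor's region feature with the global video feature, then compare against text embeddings of action prompts. The fusion weight and feature shapes below are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def classify_regions_open_vocab(region_feats, global_feat, text_feats, fuse=0.5):
    """Open-vocabulary action scoring.
    region_feats: (N, D) pooled features for N detected actors
    global_feat:  (D,)   video-level feature providing scene/motion context
    text_feats:   (K, D) embeddings of K action-class prompts"""
    fused = F.normalize(fuse * region_feats + (1 - fuse) * global_feat, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = fused @ text.t()                     # (N, K) cosine similarities
    return logits.softmax(dim=-1)

scores = classify_regions_open_vocab(torch.randn(3, 512), torch.randn(512),
                                     torch.randn(10, 512))
print(scores.shape)   # torch.Size([3, 10]) -- per-actor scores over 10 prompts
```

Because the class set only enters through the text prompts, novel action categories can be scored at test time without retraining the detector.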
2405.10674 Report From Sora What We Can See: A Survey of Text-to-Video Generation Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan With impressive achievements made, artificial intelligence is on the path forward to artificial general intelligence. Sora, developed by OpenAI and capable of minute-level world-simulative abilities, can be considered a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we embark from the perspective of disassembling Sora in text-to-video generation, conducting a comprehensive review of the literature and trying to answer the question, "From Sora What We Can See". Specifically, after basic preliminaries regarding the general algorithms are introduced, the literature is categorized from three mutually perpendicular dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, the widely used datasets and metrics are organized in detail. Last and most importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development. This paper presents a comprehensive survey of text-to-video (T2V) generation, offering a structured analysis of current research inspired by the capabilities of OpenAI's Sora. Sora represents a significant leap in T2V technology, demonstrating the potential for generating realistic and imaginative videos from textual descriptions, thus necessitating a focused review of this rapidly evolving field. The authors categorize T2V generation techniques based on the evolution of generative models (GAN/VAE, autoregressive, diffusion-based), essential video qualities (duration, resolution, quality), and realism components (motion, scenes, objects, layout). They also review commonly used datasets and evaluation metrics. Sora, while advanced, still exhibits limitations in generating realistic motion, consistent object appearances, and accurate physical interactions, highlighting ongoing challenges in T2V research. The survey identifies key areas for future development, including robot learning from visual assistance, infinite 3D scene reconstruction, augmented digital twins, and the establishment of ethical and normative frameworks for AI applications. Existing T2V techniques have achieved significant progress in generating longer, higher-resolution, and smoother videos, but challenges remain in seamlessly integrating complex elements and ensuring realism. The survey primarily focuses on Sora's capabilities, potentially overlooking advancements in other T2V models. The rapid evolution of the field may lead to new breakthroughs and challenges not fully addressed in the current review. text-to-video generation, sora, diffusion models, generative ai, video synthesis
2405.10577 Report DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection Zhe Huang, Yizhe Zhao, Hao Xiao, Chenyan Wu, Lingting Ge Recent advances in multi-view camera-only 3D object detection either rely on an accurate reconstruction of bird's-eye-view (BEV) 3D features or on traditional 2D perspective view (PV) image features. While both have their own pros and cons, few have found a way to stitch them together in order to benefit from "the best of both worlds". To this end, we explore a duo space (i.e., BEV and PV) 3D perception framework, in conjunction with some useful duo space fusion strategies that allow effective aggregation of the two feature representations. To the best of our knowledge, our proposed method, DuoSpaceNet, is the first to leverage two distinct feature spaces and achieves the state-of-the-art 3D object detection and BEV map segmentation results on nuScenes dataset. DuoSpaceNet, a novel camera-based 3D perception framework for autonomous driving, leverages both bird's-eye-view (BEV) and perspective-view (PV) features to enhance 3D object detection and map segmentation. Existing methods rely on either BEV or PV features, each with limitations. DuoSpaceNet bridges the gap, combining strengths of both representations for superior performance. DuoSpaceNet uses a duo space decoder with space-specific cross-attention layers to process and fuse BEV and PV features. It employs feature divergence enhancement for inter-space distinctiveness and a novel temporal modeling method for multi-frame settings. Achieves state-of-the-art 3D object detection results on nuScenes dataset, outperforming both BEV-based and PV-based methods. Demonstrates superior map segmentation performance, achieving highest IoU for drivable area and lane boundaries. Ablation studies confirm the effectiveness of each proposed component, highlighting the synergy of duo space features, feature divergence enhancement, and temporal modeling. Computational cost of feature divergence enhancement can be high. Long-range detection capabilities are not fully explored due to the limitations of current datasets. 3d object detection, autonomous driving, multi-view perception, "birds-eye-view (bev)", perspective view (pv)
2405.10508 Report ART3D: 3D Gaussian Splatting for Text-Guided Artistic Scenes Generation Pengzhi Li, Chengshuai Tang, Qinxuan Huang, Zhiheng Li In this paper, we explore the existing challenges in 3D artistic scene generation by introducing ART3D, a novel framework that combines diffusion models and 3D Gaussian splatting techniques. Our method effectively bridges the gap between artistic and realistic images through an innovative image semantic transfer algorithm. By leveraging depth information and an initial artistic image, we generate a point cloud map, addressing domain differences. Additionally, we propose a depth consistency module to enhance 3D scene consistency. Finally, the 3D scene serves as initial points for optimizing Gaussian splats. Experimental results demonstrate ART3D's superior performance in both content and structural consistency metrics when compared to existing methods. ART3D significantly advances the field of AI in art creation by providing an innovative solution for generating high-quality 3D artistic scenes. Introduces ART3D, a novel framework for generating high-quality 3D artistic scenes from text descriptions or reference images using diffusion models and 3D Gaussian splatting. Addresses the limitations of existing 3D art creation methods, particularly in bridging the domain gap between artistic and realistic images and ensuring global scene consistency. Employs an image semantic transfer algorithm to align the semantic information of artistic and realistic images, enabling accurate depth estimation. Uses a depth consistency module to enhance the consistency of the point cloud map across different views. Finally, optimizes a 3D Gaussian splatting representation for high-quality rendering. Generates 3D artistic scenes with superior style consistency and continuity compared to existing methods. Effectively addresses the domain gap between artistic and realistic images, enabling accurate depth estimation and 3D reconstruction. Demonstrates improved global scene consistency through the depth consistency module, resulting in more coherent and visually appealing 3D scenes. Relies on monocular depth estimation, which may have limitations in capturing complex scene geometry. Limited exploration of dynamic scene generation. 3d scene generation, diffusion models, gaussian splatting, ai art, text-to-3d
2405.10370 Report Grounded 3D-LLM with Referent Tokens Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data. It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates. To facilitate the use of referent tokens in subsequent language modeling, we have curated large-scale grounded language datasets that offer finer scene-text correspondence at the phrase level by bootstrapping existing object labels. Subsequently, we introduced Contrastive LAnguage-Scene Pre-training (CLASP) to effectively leverage this data, thereby integrating 3D vision with language models. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D QA, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and the broad applicability of Grounded 3D-LLM. Code and datasets will be released on the project page: https://groundedscenellm.github.io/grounded_3d-llm.github.io. This paper introduces Grounded 3D-LLM, a novel framework that uses "referent tokens" to represent scene regions or object features, enabling the integration of diverse 3D vision tasks within a unified generative language modeling framework. Existing 3D scene understanding models are often task-specific and lack generalizability. Grounded 3D-LLM addresses this limitation by offering a unified approach to handle various tasks, such as object detection, grounding, captioning, and question answering, within a single model. The proposed framework utilizes two main steps: (1) Contrastive Language-Scene Pre-training (CLASP) aligns point cloud features with textual phrases at a granular level. (2) Multi-task instruction tuning, incorporating "referent tokens," enables the model to perform diverse 3D vision tasks based on textual instructions. Grounded 3D-LLM outperforms previous generative models in most 3D vision tasks, showcasing its potential as a unified framework. CLASP demonstrates superior performance in 3D grounding and detection benchmarks, highlighting its ability to align textual phrases with 3D scene regions effectively. Automated generation of a large-scale, grounded language dataset, G-SceneCap, contributes to the model's performance and offers a valuable resource for future research. While promising, Grounded 3D-LLM shows performance gaps compared to the pre-trained CLASP, suggesting further improvement in bridging discriminative and generative approaches. The model primarily focuses on indoor scenarios and may exhibit limitations in handling complex real-world environments or generating entirely accurate language outputs. 3d vision, large language models, vision-language models, scene understanding, generative modeling
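The CLASP pre-training stage in the Grounded 3D-LLM entry above contrasts point-cloud region features with phrase embeddings. The sketch below is a generic symmetric InfoNCE formulation assuming one matched phrase per region; the paper's actual objective and matching scheme may be more elaborate.

```python
import torch
import torch.nn.functional as F

def clasp_style_loss(region_feats, phrase_feats, temperature=0.07):
    """Symmetric InfoNCE over matched region/phrase pairs.

    region_feats, phrase_feats: (N, D); row i of each forms a positive pair.
    This is a generic contrastive sketch, not the paper's exact objective.
    """
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    logits = r @ p.t() / temperature
    targets = torch.arange(len(r), device=r.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    loss = clasp_style_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(float(loss))
```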
2405.10320 Report Toon3D: Seeing Cartoons from a New Perspective Ethan Weber, Riley Peterlinz, Rohan Mathur, Frederik Warburg, Alexei A. Efros, Angjoo Kanazawa In this work, we recover the underlying 3D structure of non-geometrically consistent scenes. We focus our analysis on hand-drawn images from cartoons and anime. Many cartoons are created by artists without a 3D rendering engine, which means that any new image of a scene is hand-drawn. The hand-drawn images are usually faithful representations of the world, but only in a qualitative sense, since it is difficult for humans to draw multiple perspectives of an object or scene 3D consistently. Nevertheless, people can easily perceive 3D scenes from inconsistent inputs! In this work, we correct for 2D drawing inconsistencies to recover a plausible 3D structure such that the newly warped drawings are consistent with each other. Our pipeline consists of a user-friendly annotation tool, camera pose estimation, and image deformation to recover a dense structure. Our method warps images to obey a perspective camera model, enabling our aligned results to be plugged into novel-view synthesis reconstruction methods to experience cartoons from viewpoints never drawn before. Our project page is https://toon3d.studio . Presents Toon3D, a pipeline for reconstructing the 3D structure of non-geometrically consistent scenes, focusing on hand-drawn images from cartoons and anime. Addresses the challenge of reconstructing 3D from hand-drawn images that lack geometric consistency, a problem that traditional SfM pipelines struggle with. Uses a three-step process: (1) sparse alignment of user-annotated correspondences backprojected using monocular depth, (2) dense alignment with 2D image and 3D depth warping, and (3) refinement using Gaussian Splatting. Successfully recovers camera poses and dense 3D structure from various cartoon scenes, enabling novel view synthesis. Reveals geometric inconsistencies in hand-drawn images through the process of warping them to fit a perspective camera model. Demonstrates applicability beyond cartoons by reconstructing scenes from sparse photo collections and paintings. Reliance on accurate user-provided correspondences and depth predictions. Limited exploration of end-to-end learning-based approaches for cartoon reconstruction. 3d reconstruction, non-geometric modeling, cartoon analysis, sparse-view reconstruction, image warping
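The first Toon3D step, backprojecting user-annotated correspondences with monocular depth, reduces to a pinhole-camera lift. The intrinsics and depths below are placeholders; the full pipeline additionally optimizes camera poses and warps the drawings, which is not shown.

```python
import numpy as np

def backproject(points_2d, depth, fx, fy, cx, cy):
    """Lift annotated 2D correspondences into 3D camera coordinates using a
    pinhole model and per-point monocular depth.

    points_2d: (N, 2) pixel coordinates (u, v)
    depth:     (N,)   predicted depth at each point
    """
    u, v = points_2d[:, 0], points_2d[:, 1]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

if __name__ == "__main__":
    pts = np.array([[320.0, 240.0], [100.0, 50.0]])
    print(backproject(pts, np.array([2.0, 3.5]), fx=500, fy=500, cx=320, cy=240))
```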
2405.10317 Report Text-to-Vector Generation with Neural Path Representation Peiying Zhang, Nanxuan Zhao, Jing Liao Vector graphics are widely used in digital art and highly favored by designers due to their scalability and layer-wise properties. However, the process of creating and editing vector graphics requires creativity and design expertise, making it a time-consuming task. Recent advancements in text-to-vector (T2V) generation have aimed to make this process more accessible. However, existing T2V methods directly optimize control points of vector graphics paths, often resulting in intersecting or jagged paths due to the lack of geometry constraints. To overcome these limitations, we propose a novel neural path representation by designing a dual-branch Variational Autoencoder (VAE) that learns the path latent space from both sequence and image modalities. By optimizing the combination of neural paths, we can incorporate geometric constraints while preserving expressivity in generated SVGs. Furthermore, we introduce a two-stage path optimization method to improve the visual and topological quality of generated SVGs. In the first stage, a pre-trained text-to-image diffusion model guides the initial generation of complex vector graphics through the Variational Score Distillation (VSD) process. In the second stage, we refine the graphics using a layer-wise image vectorization strategy to achieve clearer elements and structure. We demonstrate the effectiveness of our method through extensive experiments and showcase various applications. The project page is https://intchous.github.io/T2V-NPR. This paper presents a novel text-to-vector (T2V) generation pipeline that generates high-quality vector graphics from text prompts, ensuring geometric regularity and layer-wise structure. Existing T2V methods either rely on image vectorization of raster T2I results, leading to complex and inaccurate vectors, or directly optimize control points, resulting in intersecting and jagged paths. This work addresses these limitations. The method uses a dual-branch VAE to learn a neural path representation capturing geometric properties. A two-stage optimization process then refines a set of paths: first with VSD based on a pre-trained diffusion model for text alignment, and then with a layer-wise strategy for clarity and structure. The method outperforms existing approaches in generating high-quality and diverse vector graphics with valid paths and layer properties. It offers control over details and style, and enables applications like SVG customization, image-to-SVG generation, and animation. User study confirms its superiority in overall SVG quality and alignment with text prompts. The method's reliance on diffusion models may lead to inaccuracies in representing highly detailed prompts. The current path latent space struggles to capture intricate boundaries, leading to over-simplification of complex shapes. vector graphics, svg, text-to-vector generation, diffusion model, neural path representation
2405.10316 Report Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model Zheng Gu, Shiyuan Yang, Jing Liao, Jing Huo, Yang Gao Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively. This paper introduces Analogist, a training-free, inference-based visual in-context learning approach that combines visual and textual prompting on top of a text-to-image diffusion model pretrained for image inpainting. Training-based visual ICL struggles to generalize to unseen tasks and requires collecting diverse task datasets, while existing inference-based methods rely solely on textual prompts and miss fine-grained contextual information in the example pairs. For visual prompting, a self-attention cloning (SAC) operation guides fine-grained structural-level analogy between the example images; for textual prompting, GPT-4V generates text prompts and a cross-attention masking (CAM) operation improves the accuracy of the semantic-level analogy. The method works out of the box, with no fine-tuning or optimization. It is generic and flexible, supporting a wide range of visual tasks in an in-context manner. Extensive experiments show that it outperforms existing approaches both qualitatively and quantitatively. The quality of the textual prompts depends on GPT-4V's visual reasoning capability. Overall performance is bounded by the pretrained inpainting diffusion model, which is used without any task-specific tuning. visual in-context learning, diffusion models, image inpainting, self-attention, analogical reasoning
2405.10314 Report CAT3D: Create Anything in 3D with Multi-View Diffusion Models Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, Ben Poole Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation. See our project page for results and interactive demos at https://cat3d.github.io . CAT3D is a method for creating 3D scenes from any number of generated or real images by simulating a real-world capture process with a multi-view diffusion model. Creating 3D content typically requires dense multi-view capture, which is time-consuming and limits accessibility. CAT3D enables 3D creation from limited input, such as a single image or text prompt. CAT3D first generates consistent novel views from input images using a multi-view diffusion model. These views are then used for robust 3D reconstruction with a modified NeRF pipeline. CAT3D produces high-quality 3D scenes in as little as one minute. It outperforms existing methods for single-image and few-view 3D scene creation on multiple benchmarks. The method effectively handles various input modalities, including text prompts, single images, and sparse multi-view captures. The trained model cannot handle cases with varying camera intrinsics across input views. Generation quality depends on the expressivity of the base text-to-image model, potentially limiting performance on out-of-distribution content. 3d reconstruction, novel view synthesis, diffusion models, multi-view consistency, nerf
2405.10305 Report 4D Panoptic Scene Graph Generation Jingkang Yang, Jun Cen, Wenxuan Peng, Shuai Liu, Fangzhou Hong, Xiangtai Li, Kaiyang Zhou, Qifeng Chen, Ziwei Liu We are living in a three-dimensional space while moving forward through a fourth dimension: time. To allow artificial intelligence to develop a comprehensive understanding of such a 4D environment, we introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding. Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations. To facilitate research in this new area, we build a richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of 1M frames, each of which is labeled with 4D panoptic segmentation masks as well as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs via a relation component. Extensive experiments on the new dataset show that our method can serve as a strong baseline for future research on PSG-4D. In the end, we provide a real-world application example to demonstrate how we can achieve dynamic scene understanding by integrating a large language model into our PSG-4D system. This paper introduces 4D Panoptic Scene Graph (PSG-4D), a novel representation bridging raw visual data in dynamic environments with high-level visual understanding by abstracting sensory data into nodes (entities with location and status) and edges (temporal relations). Current scene understanding methods lack the ability to integrate dynamic, spatio-temporal relationships crucial for AI agents to interact with the real world. PSG-4D aims to overcome this by capturing the dynamic 4D nature of the environment. The authors propose PSG4DFormer, a two-stage framework. Stage 1 performs 4D panoptic segmentation, tracking objects over time. Stage 2 leverages a spatial-temporal transformer to model relations between tracked objects and generate the 4D scene graph. RGB-D video sequences as input generally yield better results than point cloud sequences. Incorporating depth information significantly improves performance in 4D scene graph generation. Temporal attention is crucial for capturing the dynamic relationships between objects in the scene. Current models are limited to simple scenes and struggle with complex real-world environments. There is a need for larger and more diverse datasets for training and evaluation of 4D scene graph generation models. 4d scene understanding, scene graph generation, panoptic segmentation, spatial-temporal transformer, robot vision
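The node/edge abstraction in the PSG-4D entry above can be made concrete with a small data structure: nodes are tracked entities with per-frame masks, edges are temporal relations over a frame span. The field names are assumptions chosen for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    """An entity tracked through a 4D (3D + time) scene."""
    track_id: int
    category: str
    frames: List[int] = field(default_factory=list)  # frames where a segmentation mask exists

@dataclass
class Edge:
    """A temporal relation between two tracked entities over a frame span."""
    subject_id: int
    object_id: int
    relation: str
    span: Tuple[int, int]  # (start_frame, end_frame)

@dataclass
class SceneGraph4D:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

if __name__ == "__main__":
    g = SceneGraph4D(
        nodes=[Node(0, "person", [0, 1, 2]), Node(1, "cup", [0, 1, 2])],
        edges=[Edge(0, 1, "holding", (1, 2))],
    )
    print(len(g.nodes), "nodes,", len(g.edges), "edges")
```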
2405.10300 Report Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API The paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models including a high-performance model (Grounding DINO 1.5 Pro) and an efficient model optimized for edge devices (Grounding DINO 1.5 Edge). The models aim to advance the state-of-the-art in open-set object detection, providing stronger generalization and faster inference speed for wider real-world application. Grounding DINO 1.5 leverages a dual-encoder-single-decoder structure, incorporating a larger Vision Transformer backbone (ViT-L) for the Pro model and an efficient feature enhancer for the Edge model. Both models are trained on a large-scale dataset (Grounding-20M) with over 20 million images and grounding annotations. Grounding DINO 1.5 Pro achieves state-of-the-art performance on COCO and LVIS zero-shot benchmarks, surpassing previous methods significantly. Grounding DINO 1.5 Edge, optimized with TensorRT, reaches a speed of 75.2 FPS while attaining a competitive zero-shot performance of 36.2 AP on LVIS-minival, demonstrating its suitability for edge computing. Both models showcase robust detection capabilities in various scenarios, including common object detection, long-tailed object detection, dense object detection, and video object detection. The paper acknowledges limitations in the quality of category names within the ODinW benchmark. Future work could explore the model's capabilities in real-time video object detection and further optimize its performance on edge devices with more limited computational resources. open-set object detection, grounding dino, vision transformer, edge computing, zero-shot learning
2405.10185 Report DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data Chengxiang Fan, Muzhi Zhu, Hao Chen, Yang Liu, Weijia Wu, Huaqi Zhang, Chunhua Shen Instance segmentation is data-hungry, and as model capacity increases, data scale becomes crucial for improving the accuracy. Most instance segmentation datasets today require costly manual annotation, limiting their data scale. Models trained on such data are prone to overfitting on the training set, especially for those rare categories. While recent works have delved into exploiting generative models to create synthetic datasets for data augmentation, these approaches do not efficiently harness the full potential of generative models. To address these issues, we introduce a more efficient strategy to construct generative datasets for data augmentation, termed DiverGen. Firstly, we provide an explanation of the role of generative data from the perspective of distribution discrepancy. We investigate the impact of different data on the distribution learned by the model. We argue that generative data can expand the data distribution that the model can learn, thus mitigating overfitting. Additionally, we find that the diversity of generative data is crucial for improving model performance and enhance it through various strategies, including category diversity, prompt diversity, and generative model diversity. With these strategies, we can scale the data to millions while maintaining the trend of model performance improvement. On the LVIS dataset, DiverGen significantly outperforms the strong model X-Paste, achieving +1.1 box AP and +1.1 mask AP across all categories, and +1.9 box AP and +2.5 mask AP for rare categories. This paper proposes DiverGen, an efficient strategy for constructing generative datasets to augment instance segmentation datasets and enhance model performance. Instance segmentation models are data-hungry and existing datasets are limited by costly manual annotation. While generative models offer a solution, current methods don't fully utilize their potential or address the distribution discrepancy between real and generative data. The paper analyzes the role of generative data from a distribution discrepancy perspective, finding that it expands the data distribution learnable by the model and mitigates overfitting. It proposes DiverGen, enhancing data diversity via category diversity (using LVIS and ImageNet categories), prompt diversity (ChatGPT generated prompts), and generative model diversity (Stable Diffusion and DeepFloyd-IF). It also optimizes the generation pipeline with SAM-background annotation and CLIP inter-similarity filtration. Data diversity is more crucial than quantity for generative data augmentation. DiverGen outperforms previous methods, including X-Paste, on the LVIS dataset, demonstrating significant improvement in box and mask AP, particularly for rare categories. Ablation studies validate the effectiveness of individual components like category diversity, prompt diversity, generative model diversity, SAM-background, and CLIP inter-similarity. The improvement from using extra categories plateaus and even declines slightly with too many, suggesting a balance is needed. The computational cost of using ChatGPT for prompt generation is a limitation, addressed by applying it to a subset of categories. instance segmentation, generative data augmentation, data diversity, distribution discrepancy, long-tailed recognition
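The CLIP inter-similarity filtration mentioned in the DiverGen entry above can be sketched as follows. The exact filtering rule is not given in this summary, so the sketch assumes a simple threshold on each generated image's mean cosine similarity to the other images of its category; the embeddings themselves would come from a CLIP image encoder.

```python
import torch
import torch.nn.functional as F

def inter_similarity_filter(clip_feats, keep_threshold=0.25):
    """Filter generated images of one category by CLIP inter-similarity.

    clip_feats: (N, D) CLIP image embeddings of generated samples for one category.
    Keeps samples whose mean cosine similarity to the remaining samples exceeds a
    threshold. The thresholding rule is an assumption for illustration.
    """
    feats = F.normalize(clip_feats, dim=-1)
    sim = feats @ feats.t()
    n = sim.size(0)
    mean_sim = (sim.sum(dim=1) - sim.diag()) / (n - 1)  # exclude self-similarity
    return mean_sim >= keep_threshold

if __name__ == "__main__":
    keep = inter_similarity_filter(torch.randn(16, 512))
    print(int(keep.sum()), "of 16 samples kept")
```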
2405.10140 Report Libra: Building Decoupled Vision System on Large Language Models Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu In this work, we introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM). The decoupled vision system decouples inner-modal modeling and cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM to route the vision and language flows during attention computing to enable different attention patterns in inner-modal modeling and cross-modal interaction scenarios. Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training data, providing a new perspective for future multimodal foundation models. Code is available at https://github.com/YifanXu74/Libra. This paper introduces Libra, a new multimodal large language model (MLLM) that utilizes a decoupled vision system built upon a large language model (LLM). This approach separates inner-modal modeling from cross-modal interaction, leading to a more effective vision system for LLMs. Existing MLLMs often struggle with balancing the vast knowledge capacity of LLMs with the complexities of visual understanding. This work addresses this challenge by proposing a novel vision system design specifically tailored for LLMs. Libra employs a routed visual expert with a cross-modal bridge module. The visual expert, integrated into a pretrained LLM, allows for separate processing of visual and language information. The cross-modal bridge facilitates interaction between these modalities during attention computations. Libra is trained using discrete auto-regressive modeling with a hybrid image tokenization strategy that leverages contiguous visual signals and pretrained visual knowledge from a CLIP-based image tokenizer. Libra achieves strong performance on a variety of vision-language tasks, including visual question answering and image captioning, outperforming several larger models despite using less training data. Analysis of Libra's attention patterns reveals increased diversity across layers compared to traditional MLLMs, indicating reduced learning redundancy and improved cross-modal comprehension. The decoupled vision system in Libra exhibits strong performance on benchmarks designed to detect CLIP bias, highlighting its ability to learn unique visual representations beyond simple modality alignment. The routed visual expert introduces new attention mechanisms not yet fully supported by existing acceleration frameworks, limiting training and inference efficiency. As Libra's design is based on pretrained LLMs, it inherits limitations associated with these models, including potential hallucinations and difficulties in handling long sequences. multimodal large language model, decoupled vision system, vision-language comprehension, discrete auto-regressive modeling, cross-modal interaction
2405.10053 Report SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection Mingxuan Liu, Tyler L. Hayes, Elisa Ricci, Gabriela Csurka, Riccardo Volpi Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during inference. However, our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities, posing a concern for real-world deployment. To this end, we introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies, while retaining improvements using hierarchies generated by large language models. Moreover, when applied to open-vocabulary classification on ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector, without incurring additional computational overhead during inference. The code is open source. This paper introduces SHiNe, a novel training-free classifier that leverages semantic hierarchies to enhance the robustness of open-vocabulary object detectors (OVOD) to diverse vocabulary granularities. Existing OVOD detectors show significant performance variability when handling vocabularies with different semantic granularities, posing challenges for real-world deployment. SHiNe retrieves super-/sub-categories from a hierarchy for each target class, integrates them into hierarchy-aware sentences using an 'Is-A' connector, and fuses their embeddings to generate a 'nexus' classifier vector. SHiNe consistently improves performance across various vocabulary granularities on iNat and FSOD datasets, with gains up to +31.9% in mAP50. It operates effectively with both ground-truth and LLM-generated hierarchies. SHiNe generalizes to other OVOD detectors and shows resilience to mis-specified vocabularies. The performance gain with LLM-generated hierarchies, while significant, is not as substantial as with ground-truth hierarchies. Future work includes exploring alternative hierarchy generation methods and extending SHiNe to other open-vocabulary tasks like segmentation. open-vocabulary object detection, semantic hierarchy, robustness, zero-shot learning, vision-language models
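The three offline SHiNe steps above (retrieve super-/sub-categories, build hierarchy-aware 'Is-A' sentences, fuse their embeddings into a nexus vector) map naturally onto a short sketch. The sentence template and mean-pooling fusion are assumptions, and the text encoder is passed in as a callable rather than tied to a specific model.

```python
import torch
import torch.nn.functional as F

def build_nexus_classifier(target, supers, subs, encode_text):
    """Build a SHiNe-style classifier vector for one target class.

    target: class name; supers/subs: related categories from a hierarchy;
    encode_text: callable mapping a list of strings to (N, D) embeddings
    (e.g., a CLIP-style text encoder). The 'Is-A' template and mean fusion
    are illustrative assumptions.
    """
    sentences = [f"a {target}, which is a {s}" for s in supers]
    sentences += [f"a {c}, which is a {target}" for c in subs]
    sentences += [f"a {target}"]
    emb = F.normalize(encode_text(sentences), dim=-1)
    return F.normalize(emb.mean(dim=0), dim=-1)  # fused "nexus" classifier vector

if __name__ == "__main__":
    fake_encoder = lambda texts: torch.randn(len(texts), 512)  # stand-in for a real encoder
    vec = build_nexus_classifier("retriever", ["dog", "animal"], ["golden retriever"], fake_encoder)
    print(vec.shape)
```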
2405.09879 Report Generative Unlearning for Any Identity Juwon Seo, Sung-Hoon Lee, Tae-Young Lee, Seungjun Moon, Gyeong-Moon Park Recent advances in generative models trained on large-scale datasets have made it possible to synthesize high-quality samples across various domains. Moreover, the emergence of strong inversion networks enables not only a reconstruction of real-world images but also the modification of attributes through various editing methods. However, in certain domains related to privacy issues, e.g., human faces, advanced generative models along with strong inversion methods can lead to potential misuses. In this paper, we propose an essential yet under-explored task called generative identity unlearning, which steers the model not to generate an image of a specific identity. In the generative identity unlearning, we target the following objectives: (i) preventing the generation of images with a certain identity, and (ii) preserving the overall quality of the generative model. To satisfy these goals, we propose a novel framework, Generative Unlearning for Any Identity (GUIDE), which prevents the reconstruction of a specific identity by unlearning the generator with only a single image. GUIDE consists of two parts: (i) finding a target point for optimization that un-identifies the source latent code and (ii) novel loss functions that facilitate the unlearning procedure while less affecting the learned distribution. Our extensive experiments demonstrate that our proposed method achieves state-of-the-art performance in the generative machine unlearning task. The code is available at https://github.com/KHU-AGI/GUIDE. The paper introduces Generative Unlearning for Any IDEntity (GUIDE), a novel framework designed to remove the identity information associated with a single source image from pre-trained 2D or 3D GANs, addressing privacy concerns in generative models. The advancement of GANs and inversion networks enables high-quality image synthesis and manipulation, raising privacy concerns as they can be misused to reconstruct and exploit individual identities even if the specific identity wasn't in the training data. GUIDE consists of two main components: (1) Un-identifying Face On Latent Space (UFO), which identifies a suitable target latent code by extrapolating from the source latent code away from the average latent code, encouraging a distinct identity shift. (2) Latent Target Unlearning (LTU) utilizes three novel loss functions: local unlearning loss for direct identity shift, adjacency-aware unlearning loss for unlearning the entire identity neighborhood, and global preservation loss to maintain the generator's overall performance. GUIDE effectively removes identities from pre-trained GANs, even for unseen, out-of-domain images. The adjacency-aware unlearning loss in GUIDE successfully generalizes identity removal to unseen images with the same identity. The global preservation loss effectively minimizes the distribution shift in generated images, preserving the overall quality and performance of the pre-trained GAN. The paper mainly focuses on face identity removal and is evaluated on face datasets. Further research is needed to generalize GUIDE for unlearning any identity in broader domains. The current implementation requires fine-tuning the pre-trained generator for each identity to be removed. Exploring more efficient unlearning strategies without modifying the generator is a potential future direction. 
generative adversarial networks, machine unlearning, privacy protection, identity removal, image synthesis
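For GUIDE above, the UFO step selects an unlearning target by extrapolating the source latent away from the average latent. A plausible minimal form is shown below; the extrapolation coefficient and exact parameterization are assumptions, not the paper's precise formulation.

```python
import torch

def unidentify_latent(w_src, w_avg, strength=1.5):
    """Push a source latent code away from the average latent to weaken its identity.

    w_src, w_avg: latent codes of matching shape (e.g., StyleGAN W-space vectors).
    strength > 1 extrapolates beyond the source code, moving it further from the
    mean; GUIDE's actual coefficient and target construction may differ.
    """
    direction = w_src - w_avg
    return w_avg + strength * direction

if __name__ == "__main__":
    w_avg, w_src = torch.zeros(512), torch.randn(512)
    w_tgt = unidentify_latent(w_src, w_avg, strength=1.5)
    print(bool(torch.norm(w_tgt - w_avg) > torch.norm(w_src - w_avg)))  # True
```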
2405.09874 Report Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion Xinyang Li, Zhangyu Lai, Linning Xu, Jianfei Guo, Liujuan Cao, Shengchuan Zhang, Bo Dai, Rongrong Ji We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from texts in only 1 minute. The key component is a dual-mode multi-view latent diffusion model. Given the noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model to circumvent the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose the dual-mode toggling inference strategy to use only 1/10 denoising steps with 3D mode, successfully generating a 3D asset in just 10 seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at https://dual3d.github.io This paper presents Dual3D, a novel text-to-3D generation framework that produces high-quality 3D assets from text descriptions in just one minute. This research is important because it addresses the limitations of existing text-to-3D generation methods, which often suffer from slow generation speed, high training costs, and a lack of 3D consistency. The key component of Dual3D is a dual-mode multi-view latent diffusion model. This model leverages a pre-trained 2D latent diffusion model (LDM) and is trained on multi-view image data. It employs a dual-mode toggling inference strategy, switching between 2D and 3D modes to balance generation speed and 3D consistency. Furthermore, an efficient texture refinement process enhances the realism of the generated 3D assets. Significantly faster generation time (under a minute) compared to optimization-based methods while maintaining high quality. Achieves state-of-the-art performance in text alignment and aesthetic quality, as evidenced by CLIP Score and user studies. Demonstrates robust generalization capabilities, generating diverse assets from the same text prompt and handling fine-grained semantic variations. Limited ability to generate scenes with multiple interacting objects or highly complex geometries due to reliance on single-object multi-view data and mesh rendering during refinement. Potential for future improvement by incorporating more diverse multi-view datasets, exploring more efficient 3D representations, and investigating parameter-efficient fine-tuning methods. text-to-3d generation, latent diffusion models, multi-view diffusion, 3d neural rendering, texture refinement
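Dual3D's dual-mode toggling runs the expensive rendering-based 3D mode on only about 1/10 of the denoising steps. One simple way to realize such a schedule is sketched below; the even spacing of the 3D-mode steps is an assumption for illustration and may not match the paper's schedule.

```python
def toggle_schedule(num_steps: int, ratio: float = 0.1):
    """Mark which denoising steps use the rendering-based 3D mode.

    Spreads roughly `ratio * num_steps` 3D-mode steps evenly across the
    trajectory (always including the final step); the remaining steps use the
    cheap 2D mode. The even spacing is an illustrative assumption.
    """
    num_3d = max(1, round(num_steps * ratio))
    stride = num_steps / num_3d
    use_3d = [False] * num_steps
    for k in range(num_3d):
        use_3d[min(num_steps - 1, round((k + 1) * stride) - 1)] = True
    return use_3d

if __name__ == "__main__":
    sched = toggle_schedule(50, 0.1)
    print(sum(sched), "of", len(sched), "steps run in 3D mode")
```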
2405.09818 Report Chameleon: Mixed-Modal Early-Fusion Foundation Models Chameleon Team We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents. This paper introduces Chameleon, a family of early-fusion, token-based mixed-modal foundation models that can reason over and generate interleaved image-text documents. Chameleon aims to address the limitations of existing multimodal models that often process modalities separately, hindering their ability to fully integrate information and generate complex multimodal content. Chameleon represents both images and text as discrete tokens within a unified transformer architecture. It is trained from scratch on a massive dataset of interleaved text and image tokens (around 10 trillion). The authors also introduce architectural innovations and training techniques to overcome the challenges of stable and scalable training in this early-fusion setting. Chameleon achieves state-of-the-art performance on various vision-language benchmarks, including image captioning and visual question answering, while maintaining competitive performance on text-only tasks. Human evaluations show that Chameleon outperforms strong baselines like Gemini-Pro and GPT-4V in generating mixed-modal responses to open-ended prompts. Chameleon demonstrates new capabilities in mixed-modal reasoning and generation, effectively handling prompts that require interleaving text and images in its responses. The evaluation prompts used, while diverse, were crowdsourced and might not fully represent real user interactions. The absence of other native mixed-modal models limits the scope of comparative evaluation for Chameleon's novel capabilities. multimodal learning, foundation models, tokenization, vision-language tasks, early fusion
2405.09717 Report From NeRFs to Gaussian Splats, and Back Siming He, Zach Osman, Pratik Chaudhari For robotics applications where there is a limited number of (typically ego-centric) views, parametric representations such as neural radiance fields (NeRFs) generalize better than non-parametric ones such as Gaussian splatting (GS) to views that are very different from those in the training data; GS however can render much faster than NeRFs. We develop a procedure to convert back and forth between the two. Our approach achieves the best of both NeRFs (superior PSNR, SSIM, and LPIPS on dissimilar views, and a compact representation) and GS (real-time rendering and ability for easily modifying the representation); the computational cost of these conversions is minor compared to training the two from scratch. This paper introduces a novel method for converting between implicit neural radiance fields (NeRFs) and explicit Gaussian Splatting (GS) representations, leveraging the advantages of both for robotics applications. This is crucial for robotics as it allows for combining the superior generalization and compactness of NeRFs with the real-time rendering and easy modification capabilities of GS, particularly beneficial in sparse view scenarios common in robotics. The approach involves training a modified NeRF to predict spherical harmonics, converting it to GS by generating a point cloud from the NeRF and initializing Gaussians, and optionally fine-tuning the GS. Conversely, GS can be converted back to NeRF by rendering training views from the GS and fitting a NeRF to these renderings. The proposed method, termed NeRFGS, achieves comparable or better results than state-of-the-art methods like Splatfacto, especially on novel views dissimilar to training data. Conversion between representations is computationally efficient, taking only a few seconds. The approach allows for editing the scene representation by modifying the GS and converting back to NeRF, enabling dynamic scene understanding and manipulation. The initial conversion from NeRF to GS can lead to a decrease in quality, highlighting potential for improvement in the conversion efficiency. Future work includes exploring adaptive Gaussian scales and anisotropic Gaussians for enhanced representation accuracy. neural radiance fields, gaussian splatting, scene representation, robotics, view generalization
2405.09673 Report LoRA Learns Less and Forgets Less Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning ($\approx$100K prompt-response pairs) and continued pretraining ($\approx$10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA. This paper presents a rigorous comparison of Low-Rank Adaptation (LoRA) and full finetuning for Llama-2 language models on challenging code and math domains. LoRA is widely used for efficient finetuning, but its performance compared to full finetuning in demanding domains is not well-understood. The authors finetuned Llama-2 7B and 13B models on code and math datasets using both LoRA and full finetuning. They evaluated performance on HumanEval (coding) and GSM8K (math), and measured forgetting on language understanding, world knowledge, and reasoning tasks. LoRA consistently underperforms full finetuning in terms of accuracy and sample efficiency, especially for code. LoRA exhibits better preservation of source-domain performance (less forgetting) compared to full finetuning. Full finetuning learns weight perturbations with a rank much higher than typical LoRA configurations, challenging the assumption of low-rank updates. The study primarily focuses on 7B and 13B models, leaving open the question of how LoRA scales with larger models. The spectral analysis does not rule out the existence of low-rank solutions for the downstream tasks. lora, fine-tuning, large language models, code generation, math reasoning
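The LoRA parameterization the study above evaluates is a frozen weight plus a trainable low-rank perturbation, y = Wx + (alpha/r) * B(Ax), with B zero-initialized so training starts from the base model. Below is a minimal, standard implementation of that idea; rank and alpha values are illustrative, not the paper's experimental settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax).

    Standard LoRA setup: A is small-random-initialized, B is zero-initialized,
    so the adapted layer is initially identical to the frozen base layer.
    """
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768), r=16, alpha=32)
    print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```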
2405.09546 Report BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/ The BEHAVIOR Vision Suite (BVS) is a customizable data generation tool for systematic evaluation and understanding of computer vision models. It leverages an extended 3D asset library from BEHAVIOR-1K and a generator to create custom vision datasets with rich annotations. Real-world datasets have limitations: limited labels, fixed data distributions, and difficulties in acquiring rare event data. Synthetic data can address these limitations but often lacks realism or customizability. BVS bridges the gap by offering a customizable generator for photorealistic synthetic data. BVS consists of extended BEHAVIOR-1K assets (8K+ object models, 1K scene instances) and a customizable data generator built upon OmniGibson. The generator allows for scene object randomization, physically realistic pose generation, predicate-based rich labeling, camera pose sampling, and configurable rendering. Parametric model evaluation reveals performance variations of SOTA models (detection and segmentation) across different domain shifts (articulation, lighting, visibility, zoom, pitch). Holistic scene understanding evaluation shows consistent relative performance between models tested on BVS's synthetic data and real datasets, highlighting the datasets' realism. Training a model on BVS's synthetic data for object state and relation prediction demonstrates promising zero-shot transfer capability to real-world images. The current version of BVS primarily focuses on indoor scenes. The sim2real gap, although minimized, still exists and requires further investigation, potentially through improved rendering techniques or domain adaptation methods. 
synthetic data generation, computer vision, model evaluation, sim2real transfer, 3d simulation
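For BVS above, the scene-, object-, and camera-level knobs can be thought of as a sampled generation config. The sketch below is purely illustrative: the parameter names and ranges are hypothetical and do not correspond to the actual BVS or OmniGibson API. In a controlled robustness study, one axis would be swept while the others are held fixed.

```python
import random
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    """Illustrative scene/object/camera parameters in the spirit of BVS.
    Names and ranges are hypothetical, not the real BVS/OmniGibson interface."""
    light_intensity: float      # scene level
    object_visibility: float    # object level: fraction left unoccluded
    cabinet_joint_open: float   # object level: articulation in [0, 1]
    camera_pitch_deg: float     # camera level
    camera_fov_deg: float       # camera level

def sample_configs(n: int, seed: int = 0):
    rng = random.Random(seed)
    return [
        GenerationConfig(
            light_intensity=rng.uniform(0.2, 1.0),
            object_visibility=rng.uniform(0.3, 1.0),
            cabinet_joint_open=rng.uniform(0.0, 1.0),
            camera_pitch_deg=rng.uniform(-30.0, 30.0),
            camera_fov_deg=rng.uniform(60.0, 90.0),
        )
        for _ in range(n)
    ]

if __name__ == "__main__":
    for cfg in sample_configs(3):
        print(cfg)
```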
2405.09426 Report Global-Local Image Perceptual Score (GLIPS): Evaluating Photorealistic Quality of AI-Generated Images Memoona Aziz, Umair Rehman, Muhammad Umair Danish, Katarina Grolinger This paper introduces the Global-Local Image Perceptual Score (GLIPS), an image metric designed to assess the photorealistic image quality of AI-generated images with a high degree of alignment to human visual perception. Traditional metrics such as FID and KID scores do not align closely with human evaluations. The proposed metric incorporates advanced transformer-based attention mechanisms to assess local similarity and Maximum Mean Discrepancy (MMD) to evaluate global distributional similarity. To evaluate the performance of GLIPS, we conducted a human study on photorealistic image quality. Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores. Additionally, we introduce the Interpolative Binning Scale (IBS), a refined scaling method that enhances the interpretability of metric scores by aligning them more closely with human evaluative standards. The proposed metric and scaling approach not only provides more reliable assessments of AI-generated images but also suggest pathways for future enhancements in image generation technologies. This paper introduces the Global-Local Image Perceptual Score (GLIPS), a novel image metric designed to assess the photorealistic image quality of AI-generated images, aiming for a higher alignment with human visual perception compared to traditional metrics like FID and KID. Existing image quality metrics often fail to accurately capture and reflect human judgments of photorealism, particularly for images generated by advanced AI models. This discrepancy highlights the need for a more reliable and human-aligned metric for evaluating AI-generated images. GLIPS leverages vision transformer-based attention mechanisms to extract and compare salient image patches, addressing the issue of structural differences between camera-captured and AI-generated images. It also incorporates Maximum Mean Discrepancy (MMD) to evaluate the global distributional similarity of deep features extracted from the images. A novel scaling strategy, the Interpolative Binning Scale (IBS), is introduced to ensure unbiased and interpretable comparison between human and metric scores. A human study was conducted to evaluate the correlation between GLIPS and human perception of photorealistic image quality. GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores, demonstrating its effectiveness in capturing human-like perceptions of photorealism. The IBS method effectively mitigates biases introduced by traditional scaling methods, enabling a fairer and more interpretable comparison between metric outputs and human judgments. The human study confirms a strong correlation between GLIPS scores and human assessments of photorealism, validating the metric's alignment with human visual perception. Future work will focus on optimizing the GLIPS framework by exploring different neural network architectures and refining the kernel functions used in MMD calculation to enhance its applicability across a wider range of image types and generative models. 
Further research will investigate the generalizability of GLIPS to other image domains beyond those tested in the study, ensuring its robustness and effectiveness across diverse datasets. photorealistic image quality, ai-generated images, image quality assessment, vision transformer, maximum mean discrepancy
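A rough sketch of the two ingredients described above, an attention-guided local similarity over salient ViT patches and an RBF-kernel MMD over global features. The kernel bandwidth, the top-k patch selection, and the way the two terms are combined are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def rbf_mmd2(x, y, sigma=10.0):
    # squared Maximum Mean Discrepancy between two feature sets (rows are samples)
    kxx = torch.exp(-torch.cdist(x, x) ** 2 / (2 * sigma ** 2)).mean()
    kyy = torch.exp(-torch.cdist(y, y) ** 2 / (2 * sigma ** 2)).mean()
    kxy = torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2)).mean()
    return kxx + kyy - 2 * kxy

def local_patch_similarity(real_patches, gen_patches, real_attn, gen_attn, k=8):
    # compare only the k most-attended ViT patches of each image (salient regions)
    r = F.normalize(real_patches[real_attn.topk(k).indices], dim=-1)
    g = F.normalize(gen_patches[gen_attn.topk(k).indices], dim=-1)
    return (r @ g.t()).max(dim=1).values.mean()         # best-match cosine similarity

def glips_like_score(real_patches, gen_patches, real_attn, gen_attn,
                     real_global, gen_global, alpha=0.5):
    # illustrative combination: higher means more photorealistic; weighting is an assumption
    local = local_patch_similarity(real_patches, gen_patches, real_attn, gen_attn)
    global_term = torch.exp(-rbf_mmd2(real_global, gen_global))
    return alpha * local + (1 - alpha) * global_term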
2405.09403 Report Identity Overlap Between Face Recognition Train/Test Data: Causing Optimistic Bias in Accuracy Measurement Haiyu Wu, Sicong Tian, Jacob Gutierrez, Aman Bhatta, Kağan Öztürk, Kevin W. Bowyer A fundamental tenet of pattern recognition is that overlap between training and testing sets causes an optimistic accuracy estimate. Deep CNNs for face recognition are trained for N-way classification of the identities in the training set. Accuracy is commonly estimated as average 10-fold classification accuracy on image pairs from test sets such as LFW, CALFW, CPLFW, CFP-FP and AgeDB-30. Because train and test sets have been independently assembled, images and identities in any given test set may also be present in any given training set. In particular, our experiments reveal a surprising degree of identity and image overlap between the LFW family of test sets and the MS1MV2 training set. Our experiments also reveal identity label noise in MS1MV2. We compare accuracy achieved with same-size MS1MV2 subsets that are identity-disjoint and not identity-disjoint with LFW, to reveal the size of the optimistic bias. Using more challenging test sets from the LFW family, we find that the size of the optimistic bias is larger for more challenging test sets. Our results highlight the lack of and the need for identity-disjoint train and test methodology in face recognition research. This paper investigates the optimistic bias in face recognition accuracy caused by identity overlap between training and testing datasets, demonstrating the need for identity-disjoint train/test methodology. Current face recognition research lacks analysis of identity overlap between datasets, making it difficult to reliably compare algorithms and understand the true impact of this overlap on accuracy. The authors reverse-engineered the overlap between MS1MV2 (training) and LFW (testing) datasets, created identity-disjoint and identity-overlapped MS1MV2 subsets, and trained six face recognition models on these subsets to compare their performance on various test sets. 46.93% of identities in LFW are also present in MS1MV2, leading to an optimistic accuracy bias. Cleaning identity label noise in MS1MV2, even without addressing identity overlap, improves accuracy. The optimistic bias due to identity overlap is more pronounced on more challenging test sets. The study primarily focuses on the MS1MV2 training set and LFW family of test sets. Further investigation is needed to analyze identity overlap and its impact on other training and testing datasets. face recognition, identity overlap, optimistic bias, dataset bias, evaluation methodology
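The train/test identity audit at the core of this methodology is simple set arithmetic once identity labels are normalized; a minimal sketch (name normalization and the label-noise cleaning the paper shows to matter are omitted):

def identity_overlap_report(train_ids, test_pair_ids):
    """Check identity disjointness between a face-recognition training set and a test set.
    `train_ids` / `test_pair_ids` are iterables of identity labels after normalization."""
    train, test = set(train_ids), set(test_pair_ids)
    shared = train & test
    print(f"{len(shared)} / {len(test)} test identities also appear in training "
          f"({100 * len(shared) / max(len(test), 1):.2f}%)")
    disjoint_train = train - test          # identity-disjoint subset for unbiased accuracy
    return shared, disjoint_train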
2405.09266 Report Dance Any Beat: Blending Beats with Visuals in Dance Video Generation Xuanchen Wang, Heng Wang, Dongnan Liu, Weidong Cai The task of generating dance from music is crucial, yet current methods, which mainly produce joint sequences, lead to outputs that lack intuitiveness and complicate data collection due to the necessity for precise joint annotations. We introduce a Dance Any Beat Diffusion model, namely DabFusion, that employs music as a conditional input to directly create dance videos from still images, utilizing conditional image-to-video generation principles. This approach pioneers the use of music as a conditioning factor in image-to-video synthesis. Our method unfolds in two stages: training an auto-encoder to predict latent optical flow between reference and driving frames, eliminating the need for joint annotation, and training a U-Net-based diffusion model to produce these latent optical flows guided by music rhythm encoded by CLAP. Although capable of producing high-quality dance videos, the baseline model struggles with rhythm alignment. We enhance the model by adding beat information, improving synchronization. We introduce a 2D motion-music alignment score (2D-MM Align) for quantitative assessment. Evaluated on the AIST++ dataset, our enhanced model shows marked improvements in 2D-MM Align score and established metrics. Video results can be found on our project page: https://DabFusion.github.io. Introduces DabFusion, a novel diffusion-based model that generates dance videos directly from a still image and music, eliminating the need for joint annotations and pioneering the use of music as a condition in image-to-video synthesis. Addresses the limitations of current music-to-dance generation methods that rely on joint sequences, resulting in less intuitive outputs and complex data collection. Employs a two-stage approach: 1) training a latent flow auto-encoder to estimate optical flow between video frames and 2) training a U-Net-based diffusion model to generate latent flows conditioned on music (encoded by CLAP) and a starting image. Enhances rhythm alignment by incorporating beat information extracted via Librosa. DabFusion generates high-quality dance videos comparable to state-of-the-art unconditional video generation models. Incorporating beat information significantly improves the synchronization between dance movements and music. Camera angle and distance significantly influence the quality of generated videos. Video quality degrades with increasing length due to the accumulation of errors. Future work includes improving arbitrary-length video generation and exploring other conditioning factors like dance style descriptions. image-to-video synthesis, music-to-dance generation, diffusion models, motion-music alignment, conditional video generation
2405.09215 Report Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM. This paper introduces XModel-VLM, an efficient and lightweight multimodal vision language model designed for deployment on consumer-grade GPU servers. The paper addresses the challenge of high operational costs associated with large-scale multimodal models, which hinders their widespread adoption. The authors develop a 1B-scale language model from scratch and integrate it with a CLIP ViT-L/14 vision encoder using a simple yet effective two-layer MLP projector (XDP). The model is trained using a two-stage approach: pre-training for feature alignment and fine-tuning for instruction following. XModel-VLM achieves comparable performance to larger models on various multimodal benchmarks despite its smaller size. The model demonstrates faster inference speeds compared to LLAVA-7B. Ablation studies highlight the effectiveness of the proposed projector design and the impact of token numbers on model performance. Larger language models could further improve performance. Further optimization is needed for even faster inference. vision language model, multimodal learning, efficient deployment, lightweight model, cross-modal alignment
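A minimal sketch of a two-layer MLP projector of the kind described (the paper's XDP): it maps CLIP ViT-L/14 patch features into the language model's embedding space. The dimensions, GELU activation, and absence of token downsampling are assumptions, not the actual XDP design.

import torch
import torch.nn as nn

class TwoLayerProjector(nn.Module):
    """Projects vision-encoder patch tokens into the LM embedding space."""
    def __init__(self, vision_dim=1024, lm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_tokens):        # (B, num_patches, vision_dim)
        return self.proj(patch_tokens)      # (B, num_patches, lm_dim), prepended to text tokens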
2405.09114 Report SOEDiff: Efficient Distillation for Small Object Editing Qihe Pan, Zicheng Wang, Zhen Zhao, Yiming Wu, Sifan Long, Haoran Liang, Ronghua Liang In this paper, we delve into a new task known as small object editing (SOE), which focuses on text-based image inpainting within a constrained, small-sized area. Despite the remarkable success achieved by current image inpainting approaches, their application to the SOE task generally results in failure cases such as Object Missing, Text-Image Mismatch, and Distortion. These failures stem from the limited use of small-sized objects in training datasets and the downsampling operations employed by U-Net models, which hinder accurate generation. To overcome these challenges, we introduce a novel training-based approach, SOEDiff, aimed at enhancing the capability of baseline models like StableDiffusion in editing small-sized objects while minimizing training costs. Specifically, our method involves two key components: SO-LoRA, which efficiently fine-tunes low-rank matrices, and Cross-Scale Score Distillation loss, which leverages high-resolution predictions from the pre-trained teacher diffusion model. Our method presents significant improvements on the test dataset collected from MSCOCO and OpenImage, validating the effectiveness of our proposed method in small object editing. In particular, when comparing SOEDiff with the SD-I model on the OpenImage-f dataset, we observe a 0.99 improvement in CLIP-Score and a reduction of 2.87 in FID. Our project page can be found at https://soediff.github.io/. Introduces SOEDiff, a novel training-based approach for text-based small object editing (SOE) in images, enhancing the capabilities of baseline models like StableDiffusion. Addresses the limitations of existing image inpainting models in handling small object editing, a task crucial for subtle image manipulations. Employs SO-LoRA for efficient fine-tuning of low-rank matrices and a Cross-Scale Score Distillation loss leveraging high-resolution predictions from a pre-trained teacher diffusion model. Significantly improves text-to-image alignment and reduces object-missing, mismatch, and distortion issues in small object editing. Outperforms baselines like SD-I and BlendedDM on MSCOCO and OpenImage datasets, showing significant gains in CLIP-Score and FID. Demonstrates extended applications in object removal and replacement tasks beyond basic inpainting. Limited exploration of different crop sizes and aspect ratios for the teacher model input. Further research on reducing the computational cost associated with VAE fine-tuning. small object editing, image editing, lora, score distillation, diffusion models
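A hedged sketch of the cross-scale score-distillation idea: the student's noise prediction on the small edit region is pulled toward the frozen teacher's prediction made at higher resolution. The resizing step and the plain MSE weighting are assumptions about how the two scales are compared.

import torch
import torch.nn.functional as F

def cross_scale_distillation_loss(student_eps, teacher_eps, weight=1.0):
    # student_eps: noise predicted by the SO-LoRA-tuned model on the small-object region
    # teacher_eps: noise predicted by the frozen teacher on an upscaled, high-resolution crop
    teacher_eps = F.interpolate(teacher_eps, size=student_eps.shape[-2:],
                                mode="bilinear", align_corners=False)
    return weight * F.mse_loss(student_eps, teacher_eps.detach())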
2405.08911 Report CLIP with Quality Captions: A Strong Pretraining for Vision Tasks Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Oncel Tuzel CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L, pretrained on ImageNet-22k, on the semantic segmentation task while being 6.1x smaller. Moreover, we show that improving caption quality results in 10x data efficiency when fine-tuning for dense prediction tasks. This paper investigates the impact of caption quality on CLIP's performance in downstream dense prediction tasks, showing that CLIP with high-quality captions outperforms many supervised, self-supervised, and weakly supervised methods. While CLIP excels in zero-shot classification and retrieval, its performance in dense prediction tasks has lagged behind other methods. This work demonstrates that caption quality is crucial for CLIP's performance in these tasks. The authors compare the performance of CLIP models pretrained on datasets with varying caption quality (ALIGN, DataComp, DataCompDR). They fine-tune and evaluate these models on ImageNet-1K, MS COCO, ADE20k, and NYUv2 benchmarks. CLIP pretrained on DataCompDR, a dataset with high-quality captions, achieves state-of-the-art results on dense prediction tasks, outperforming methods like MAE and MAWS. Improving caption quality leads to better data efficiency, with CLIP models trained on smaller subsets of DataCompDR matching the performance of models trained on larger subsets of DataComp. CLIP pretraining significantly benefits mobile architectures, achieving accuracy comparable to larger models like Swin-L on semantic segmentation. The study primarily focuses on ViT-B/16 architecture, and further investigation is needed to assess the impact of caption quality on larger CLIP models. Future work could explore the development of more advanced captioning methods to further improve CLIP's performance in dense prediction tasks. clip, image captioning, dense prediction, self-supervised learning, mobile architectures
2405.08733 Report A Simple Approach to Differentiable Rendering of SDFs Zichen Wang, Xi Deng, Ziyi Zhang, Wenzel Jakob, Steve Marschner We present a simple algorithm for differentiable rendering of surfaces represented by Signed Distance Fields (SDF), which makes it easy to integrate rendering into gradient-based optimization pipelines. To tackle visibility-related derivatives that make rendering non-differentiable, existing physically based differentiable rendering methods often rely on elaborate guiding data structures or reparameterization with a global impact on variance. In this article, we investigate an alternative that embraces nonzero bias in exchange for low variance and architectural simplicity. Our method expands the lower-dimensional boundary integral into a thin band that is easy to sample when the underlying surface is represented by an SDF. We demonstrate the performance and robustness of our formulation in end-to-end inverse rendering tasks, where it obtains results that are competitive with or superior to existing work. This paper introduces a simple and robust algorithm for differentiable rendering of surfaces represented by Signed Distance Fields (SDFs), enabling easier integration of rendering into gradient-based optimization pipelines. Differentiable rendering, crucial for applications like inverse rendering and 3D reconstruction, often suffers from visibility discontinuities. Existing solutions either rely on complex data structures or increase gradient variance. This method offers an alternative that embraces a small, controlled bias in exchange for low variance and simplicity. The core idea is to relax the strict visibility boundary to a thin band around the object silhouette. This transforms the challenging lower-dimensional boundary integral into a simpler area integral that can be efficiently estimated using standard Monte Carlo sampling. The method achieves high-quality inverse rendering results, comparable to or surpassing existing techniques. It exhibits robustness to the choice of SDF threshold, a key parameter controlling the relaxation. The simplicity of the approach makes it easy to implement and integrate into existing rendering systems. The method introduces a small bias due to the relaxation of the visibility boundary. The optimal SDF threshold may require tuning depending on the scene scale. differentiable rendering, signed distance functions, inverse rendering, 3d reconstruction, monte carlo methods
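The "thin band" relaxation can be illustrated as a smooth weight that is nonzero only where |SDF| falls below a small threshold, so ordinary area samples can hit the (relaxed) silhouette. The linear ramp below is an illustrative choice, not necessarily the paper's exact weighting.

import torch

def silhouette_band_weight(sdf_value, eps=0.02):
    """Relaxed visibility weight for the boundary term (sketch): points whose |SDF| lies
    inside a band of half-width `eps` receive a smooth, nonzero weight; everywhere else the
    weight is zero, recovering the sharp silhouette as eps -> 0 (at the cost of variance)."""
    return torch.clamp(1.0 - sdf_value.abs() / eps, min=0.0)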
2405.08720 Report The Lost Melody: Empirical Observations on Text-to-Video Generation From A Storytelling Perspective Andrew Shin, Yusuke Mori, Kunitake Kaneko The text-to-video generation task has witnessed notable progress, with the generated outcomes reflecting the text prompts with high fidelity and impressive visual quality. However, current text-to-video generation models are invariably focused on conveying the visual elements of a single scene, and have so far been indifferent to another important potential of the medium, namely storytelling. In this paper, we examine text-to-video generation from a storytelling perspective, which has hardly been investigated, and make empirical remarks that spotlight the limitations of the current text-to-video generation scheme. We also propose an evaluation framework for storytelling aspects of videos, and discuss potential future directions. This paper investigates the capabilities and limitations of current text-to-video generation models in storytelling, a largely unexplored area. Current text-to-video models excel at generating visually appealing single scenes or movements but struggle to weave coherent narratives. This paper aims to bridge this gap and explore storytelling potential in video generation. The authors generate videos from three types of text prompts: 1) short stories, 2) scripts with dialogue, and 3) existing captions from a video storytelling dataset. They then evaluate these videos using established visual quality metrics (FVD, Inception Score), a novel cyclical evaluation framework (T2Vid2T) that assesses text-video alignment, and human evaluations focusing on story components (character, setting, plot) and overall comprehensibility. Current text-to-video generation models struggle to maintain narrative coherence across multiple scenes, often resulting in visually appealing but narratively disjointed videos. Videos generated from factual descriptions (captions) show better visual quality and story coherence than those generated from short stories or scripts, highlighting a potential bias in training data. Adding narration to videos generally improves story comprehension, but mismatches between generated visuals and narration can hinder understanding. The study relies heavily on manual evaluation for storytelling aspects due to the lack of standardized automatic metrics. The research primarily focuses on visual storytelling, leaving exploration of incorporating audio cues (e.g., dialogue, sound effects) for future work. text-to-video generation, storytelling, video evaluation, narrative coherence, ai and creativity
2405.08055 Report DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt improved triplane to guarantee efficiency; 2) introduce the 3D-aware transformer to aggregate the generalized 3D knowledge with specialized 3D features; and 3) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It boils down to two parts: multi-view reconstruction loss and triplane refinement. Specifically, we utilize multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality. Presents DiffTF++, a diffusion-based feed-forward framework for generating diverse 3D objects across many categories using a single model. Addresses limitations of existing optimization-based and feed-forward 3D generation methods in efficiency, generalizability, and handling diverse object appearances. Employs triplane representation, 3D-aware transformer for global 3D knowledge and specialized feature extraction, 3D-aware encoder/decoder for enhanced semantic understanding, multi-view reconstruction loss for consistency between stages, and 3D-aware refinement for artifact elimination and detail enhancement. Achieves state-of-the-art performance on ShapeNet and OmniObject3D datasets in terms of 2D and 3D metrics. Generates high-quality 3D objects with realistic topology, rich texture, and fine details, outperforming previous methods. Demonstrates strong generalization ability for large-vocabulary 3D object generation, handling diverse categories with complex geometry and textures. Current implementation is limited to relatively low-resolution triplanes. Exploration of incorporating text-guided generation capabilities for more controllable and diverse 3D object synthesis. 3d generation, diffusion models, transformer, triplane representation, large-vocabulary generation
2405.08054 Report Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning Wenqi Dong, Bangbang Yang, Lin Ma, Xiao Liu, Liyuan Cui, Hujun Bao, Yuewen Ma, Zhaopeng Cui As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we now can easily utilize 2D diffusion methods to synthesize images controlled by raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D assets modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow to support seamless local part editing while delivering responsive 3D object previewing within a few seconds. To this end, we develop several techniques, including the 3D adapter that applies volumetric coarse shape control to the diffusion model, proxy-bounded editing strategy for precise part editing, progressive volume cache to support responsive preview, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments of interactive generation and editing on diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D assets generation task. Coin3D is a novel controllable and interactive 3D asset modeling framework that uses coarse geometry proxies, assembled from basic shapes, to guide the generation of detailed 3D objects. Existing 3D generative methods lack controllability and efficiency, relying on text prompts or images that inadequately represent 3D shapes. Coin3D addresses this by offering a user-friendly approach for creating and editing 3D assets with precise 3D control. Coin3D leverages a 3D adapter module to integrate a voxelized 3D proxy into a multiview diffusion process. It employs a proxy-bounded editing strategy for precise local adjustments and a progressive volume caching mechanism for responsive preview. Coin3D enables generating 3D objects with faithful adherence to user-provided coarse shapes, outperforming image-based generation methods in quality and user studies. Compared to existing controllable generation methods, Coin3D shows superior control and avoids issues like overgrowth or incomplete details, while being significantly faster in providing feedback. The interactive workflow allows for seamlessly adding, adjusting, or regenerating specific parts of the object with responsive preview, making it suitable for an iterative design process. The initial 2D image candidate generation, while providing a quick preview, depends on prompt engineering and might require further enhancement for complex textures or backgrounds. The resolution of generated details is limited by the base diffusion model, and future work could explore high-resolution optimization or material-disentangled models. 3d object generation, controllable generation, interactive modeling, diffusion models, 3d-aware conditioning
2405.07992 Report MambaOut: Do We Really Need Mamba for Vision? Weihao Yu, Xinchao Wang Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at https://github.com/yuweihao/MambaOut This paper investigates the necessity of Mamba, an RNN-like architecture, for visual recognition tasks, arguing that it is not essential for image classification but potentially beneficial for detection and segmentation. The quadratic complexity of attention in Transformers poses challenges for long sequences, motivating the exploration of alternative token mixers like Mamba, particularly for vision tasks where their efficacy remains unclear. The authors analyze the suitability of Mamba for long-sequence and autoregressive tasks, then examine the characteristics of visual recognition tasks against these criteria. They introduce MambaOut models, which remove the core SSM component of Mamba, to empirically evaluate its necessity. MambaOut models, despite lacking SSM, consistently outperform visual Mamba models on ImageNet image classification, supporting the hypothesis that SSM is unnecessary for this task. In contrast, MambaOut models fall short of state-of-the-art visual Mamba models in object detection and semantic segmentation, highlighting the potential benefits of SSM for long-sequence visual tasks. Visual Mamba models, while showing promise for long sequences, still lag behind state-of-the-art convolution and attention-based models in visual recognition tasks, indicating a need for further development. The study primarily focuses on conceptual analysis and empirical verification of Mamba's efficacy for visual tasks, leaving the exploration of RNN and Transformer integration for future work. The paper acknowledges computational resource limitations and suggests further investigation into Mamba and RNN concepts for large language models (LLMs) and large multimodal models (LMMs) as future directions. mamba, vision transformer, image classification, object detection, semantic segmentation
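In this spirit, a MambaOut-style block is essentially a Gated CNN block: the Mamba block layout with the SSM token mixer deleted, leaving a depthwise convolution and a gate. A minimal sketch follows; the expansion ratio, kernel size, and norm placement are assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class GatedCNNBlock(nn.Module):
    """Mamba-style block with the SSM removed: norm -> expand into value/gate branches ->
    depthwise-conv token mixing on the value branch -> gating -> project -> residual."""
    def __init__(self, dim, expansion=2, kernel_size=7):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)
        self.fc_in = nn.Linear(dim, hidden * 2)           # produces value and gate branches
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2, groups=hidden)
        self.act = nn.GELU()
        self.fc_out = nn.Linear(hidden, dim)

    def forward(self, x):                                  # x: (B, H, W, C), channels-last
        shortcut = x
        x = self.norm(x)
        v, g = self.fc_in(x).chunk(2, dim=-1)
        v = self.dwconv(v.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # token mixing by conv only
        x = self.fc_out(self.act(g) * v)
        return x + shortcut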
2405.07919 Report Exploring the Low-Pass Filtering Behavior in Image Super-Resolution Haoyu Deng, Zijing Xu, Yule Duan, Xiao Wu, Wenjie Shu, Liang-Jian Deng Deep neural networks for image super-resolution (ISR) have shown significant advantages over traditional approaches like interpolation. However, they are often criticized as 'black boxes' compared to traditional approaches with solid mathematical foundations. In this paper, we attempt to interpret the behavior of deep neural networks in ISR using theories from the field of signal processing. First, we report an intriguing phenomenon, referred to as 'the sinc phenomenon.' It occurs when an impulse input is fed to a neural network. Then, building on this observation, we propose a method named Hybrid Response Analysis (HyRA) to analyze the behavior of neural networks in ISR tasks. Specifically, HyRA decomposes a neural network into a parallel connection of a linear system and a non-linear system and demonstrates that the linear system functions as a low-pass filter while the non-linear system injects high-frequency information. Finally, to quantify the injected high-frequency information, we introduce a metric for image-to-image tasks called Frequency Spectrum Distribution Similarity (FSDS). FSDS reflects the distribution similarity of different frequency components and can capture nuances that traditional metrics may overlook. Code, videos and raw experimental results for this paper can be found at: https://github.com/RisingEntropy/LPFInISR. This paper unveils the "sinc phenomenon," demonstrating that the impulse response of image super-resolution (ISR) networks acts as a low-pass filter, and introduces Hybrid Response Analysis (HyRA) to interpret ISR networks by separating them into linear (low-pass filter) and non-linear (high-frequency injection) components. This work enhances the interpretability of ISR networks, typically criticized as "black boxes," by linking them to traditional signal processing theories. The authors analyze impulse responses of various ISR networks, visualize feature maps, and compare performance with traditional low-pass filters. They also propose a new metric, Frequency Spectrum Distribution Similarity (FSDS), to quantify high-frequency information injection. The impulse responses of many ISR networks, regardless of CNN or transformer-based architecture, resemble sinc functions, suggesting an inherent low-pass filtering behavior. HyRA demonstrates that the non-linear component of ISR networks injects high-frequency details, compensating for the low-pass filtering effect. FSDS effectively captures high-frequency distortions, unlike PSNR, SSIM, or LPIPS, highlighting its sensitivity and necessity in evaluating ISR quality. The "sinc phenomenon" is not universally observed, particularly in networks trained with adversarial loss, suggesting a connection to loss function choices. Future work includes investigating the impact of different window functions on impulse responses and exploring why certain networks treat specific high-frequency information as low-frequency. image super-resolution, deep learning interpretability, signal processing, low-pass filtering, frequency spectrum analysis
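The 'sinc phenomenon' probe is easy to reproduce in outline: feed a unit impulse to any SR network and inspect the output, which HyRA reads as the kernel of the network's linear (low-pass) component; the non-linear component is whatever the network adds beyond convolving the input with that kernel. A sketch, where `model` is any callable image-to-image network:

import torch

@torch.no_grad()
def impulse_response(model, size=64, channel=0):
    """Probe an ISR network with a unit impulse; per HyRA the response approximates the
    kernel of the network's linear, low-pass component (often sinc-like)."""
    x = torch.zeros(1, 3, size, size)
    x[0, channel, size // 2, size // 2] = 1.0           # unit impulse at the centre
    return model(x)

# HyRA decomposition (schematic): model(x) = conv(x, impulse_response) + nonlinear_residual,
# where the residual is what injects the high-frequency detail that FSDS is meant to measure.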
2405.07913 Report CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models Nick Stracke, Stefan Andreas Baumann, Joshua M. Susskind, Miguel Angel Bautista, Björn Ommer Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to consider detailed forms of conditioning reflecting style and/or structure information remains an open problem. In this paper, we present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient, powerful, and architecture-agnostic approach to condition text-to-image diffusion models, which enables fine-grained control conditioning during generation and outperforms recent state-of-the-art approaches. This paper introduces LoRAdapter, a novel approach for conditioning text-to-image diffusion models that unifies style and structure conditioning under the same formulation using conditional LoRA blocks. LoRAdapter addresses the open problem of guiding the generative process of text-to-image models to consider detailed forms of conditioning reflecting both style and structure information in a zero-shot manner. LoRAdapter leverages the low-rank property of LoRAs to regularize conditioning and applies a conditional affine transformation to the low-dimensional intermediate embedding in the LoRA. This allows for efficient adaptation of both convolutional and attention layers in diffusion models for local (structure) and global (style) conditioning. LoRAdapter achieves state-of-the-art performance on CLIP-I and CLIP-T scores for style conditioning, outperforming both dedicated adapters and some models trained from scratch. LoRAdapter demonstrates superior adherence to structure guidance compared to existing methods like ControlNet and T2I-Adapter, as evidenced by quantitative metrics on depth and HED map reconstruction tasks. Ablation studies highlight the modularity of LoRAdapter, showing that adapting cross-attention layers yields the best performance for style conditioning and allows for logical fusion with text prompts. The effectiveness of LoRAdapter has only been demonstrated on text-to-image diffusion models based on Stable Diffusion; further investigation on fully transformer-based diffusion models and large language models is needed. While improving control over image generation, LoRAdapter could potentially be misused to generate more believable disinformation or harmful content. text-to-image generation, diffusion models, conditional image synthesis, low-rank adaptation (lora), style and structure control
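A minimal sketch of a conditional LoRA layer in the spirit described: the low-rank intermediate embedding receives a condition-dependent affine transformation (scale/shift predicted from a style or structure embedding). Where exactly the affine map is applied and how the condition is encoded are assumptions, not the paper's specification.

import torch
import torch.nn as nn

class ConditionalLoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank update whose intermediate embedding is
    modulated by a condition vector (e.g. a style or structure code)."""
    def __init__(self, base: nn.Linear, rank=8, cond_dim=768):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # only the adapter is trained
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                   # adapter starts as a no-op
        self.to_scale_shift = nn.Linear(cond_dim, 2 * rank)

    def forward(self, x, cond):                          # x: (B, N, d_in), cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        h = self.down(x) * (1 + scale) + shift           # condition-dependent affine on the
        return self.base(x) + self.up(h)                 # low-rank embedding (zero-shot control)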
2405.07813 Report Localizing Task Information for Improved Model Merging and Compression Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, Pascal Frossard Model merging and task arithmetic have emerged as promising scalable approaches to merge multiple single-task checkpoints to one multi-task model, but their applicability is reduced by significant performance loss. Previous works have linked these drops to interference in the weight space and erasure of important task-specific features. Instead, in this work we show that the information required to solve each task is still preserved after merging as different tasks mostly use non-overlapping sets of weights. We propose TALL-masks, a method to identify these task supports given a collection of task vectors and show that one can retrieve >99% of the single task accuracy by applying our masks to the multi-task vector, effectively compressing the individual checkpoints. We study the statistics of intersections among constructed masks and reveal the existence of selfish and catastrophic weights, i.e., parameters that are important exclusively to one task and irrelevant to all tasks but detrimental to multi-task fusion. For this reason, we propose Consensus Merging, an algorithm that eliminates such weights and improves the general performance of existing model merging approaches. Our experiments in vision and NLP benchmarks with up to 20 tasks, show that Consensus Merging consistently improves existing approaches. Furthermore, our proposed compression scheme reduces storage from 57Gb to 8.2Gb while retaining 99.7% of original performance. This paper introduces TALL-masks, a method to localize task-specific information in multi-task vectors generated from merging fine-tuned models, enabling both model compression and improved model merging. Model merging and compression are crucial for efficiently leveraging and deploying large, fine-tuned models, but existing methods suffer from performance loss due to task interference. TALL-masks identifies task-specific weight subsets by minimizing the L1 distance between the original task vector and a masked version of the multi-task vector. This enables the creation of compressed models or improved merged models by eliminating catastrophic and selfish weights. Task-specific information is preserved in merged models, and TALL-masks can effectively recover near-original performance. TALL-masks enables compression of fine-tuned models to a fraction of their original size (e.g., 13.7% for a 20-task benchmark) with minimal performance loss. Consensus Merging, which leverages TALL-masks to eliminate detrimental weights, consistently improves the performance of existing model merging methods like Task Arithmetic and TIES across vision and NLP tasks. The optimal weight-pruning threshold for Consensus Merging varies depending on factors like the model merging method and the number of tasks. Further research can explore the impact of different merging strategies on weight profiles and optimize for specific applications. model merging, model compression, task arithmetic, weight interpolation, task interference
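An illustrative way to realize such masks from flattened task vectors: keep a parameter for task t when that task's own update dominates the combined update of the remaining tasks. The paper derives its masks by minimizing an L1 distance to the original task vector; the threshold rule below is an approximation of that idea, with lambda trading mask sparsity against retained accuracy.

import torch

def tall_style_masks(task_vectors, lam=0.4):
    """Illustrative per-task mask construction over a merged (summed) multi-task vector."""
    multitask = torch.stack(task_vectors).sum(dim=0)
    masks = []
    for tau in task_vectors:
        others = multitask - tau
        masks.append((tau.abs() >= lam * others.abs()).float())
    return multitask, masks

# per-task compressed checkpoint: theta_t ~= theta_pretrained + mask_t * multitask_vector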
2405.07648 Report CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution Qingguo Liu, Chenyi Zhuang, Pan Gao, Jie Qin Existing Blind image Super-Resolution (BSR) methods focus on estimating either kernel or degradation information, but have long overlooked the essential content details. In this paper, we propose a novel BSR approach, Content-aware Degradation-driven Transformer (CDFormer), to capture both degradation and content representations. However, low-resolution images cannot provide enough content details, and thus we introduce a diffusion-based module CDFormer_diff to first learn Content Degradation Prior (CDP) in both low- and high-resolution images, and then approximate the real distribution given only low-resolution information. Moreover, we apply an adaptive SR network CDFormer_SR that effectively utilizes CDP to refine features. Compared to previous diffusion-based SR methods, we treat the diffusion model as an estimator that can overcome the limitations of expensive sampling time and excessive diversity. Experiments show that CDFormer can outperform existing methods, establishing a new state-of-the-art performance on various benchmarks under blind settings. Codes and models will be available at https://github.com/I2-Multimedia-Lab/CDFormer. This paper proposes CDFormer, a novel Content-aware Degradation-driven Transformer network for Blind image Super-Resolution (BSR). CDFormer leverages a two-stage training strategy to capture both degradation and content representations through a Content Degradation Prior (CDP) generation module and a CDP-guided SR network. Existing BSR methods typically focus solely on estimating kernel or degradation information, neglecting crucial content details. This can lead to suboptimal performance, particularly in challenging scenarios with complex degradations. The method utilizes a two-stage training approach. Stage 1: A ground-truth encoder (E_GT) learns CDP from paired HR and LR images to guide the SR network. Stage 2: An LR encoder (E_LR) and a diffusion model generate CDP solely from LR images. CDFormer achieves state-of-the-art performance on various BSR benchmarks under blind settings. The introduction of CDP enables CDFormer to reconstruct SR images with sharper and more harmonious textures, even in cases of severe degradation. The diffusion model effectively recreates CDP from LR images, demonstrating its potential in super-resolution tasks. The performance improvement of CDFormer is limited when dealing with LR images with high noise levels. Future work could explore the integration of other techniques like contrastive learning to further enhance the robustness and accuracy of CDFormer. blind image super-resolution, diffusion models, transformer networks, content degradation prior, deep learning
2405.07392 Report NGD-SLAM: Towards Real-Time SLAM for Dynamic Environments without GPU Yuhao Zhang Accurate and robust camera tracking in dynamic environments presents a significant challenge for visual SLAM (Simultaneous Localization and Mapping). Recent progress in this field often involves the use of deep learning techniques to generate mask for dynamic objects, which usually require GPUs to operate in real-time (30 fps). Therefore, this paper proposes a novel visual SLAM system for dynamic environments that obtains real-time performance on CPU by incorporating a mask prediction mechanism, which allows the deep learning method and the camera tracking to run entirely in parallel at different frequencies such that neither waits for the result from the other. Based on this, it further introduces a dual-stage optical flow tracking approach and employs a hybrid usage of optical flow and ORB features, which significantly enhance the efficiency and robustness of the system. Compared with state-of-the-art methods, this system maintains high localization accuracy in dynamic environments while achieving a tracking frame rate of 56 fps on a single laptop CPU without any hardware acceleration, thus proving that deep learning methods are still feasible for dynamic SLAM even without GPU support. Based on the available information, this is the first SLAM system to achieve this. This paper presents NGD-SLAM, a real-time visual SLAM system for dynamic environments that achieves real-time performance on CPU by incorporating a novel mask prediction mechanism and dual-stage optical flow tracking. Accurate and robust camera tracking in dynamic environments is challenging for visual SLAM. Existing methods often rely on computationally expensive deep learning models, requiring GPUs for real-time performance. The system uses a mask prediction mechanism based on previous segmentation results and a dual-stage tracking approach employing optical flow for both dynamic and static feature tracking, coupled with ORB features for keyframe tracking. NGD-SLAM achieves localization accuracy comparable to state-of-the-art methods in dynamic environments. It maintains high efficiency, achieving a tracking frame rate of 56 fps on a single laptop CPU without hardware acceleration. The proposed system is the first to demonstrate real-time performance on CPU for dynamic SLAM with deep learning-based dynamic object detection. The mask prediction mechanism may fail when a new dynamic object suddenly enters the scene. Future work will focus on improving the system's robustness in handling complex and large-scale dynamic environments and exploring other lightweight deep learning models for improved efficiency. visual slam, dynamic environments, deep learning, mask prediction, optical flow tracking, real-time
2405.07346 Report Understanding and Evaluating Human Preferences for AI Generated Images with Instruction Tuning Jiarui Wang, Huiyu Duan, Guangtao Zhai, Xiongkuo Min Artificial Intelligence Generated Content (AIGC) has grown rapidly in recent years, among which AI-based image generation has gained widespread attention due to its efficient and imaginative image creation ability. However, AI-generated Images (AIGIs) may not satisfy human preferences due to their unique distortions, which highlights the necessity to understand and evaluate human preferences for AIGIs. To this end, in this paper, we first establish a novel Image Quality Assessment (IQA) database for AIGIs, termed AIGCIQA2023+, which provides human visual preference scores and detailed preference explanations from three perspectives including quality, authenticity, and correspondence. Then, based on the constructed AIGCIQA2023+ database, this paper presents a MINT-IQA model to evaluate and explain human preferences for AIGIs from Multi-perspectives with INstruction Tuning. Specifically, the MINT-IQA model first learns and evaluates human preferences for AI-generated images from multiple perspectives, then via the vision-language instruction tuning strategy, MINT-IQA attains powerful understanding and explanation ability for human visual preference on AIGIs, which can be used for feedback to further improve the assessment capabilities. Extensive experimental results demonstrate that the proposed MINT-IQA model achieves state-of-the-art performance in understanding and evaluating human visual preferences for AIGIs, and the proposed model also achieves competitive results on traditional IQA tasks compared with state-of-the-art IQA models. The AIGCIQA2023+ database and MINT-IQA model will be released to facilitate future research. This paper introduces AIGCIQA2023+, an extended dataset for evaluating human preferences in AI-generated images, and proposes MINT-IQA, a novel method for evaluating and explaining these preferences from multiple perspectives using instruction tuning. Understanding human preferences for AI-generated images is crucial for improving the quality of generated content and bridging the gap between human expectations and AI capabilities. The authors construct AIGCIQA2023+ with fine-grained preference annotations and develop MINT-IQA, which leverages a multi-modal Q-Former for representation learning, score regression for preference prediction, and vision-language instruction tuning for detailed explanation. MINT-IQA achieves state-of-the-art performance on three AIGC IQA datasets, demonstrating its effectiveness in evaluating human preferences from multiple perspectives. The model also demonstrates superior performance on traditional IQA databases, highlighting its versatility in assessing image quality. Ablation studies validate the contribution of each module in MINT-IQA, emphasizing the importance of instruction tuning and multi-perspective evaluation. The current model is limited by the scale of the AIGCIQA2023+ dataset. Future work can focus on expanding the dataset and exploring different modalities beyond text and images. artificial intelligence generated content (aigc), image quality assessment (iqa), human visual preference, instruction tuning, multi-perspective evaluation
2405.07306 Report Point Resampling and Ray Transformation Aid to Editable NeRF Models Zhenyang Li, Zilong Chen, Feifan Qu, Mingqing Wang, Yizhou Zhao, Kai Zhang, Yifan Peng In NeRF-aided editing tasks, object movement presents difficulties in supervision generation due to the introduction of variability in object positions. Moreover, the removal operations of certain scene objects often lead to empty regions, presenting challenges for NeRF models in inpainting them effectively. We propose an implicit ray transformation strategy, allowing for direct manipulation of the 3D object's pose by operating on the neural-point in NeRF rays. To address the challenge of inpainting potential empty regions, we present a plug-and-play inpainting module, dubbed differentiable neural-point resampling (DNR), which interpolates those regions in 3D space at the original ray locations within the implicit space, thereby facilitating object removal & scene inpainting tasks. Importantly, employing DNR effectively narrows the gap between ground truth and predicted implicit features, potentially increasing the mutual information (MI) of the features across rays. Then, we leverage DNR and ray transformation to construct a point-based editable NeRF pipeline PR^2T-NeRF. Results primarily evaluated on 3D object removal & inpainting tasks indicate that our pipeline achieves state-of-the-art performance. In addition, our pipeline supports high-quality rendering visualization for diverse editing operations without necessitating extra supervision. This paper introduces a novel approach for object removal and scene inpainting in neural radiance fields (NeRFs) by combining implicit ray transformation with a differentiable neural-point resampling (DNR) strategy. Object manipulation in NeRFs, particularly removal and inpainting, presents challenges due to the need for precise supervision and the potential for artifacts in the edited regions. This work addresses these issues by directly manipulating rays and developing a method for consistent inpainting. The method involves: 1) Implicit ray transformation for object manipulation (rotation, translation, scaling, removal). 2) Target object segmentation using a pretrained SAM model and depth estimation. 3) Differentiable Neural-Point Resampling (DNR) to interpolate features in empty regions, enhancing consistency and visual quality. 4) Fine-tuning with a combination of reconstruction, perceptual, depth, and sparse losses. The proposed method achieves state-of-the-art performance on scene object removal and inpainting benchmarks. DNR strategies, particularly GWFA, are shown to significantly improve inpainting quality and convergence speed. Theoretical analysis and experimental validation demonstrate that DNR effectively increases mutual information among rays, leading to better feature consistency and inpainting results. The method's reliance on pretrained models for depth estimation, object segmentation, and inpainting could introduce limitations depending on their performance. Future work includes jointly optimizing depth estimation with object masks and integrating DNR directly into the NeRF rendering process. neural radiance fields, scene editing, object removal, scene inpainting, differentiable rendering
2405.07288 Report Erasing Concepts from Text-to-Image Diffusion Models with Few-shot Unlearning Masane Fuchi, Tomohiro Takagi Generating images from text has become easier because of the scaling of diffusion models and advancements in the field of vision and language. These models are trained using vast amounts of data from the Internet. Hence, they often contain undesirable content such as copyrighted material. As it is challenging to remove such data and retrain the models, methods for erasing specific concepts from pre-trained models have been investigated. We propose a novel concept-erasure method that updates the text encoder using few-shot unlearning in which a few real images are used. The discussion regarding the generated images after erasing a concept has been lacking. While there are methods for specifying the transition destination for concepts, the validity of the specified concepts is unclear. Our method implicitly achieves this by transitioning to the latent concepts inherent in the model or the images. Our method can erase a concept within 10 s, making concept erasure more accessible than ever before. Implicitly transitioning to related concepts leads to more natural concept erasure. We applied the proposed method to various concepts and confirmed that concept erasure can be achieved tens to hundreds of times faster than with current methods. By varying the parameters to be updated, we obtained results suggesting that, like previous research, knowledge is primarily accumulated in the feed-forward networks of the text encoder. This paper proposes a novel, fast method for erasing specific concepts from text-to-image diffusion models by updating the text encoder using few-shot unlearning. Existing methods for removing undesirable concepts from pre-trained text-to-image models are computationally expensive and often lead to a decrease in generation quality. This paper addresses these limitations with a faster, more efficient approach. The proposed method leverages few-shot unlearning by maximizing the stable diffusion loss with a reversed gradient, focusing on the text encoder while keeping the U-Net parameters unchanged. This forces the model to 'forget' the target concept represented by the text. The method achieves a significant speedup (60-900 times) compared to baseline methods, enabling concept erasure within 10 seconds. Concept erasure is achieved by providing only a few real images related to the target concept. The method implicitly transitions to semantically similar concepts, leading to more natural concept erasure without requiring explicit anchor concepts. The method may face challenges erasing concepts with large semantic spaces. Future work includes developing more robust evaluation metrics for concept erasure and exploring alternative methods like saliency map-based approaches. concept erasure, text-to-image diffusion models, few-shot unlearning, text encoder, stable diffusion
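The core update is simply gradient ascent on the usual denoising loss with only the text encoder trainable. A schematic step, assuming a diffusers-style Stable Diffusion setup (the optimizer wraps only text-encoder parameters; `latents` come from VAE-encoding the few real concept images and `input_ids` tokenize the concept prompt):

import torch
import torch.nn.functional as F

def unlearning_step(unet, text_encoder, scheduler, optimizer, latents, input_ids):
    """One few-shot unlearning step (sketch): maximize the denoising loss w.r.t. the text
    encoder so the model 'forgets' the concept named by `input_ids`; the U-Net stays frozen."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    text_emb = text_encoder(input_ids)[0]                # only these weights receive gradients
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = -F.mse_loss(pred, noise)                      # reversed gradient: ascend the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()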
2405.07145 Report Stable Signature is Unstable: Removing Image Watermark from Diffusion Models Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Gong Watermark has been widely deployed by industry to detect AI-generated images. A recent watermarking framework called \emph{Stable Signature} (proposed by Meta) roots watermark into the parameters of a diffusion model's decoder such that its generated images are inherently watermarked. Stable Signature makes it possible to watermark images generated by \emph{open-source} diffusion models and was claimed to be robust against removal attacks. In this work, we propose a new attack to remove the watermark from a diffusion model by fine-tuning it. Our results show that our attack can effectively remove the watermark from a diffusion model such that its generated images are non-watermarked, while maintaining the visual quality of the generated images. Our results highlight that Stable Signature is not as stable as previously thought. This paper introduces a new model-targeted attack method to remove in-generation watermarks from open-source diffusion models by fine-tuning the decoder. The misuse of AI-generated images presents risks of misinformation, making watermarking crucial for detection. Existing methods are vulnerable in open-source settings, and current removal attacks are either inefficient or significantly degrade image quality. The attack involves two steps: 1) Estimating the denoised latent vector for non-watermarked images in an attacking dataset, with different approaches for encoder-aware and encoder-agnostic scenarios. 2) Fine-tuning the decoder using the estimated latent vectors and non-watermarked images to minimize reconstruction error and fool a discriminator. The attack successfully evades watermark detection with high evasion rates and low bitwise accuracy. It maintains significantly better image quality (FID) than the existing model purification attack. The attack is more efficient than per-image-based removal attacks when processing a large number of images. The fine-tuning process in the encoder-agnostic scenario can be time-consuming. Future work includes exploring more robust watermarking methods for open-source diffusion models. image watermarking, diffusion models, watermark removal, generative ai, adversarial attacks
2405.07023 Report Efficient Real-world Image Super-Resolution Via Adaptive Directional Gradient Convolution Long Peng, Yang Cao, Renjing Pei, Wenbo Li, Jiaming Guo, Xueyang Fu, Yang Wang, Zheng-Jun Zha Real-SR endeavors to produce high-resolution images with rich details while mitigating the impact of multiple degradation factors. Although existing methods have achieved impressive results in detail recovery, they still fall short when addressing regions with complex gradient arrangements due to the intensity-based linear weighting feature extraction manner. Moreover, the stochastic artifacts introduced by degradation cues during the imaging process in real LR increase the disorder of the overall image details, further complicating the perception of intrinsic gradient arrangement. To address these challenges, we innovatively introduce kernel-wise differential operations within the convolutional kernel and develop several learnable directional gradient convolutions. These convolutions are integrated in parallel with a novel linear weighting mechanism to form an Adaptive Directional Gradient Convolution (DGConv), which adaptively weights and fuses the basic directional gradients to improve the gradient arrangement perception capability for both regular and irregular textures. Coupled with DGConv, we further devise a novel equivalent parameter fusion method for DGConv that maintains its rich representational capabilities while keeping computational costs consistent with a single Vanilla Convolution (VConv), enabling DGConv to improve the performance of existing super-resolution networks without incurring additional computational expenses. To better leverage the superiority of DGConv, we further develop an Adaptive Information Interaction Block (AIIBlock) to adeptly balance the enhancement of texture and contrast while meticulously investigating the interdependencies, culminating in the creation of DGPNet for Real-SR through simple stacking. Comparative results with 15 SOTA methods across three public datasets underscore the effectiveness and efficiency of our proposed approach. This paper introduces DGConv, a novel 'plug-and-play' convolutional unit that enhances detail and contrast representation in real-world image super-resolution (Real-SR) without increasing computational cost. Real-world low-resolution images suffer from complex degradations that disrupt texture arrangements and statistical properties, making detail and contrast restoration challenging. Existing methods struggle to address complex gradient arrangements and often introduce computational overhead. DGConv integrates learnable directional gradient and aggregation operations to enhance perception of regular and irregular textures, and image contrast. An equivalent parameter fusion method maintains computational cost comparable to Vanilla Convolution (VConv). An Adaptive Information Interaction Block (AIIBlock) balances texture and contrast enhancement. These components are combined in the Directional Gradient Perceiving Network (DGPNet). DGPNet outperforms 15 state-of-the-art Real-SR methods on benchmark datasets, achieving superior detail recovery and contrast enhancement with low computational complexity. Replacing VConv with DGConv in five classical SR methods consistently improves performance. Ablation studies confirm the contribution of each component in DGConv and the effectiveness of using local statistical mean for gradient and aggregation operations.
Exploration of additional directional arrangement convolutions to further enhance DGConv's representation capacity. Validation and extension of DGConv to other image and video super-resolution and restoration tasks. image super-resolution, real-world image super-resolution, deep learning, convolutional neural networks, directional gradient convolution
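The equivalent parameter fusion described above rests on the linearity of convolution: parallel gradient branches can be folded into a single kernel at inference time. Below is a minimal PyTorch sketch of that re-parameterization; the fixed difference kernels, the function `fuse_branches`, and the branch weights `alphas` are illustrative assumptions, not the paper's actual learnable operators.

```python
import torch
import torch.nn.functional as F

# Illustrative fixed 3x3 difference kernels standing in for basic directional
# gradients (DGConv's operators are learnable; these are placeholders).
CENTER = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
HORIZ  = torch.tensor([[0.,  0., 0.], [-1., 1.,  0.], [0.,  0., 0.]])
VERT   = torch.tensor([[0., -1., 0.], [ 0., 1.,  0.], [0.,  0., 0.]])

def fuse_branches(weight, alphas):
    """Fold parallel directional-gradient branches into one 3x3 kernel.

    weight : (C_out, C_in, 3, 3) base convolution kernel
    alphas : (3,) learned branch weights
    Because convolution and the per-tap differencing are both linear, the
    weighted sum of branches collapses into a single equivalent kernel, so
    inference costs the same as one vanilla convolution.
    """
    bases = torch.stack([CENTER, HORIZ, VERT]).to(weight)   # (3, 3, 3)
    fused = weight.clone()
    for a, d in zip(alphas, bases):
        fused = fused + a * (weight * d)   # kernel-wise differential op (sketch)
    return fused

x = torch.randn(1, 8, 32, 32)
w = torch.randn(16, 8, 3, 3)
alphas = torch.tensor([0.3, 0.5, 0.2])
y = F.conv2d(x, fuse_branches(w, alphas), padding=1)   # single-conv cost
```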
2405.06948 Report Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation Shengyuan Liu, Bo Wang, Ye Ma, Te Yang, Xipeng Cao, Quan Chen, Han Li, Di Dong, Peng Jiang Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. For generating compositional subjects, it often encounters problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process during inference time. This approach strengthens the attention map, allowing for precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel metric GroundingScore to evaluate subject alignment thoroughly. The obtained quantitative results serve as compelling evidence showcasing the effectiveness of our proposed method. The code will be released soon. This paper introduces SE-Guidance, a training-free method for subject-driven text-to-image generation that enhances attention maps to improve attribute binding and feature injection for each subject, particularly in compositional generation. Existing subject-driven generation models often require tedious fine-tuning and struggle with object missing and attribute mixing in compositional generation. This method addresses these limitations by providing a training-free approach. The method utilizes an image prompt adapter to inject subject representations and employs SE-Guidance during inference. SE-Guidance extracts subject attention maps, injects subject representations in the forward process, and enhances attention to subjects in the backward process. The method achieves comparable results to fine-tuned methods in single-concept generation while demonstrating superior text-image alignment. In compositional generation, SE-Guidance effectively addresses object missing and attribute mixing, surpassing baseline methods in preserving subject fidelity and text alignment. A novel metric, GroundingScore, is introduced for a more accurate evaluation of subject alignment in compositional generation. The method's effectiveness is limited by the expressive power of the underlying generative model, particularly for unique or rare objects. Addressing fine-grained composition relations with more than two subjects remains challenging and requires further exploration. text-to-image generation, diffusion models, subject-driven generation, compositional generation, attention mechanisms
2405.06914 Report Non-confusing Generation of Customized Concepts in Diffusion Models Wang Lin, Jingyuan Chen, Jiaxin Shi, Yichen Zhu, Chen Liang, Junzhong Miao, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs). It becomes even more pronounced in the generation of customized concepts, due to the scarcity of user-provided concept visual examples. By revisiting the two major stages leading to the success of TGDMs -- 1) contrastive image-language pre-training (CLIP) for a text encoder that encodes visual semantics, and 2) training the TGDM that decodes the textual embeddings into pixels -- we point out that existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one. To this end, we propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning. Specifically, given a few samples of customized concepts, we obtain non-confusing textual embeddings of a concept by fine-tuning CLIP via contrasting a concept and the over-segmented visual regions of other concepts. Experimental results demonstrate the effectiveness of CLIF in preventing the confusion of multi-customized concept generation. This paper introduces CLIF, a novel approach to prevent inter-concept visual confusion in composing multiple customized concepts using text-guided diffusion models. Existing methods for customized concept generation often lead to visual confusion, especially in complex compositions, hindering the generation of novel and distinct concepts. CLIF employs a two-stage fine-tuning approach: 1) Contrastive fine-tuning of the text encoder with an over-segmented concept dataset to distinguish textual embeddings. 2) Fine-tuning the text-to-image decoder to synthesize non-confusing images using the decoupled concept embeddings. CLIF effectively mitigates identity loss, attribute leaking, and concept missing in multi-concept customization. CLIF demonstrates superior performance in both qualitative and quantitative evaluations compared to state-of-the-art methods. Ablation studies confirm the importance of global, regional, and mix augmentation in enhancing identity preservation, attribute binding, and concept attendance respectively. Generating more than 2 customized concepts simultaneously, though achievable, has limitations requiring further research. The potential misuse of CLIF for creating deepfakes necessitates robust ethical guidelines and monitoring. text-guided diffusion models, concept customization, multi-concept generation, contrastive learning, image generation
2405.06535 Report Controllable Image Generation With Composed Parallel Token Prediction Jamie Stirling, Noura Al-Moubayed Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training (compositional generalisation). Despite recent progress in compositional image generation via composing continuous sampling processes such as diffusion and energy-based models, composing discrete generative processes has remained an open challenge, with the promise of providing improvements in efficiency, interpretability and simplicity. To this end, we propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space. Our approach, when applied alongside VQ-VAE and VQ-GAN, achieves state-of-the-art generation accuracy in three distinct settings (FFHQ, Positional CLEVR and Relational CLEVR) while attaining competitive Fr\'echet Inception Distance (FID) scores. Our method attains an average generation accuracy of $80.71\%$ across the studied settings. Our method also outperforms the next-best approach (ranked by accuracy) in terms of FID in seven out of nine experiments, with an average FID of $24.23$ (an average improvement of $-9.58$). Furthermore, our method offers a $2.3\times$ to $12\times$ speedup over comparable continuous compositional methods on our hardware. We find that our method can generalise to combinations of input conditions that lie outside the training data (e.g. more objects per image) in addition to offering an interpretable dimension of controllability via concept weighting. We further demonstrate that our approach can be readily applied to an open pre-trained discrete text-to-image model without any fine-tuning, allowing for fine-grained control of text-to-image generation. This paper introduces a novel method for controllable conditional image generation by composing discrete iterative generative processes, achieving state-of-the-art accuracy. Compositional generalisation in image generation, particularly the ability to handle unfamiliar combinations of concepts, is crucial for creating AI with human-like intelligence. The method involves: (1) deriving formulae for logical operations on probabilistic outputs of discrete models, (2) adapting this for parallel token prediction, and (3) employing concept weighting for enhanced control. The method achieves state-of-the-art generation accuracy across three datasets (FFHQ, Positional CLEVR, Relational CLEVR) outperforming existing techniques. It offers competitive Fréchet Inception Distance (FID) scores, indicating good image quality. The approach is computationally efficient, demonstrating a 2.3x to 12x speedup compared to similar continuous methods. The method requires multiple feed-forward operations, potentially impacting efficiency, although this is mitigated by fast convergence. The assumption of independent input conditions might not always hold in real-world scenarios, potentially limiting generalisation. Future work could explore learned concept-weighting policies for enhanced controllability. image generation, compositional generalisation, discrete generative models, parallel token prediction, controllable generation
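The composition described in this entry operates on the categorical outputs of a discrete generator. The sketch below shows one plausible product-of-experts-style conjunction of per-token log-probabilities with concept weights; the function name and the exact correction term are assumptions, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def compose_token_logits(logits_uncond, cond_logits, weights):
    """Compose per-token categorical distributions from a discrete generator.

    logits_uncond : (L, V) unconditional logits over the token vocabulary
    cond_logits   : list of (L, V) logits, one per condition
    weights       : list of per-concept weights w_i

    Conjunction in log-space (product-of-experts style):
        log p(x | c1 AND ... AND cn) ∝ log p(x) + sum_i w_i * (log p(x|c_i) - log p(x))
    """
    logp_u = F.log_softmax(logits_uncond, dim=-1)
    composed = logp_u.clone()
    for w, logits_c in zip(weights, cond_logits):
        logp_c = F.log_softmax(logits_c, dim=-1)
        composed = composed + w * (logp_c - logp_u)
    return composed  # renormalise with softmax before sampling tokens

# toy usage: 16 latent token positions, vocabulary of 1024 codes
u = torch.randn(16, 1024)
c1, c2 = torch.randn(16, 1024), torch.randn(16, 1024)
probs = F.softmax(compose_token_logits(u, [c1, c2], [1.0, 1.0]), dim=-1)
tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
```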
2405.06525 Report Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation Xiaowen Ma, Zhenliang Ni, Xinghao Chen Vanilla pixel-level classifiers for semantic segmentation are based on a certain paradigm, involving the inner product of fixed prototypes obtained from the training set and pixel features in the test image. This approach, however, encounters significant limitations, i.e., feature deviation in the semantic domain and information loss in the spatial domain. The former struggles with large intra-class variance among pixel features from different images, while the latter fails to utilize the structured information of semantic objects effectively. This leads to blurred mask boundaries as well as a deficiency of fine-grained recognition capability. In this paper, we propose a novel Semantic and Spatial Adaptive (SSA) classifier to address the above challenges. Specifically, we employ the coarse masks obtained from the fixed prototypes as a guide to adjust the fixed prototype towards the center of the semantic and spatial domains in the test image. The adapted prototypes in semantic and spatial domains are then simultaneously considered to accomplish classification decisions. In addition, we propose an online multi-domain distillation learning strategy to improve the adaption process. Experimental results on three publicly available benchmarks show that the proposed SSA significantly improves the segmentation performance of the baseline models with only a minimal increase in computational cost. Code is available at https://github.com/xwmaxwma/SSA. This paper presents a Semantic and Spatial Adaptive (SSA) classifier designed to enhance pixel-level classification for semantic segmentation. Vanilla pixel-level classifiers suffer from limitations like feature deviation in the semantic domain and information loss in the spatial domain, leading to inaccurate segmentation. The SSA classifier uses coarse masks to guide the adaptation of fixed prototypes towards semantic and spatial centers in test images, capturing both semantic and spatial relationships for classification. It also employs online multi-domain distillation learning to refine feature representation and constrain prototype adaptation. SSA significantly improves segmentation performance on ADE20K, PASCAL-Context, and COCO-Stuff-10K datasets with minimal computational overhead. It outperforms other state-of-the-art classifiers like GMMSeg and CAC. The method enables lightweight models to achieve state-of-the-art performance in real-time segmentation tasks. The method requires 1.1 times more training time compared to the baseline due to the use of a teacher classifier. Future work can explore the integration of attention mechanisms into SSA to further enhance feature representation. semantic segmentation, pixel-level classification, adaptive classifier, multi-domain distillation, spatial reasoning
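A toy sketch of the semantic-adaptation step described above: coarse scores from the fixed prototypes are used to pull each prototype toward the class center observed in the test image before re-classifying. The 50/50 blend, the softmax temperature, and the omission of the spatial-domain branch and the distillation strategy are simplifications, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def ssa_classify(feats, prototypes, tau=1.0):
    """Semantic-adaptive classification (simplified sketch).

    feats      : (C, H, W) pixel features of the test image
    prototypes : (K, C) fixed class prototypes from training
    """
    C, H, W = feats.shape
    f = feats.flatten(1).t()                 # (HW, C)
    coarse = f @ prototypes.t()              # (HW, K) coarse scores / masks
    soft = F.softmax(coarse / tau, dim=0)    # per-class weights over pixels

    # adapt each prototype toward the class center observed in this image
    adapted = soft.t() @ f                   # (K, C) weighted feature means
    adapted = F.normalize(0.5 * prototypes + 0.5 * adapted, dim=-1)

    logits = f @ adapted.t()                 # re-classify with adapted prototypes
    return logits.t().reshape(-1, H, W)      # (K, H, W)
```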
2405.06461 Report SketchDream: Sketch-based Text-to-3D Generation and Editing Feng-Lin Liu, Hongbo Fu, Yu-Kun Lai, Lin Gao Existing text-based 3D generation methods generate attractive results but lack detailed geometry control. Sketches, known for their conciseness and expressiveness, have contributed to intuitive 3D modeling but are confined to producing texture-less mesh models within predefined categories. Integrating sketch and text simultaneously for 3D generation promises enhanced control over geometry and appearance but faces challenges from 2D-to-3D translation ambiguity and multi-modal condition integration. Moreover, further editing of 3D models in arbitrary views will give users more freedom to customize their models. However, it is difficult to achieve high generation quality, preserve unedited regions, and manage proper interactions between shape components. To solve the above issues, we propose a text-driven 3D content generation and editing method, SketchDream, which supports NeRF generation from given hand-drawn sketches and achieves free-view sketch-based local editing. To tackle the 2D-to-3D ambiguity challenge, we introduce a sketch-based multi-view image generation diffusion model, which leverages depth guidance to establish spatial correspondence. A 3D ControlNet with a 3D attention module is utilized to control multi-view images and ensure their 3D consistency. To support local editing, we further propose a coarse-to-fine editing approach: the coarse phase analyzes component interactions and provides 3D masks to label edited regions, while the fine stage generates realistic results with refined details by local enhancement. Extensive experiments validate that our method generates higher-quality results compared with a combination of 2D ControlNet and image-to-3D generation techniques and achieves detailed control compared with existing diffusion-based 3D editing approaches. SketchDream, a novel method for text-driven 3D content generation and editing that leverages user-provided sketches to enable fine-grained control over object geometry and appearance. Existing text-to-3D generation methods lack detailed control over geometry, while sketch-based methods are limited in generating textured 3D models. SketchDream combines the expressiveness of sketches with the semantic richness of text prompts to allow users to create and edit high-quality 3D content with greater precision. The method utilizes a sketch-based multi-view image generation diffusion model to generate realistic multi-view images from input sketches and text prompts. It employs depth-guided warping to establish spatial correspondence and a 3D attention module for cross-view consistency. A coarse-to-fine editing framework is introduced for local editing, refining the initial results with a precise 3D mask and a local rendering strategy for enhanced quality and sketch faithfulness. SketchDream generates higher-quality 3D content than existing sketch-based text-to-3D baselines, achieving better geometry and appearance realism. The method enables detailed control over 3D model generation by combining text prompts for appearance and sketches for shape and texture. SketchDream outperforms existing sketch-based 3D editing approaches, offering more realistic editing results while preserving unedited regions. The generation and editing quality may be degraded for objects that are rare in the training dataset. The current implementation is computationally expensive, limiting interactive generation and editing. 
sketch-based interaction, diffusion models, neural radiance fields, 3d generation, 3d editing
2405.06408 Report I3DGS: Improve 3D Gaussian Splatting from Multiple Dimensions Jinwei Lin 3D Gaussian Splatting is a novel method for 3D view synthesis that can produce rendering results comparable to traditional implicit neural rendering while maintaining higher-definition, faster rendering. However, it is still difficult to make 3D Gaussian Splatting efficient enough for practical applications. To address this issue, we propose I3DS, a solution and set of experiments for evaluating and improving model performance. Across multiple important dimensions of the original 3D Gaussian Splatting, we ran more than two thousand experiments to test how different items and components affect the training efficiency of the 3D Gaussian Splatting model. In this paper, we share extensive findings and methods on how to improve training and performance, and on the impact of different components of the model. An integer compression in base 95 and a floating-point compression in base 94 with an ASCII encoding and decoding mechanism are presented. Numerous experiments and test results are recorded. After a series of reasonable fine-tuning steps, I3DS achieves clear performance improvements over the original. The project code is available as open source. This paper proposes I3DS, a method to improve the training efficiency of 3D Gaussian Splatting models for 3D view synthesis. 3D Gaussian Splatting is a promising technique for 3D view synthesis, offering high resolution and fast rendering speed. However, its training efficiency requires improvement for practical applications. The paper explores improvements from multiple dimensions: analyzing the impact of color components and backgrounds, optimizing learning rates, and introducing data compression techniques. Removing color components during training significantly improves speed (5-8x) but adding them back is challenging and requires further investigation. Setting the maximum degree of Spherical Harmonics coefficients to 0 speeds up training (16.57% improvement) without significantly impacting the rendering quality. Customizing learning rates, particularly the XYZ learning rate and scaling learning rate, demonstrably enhances training speed. Adding back color information after training without color is a significant challenge and requires further research to achieve satisfactory results. The proposed ASCII encoding-decoding compression offers modest speed improvements (around 6%) and further optimization is needed for compressing color matrices. 3d gaussian splatting, 3d view synthesis, training efficiency, spherical harmonics, data compression
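A plausible reading of the base-95 ASCII integer codec mentioned above (the paper's exact encoding and the base-94 float variant are not reproduced here): non-negative integers are written in base 95 using the printable ASCII characters 32-126.

```python
# Hypothetical base-95 integer codec over printable ASCII (codes 32..126).
BASE = 95
OFFSET = 32  # first printable ASCII character (space)

def encode_base95(n: int) -> str:
    if n == 0:
        return chr(OFFSET)
    digits = []
    while n > 0:
        n, r = divmod(n, BASE)
        digits.append(chr(OFFSET + r))
    return "".join(reversed(digits))

def decode_base95(s: str) -> int:
    n = 0
    for ch in s:
        n = n * BASE + (ord(ch) - OFFSET)
    return n

assert decode_base95(encode_base95(123456789)) == 123456789
```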
2405.06241 Report MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization Pengcheng Zhu, Yaoming Zhuang, Baoquan Chen, Li Li, Chengdong Wu, Zhanlin Liu This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently, Gaussian Splatting-based SLAM has yielded promising results, but it relies on RGB-D input and is weak in tracking. To address these limitations, we uniquely integrate advanced sparse visual odometry with a dense Gaussian Splatting scene representation for the first time, thereby eliminating the dependency on depth maps typical of Gaussian Splatting-based SLAM systems and enhancing tracking robustness. Here, the sparse visual odometry tracks camera poses in the RGB stream, while Gaussian Splatting handles map reconstruction. These components are interconnected through a Multi-View Stereo (MVS) depth estimation network. We also propose a depth smooth loss to reduce the negative effect of estimated depth maps. Furthermore, the consistency in scale between the sparse visual odometry and the dense Gaussian map is preserved by a Sparse-Dense Adjustment Ring (SDAR). We have evaluated our system across various synthetic and real-world datasets. The accuracy of our pose estimation surpasses existing methods and achieves state-of-the-art performance. Additionally, it outperforms previous monocular methods in terms of novel view synthesis fidelity, matching the results of neural SLAM systems that utilize RGB-D input. Introduces MGS-SLAM, a novel monocular dense SLAM system that combines sparse visual odometry with 3D Gaussian Splatting for the first time. Addresses limitations of existing Gaussian Splatting-based SLAM systems that rely on RGB-D input and suffer from weak tracking, enabling dense mapping with only RGB images. Integrates sparse visual odometry (DPVO) with a dense Gaussian Splatting scene representation. Employs a pre-trained MVS depth estimation network to bridge the two components and proposes a depth smooth loss and Sparse-Dense Adjustment Ring (SDAR) to ensure geometric accuracy and scale consistency. Achieves state-of-the-art pose estimation accuracy, outperforming previous monocular and some RGB-D methods on TUM and Replica datasets. Demonstrates robust tracking on large-scale datasets like Replica, unlike previous monocular Gaussian Splatting-based SLAM systems. Produces high-fidelity novel view synthesis results, comparable to neural SLAM systems using RGB-D input. Real-time performance is still limited compared to some traditional SLAM methods. Further research on incorporating loop closure and global optimization techniques could improve performance in challenging scenarios. slam, gaussian splatting, monocular vision, dense mapping, differentiable rendering
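The entry does not spell out the depth smooth loss, so the sketch below shows a common edge-aware formulation that down-weights depth gradients at strong image edges; treat it as an assumption about the general form, not the paper's exact loss.

```python
import torch

def depth_smooth_loss(depth, image):
    """Edge-aware depth smoothness (a common formulation; the paper's exact
    loss may differ). depth: (B, 1, H, W), image: (B, 3, H, W)."""
    dzdx = torch.abs(depth[..., :, 1:] - depth[..., :, :-1])
    dzdy = torch.abs(depth[..., 1:, :] - depth[..., :-1, :])
    didx = torch.mean(torch.abs(image[..., :, 1:] - image[..., :, :-1]), 1, keepdim=True)
    didy = torch.mean(torch.abs(image[..., 1:, :] - image[..., :-1, :]), 1, keepdim=True)
    # down-weight depth gradients where the image itself has strong edges
    return (dzdx * torch.exp(-didx)).mean() + (dzdy * torch.exp(-didy)).mean()
```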
2405.06147 Report State-Free Inference of State-Space Models: The Transfer Function Approach Rom N. Parnichkun, Stefano Massaroli, Alessandro Moro, Jimmy T. H. Smith, Ramin Hasani, Mathias Lechner, Qi An, Christopher Ré, Hajime Asama, Stefano Ermon, Taiji Suzuki, Atsushi Yamashita, Michael Poli We approach designing a state-space model for deep learning applications through its dual representation, the transfer function, and uncover a highly efficient sequence parallel inference algorithm that is state-free: unlike other proposed algorithms, state-free inference does not incur any significant memory or computational cost with an increase in state size. We achieve this using properties of the proposed frequency domain transfer function parametrization, which enables direct computation of its corresponding convolutional kernel's spectrum via a single Fast Fourier Transform. Our experimental results across multiple sequence lengths and state sizes illustrate, on average, a 35% training speed improvement over S4 layers -- parametrized in time-domain -- on the Long Range Arena benchmark, while delivering state-of-the-art downstream performances over other attention-free approaches. Moreover, we report improved perplexity in language modeling over a long convolutional Hyena baseline, by simply introducing our transfer function parametrization. Our code is available at https://github.com/ruke1ire/RTF. Presents Rational Transfer Function (RTF), a novel parametrization of state-space models (SSM) for sequence processing based on a frequency domain representation. Addresses limitations of existing SSMs like restricted expressiveness due to diagonal state transition matrices and high memory cost in parallel scan-based inference. Leverages the transfer function, the dual of impulse response, to design a state-free parallel inference algorithm based on the Fast Fourier Transform (FFT). Achieves state-of-the-art accuracy among attention-free models on the Long Range Arena benchmark. Demonstrates faster training speed compared to S4 and S4D layers across different state sizes. Shows improved perplexity over a long convolutional Hyena baseline in language modeling by introducing the transfer function parametrization. RTF with small state sizes struggled to learn a policy beyond random guessing on the Path-X task of the LRA benchmark. Directly training RTF on language modeling exhibited instability issues, necessitating the use of parameter constraints and specific initialization schemes. state-space model, transfer function, sequence modeling, parallel inference, frequency domain
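A small sketch of the frequency-domain idea described in this entry: with numerator coefficients b and a monic denominator with coefficients a, the kernel spectrum is the transfer function evaluated at the roots of unity, i.e. a ratio of two FFTs, so no state is ever materialized. The function names and the FFT-convolution wrapper below are illustrative assumptions, not the released RTF code.

```python
import torch

def rtf_kernel(b, a, L):
    """Length-L (aliased) convolution kernel from transfer-function coefficients.

    b : (d+1,) numerator coefficients b_0..b_d
    a : (d,)   denominator coefficients a_1..a_d (a_0 = 1 implicitly)
    The kernel spectrum is H evaluated at the L-th roots of unity:
        fft(b, L) / fft([1, a], L)
    """
    num = torch.fft.fft(b.to(torch.complex64), n=L)
    den = torch.fft.fft(torch.cat([torch.ones(1), a]).to(torch.complex64), n=L)
    h_hat = num / den                    # kernel spectrum, computed state-free
    return torch.fft.ifft(h_hat).real    # time-domain kernel, if needed

def fft_conv(u, b, a):
    """Causal convolution of input u (..., L) with the RTF kernel via FFT."""
    L = u.shape[-1]
    U = torch.fft.rfft(u, n=2 * L)
    H = torch.fft.rfft(rtf_kernel(b, a, 2 * L), n=2 * L)
    return torch.fft.irfft(U * H, n=2 * L)[..., :L]
```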
2405.05967 Report Distilling Diffusion Models into Conditional GANs Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models - DMD, SDXL-Turbo, and SDXL-Lightning - on the zero-shot COCO benchmark. This paper introduces Diffusion2GAN, a method to distill a multi-step diffusion model into a single-step conditional GAN, accelerating inference while preserving image quality. Diffusion models excel in image synthesis but suffer from slow inference due to multi-step sampling. This work addresses this limitation for real-time applications. The method interprets distillation as paired image-to-image translation using noise-image pairs from the diffusion ODE trajectory. It leverages E-LatentLPIPS, a proposed efficient perceptual loss in latent space, and a multi-scale conditional diffusion discriminator. Diffusion2GAN outperforms one-step diffusion distillation models like DMD, SDXL-Turbo, and SDXL-Lightning on zero-shot COCO benchmark. The proposed E-LatentLPIPS significantly accelerates training and improves performance compared to pixel-based perceptual losses. The multi-scale diffusion discriminator, initialized from a pre-trained diffusion model, further enhances image quality and text alignment. The current method uses a fixed classifier-free guidance scale, limiting control over text adherence. The performance of the distilled model is limited by the quality of the teacher diffusion model. diffusion models, generative adversarial networks, knowledge distillation, image synthesis, text-to-image generation
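A sketch of the ensembled latent-space perceptual loss idea (E-LatentLPIPS): the same random augmentation is applied to both the student and teacher latents before a latent-space perceptual metric is evaluated. Here `lpips_latent` is a placeholder for such a metric and the translation-only augmentation is an assumption; the paper's augmentation ensemble is richer.

```python
import torch
import torch.nn.functional as F

def random_affine(batch, device, max_shift=0.25):
    """Random translation-only affine parameters for F.affine_grid (sketch)."""
    theta = torch.zeros(batch, 2, 3, device=device)
    theta[:, 0, 0] = theta[:, 1, 1] = 1.0
    theta[:, :, 2] = (torch.rand(batch, 2, device=device) - 0.5) * 2 * max_shift
    return theta

def e_latent_lpips(lpips_latent, z_student, z_teacher, n_aug=4):
    """Ensembled perceptual loss computed directly on diffusion latents.

    lpips_latent : assumed perceptual metric trained on latents (placeholder)
    """
    loss = 0.0
    for _ in range(n_aug):
        # sample one random geometric augmentation and apply it to BOTH latents
        theta = random_affine(z_student.shape[0], z_student.device)
        grid = F.affine_grid(theta, list(z_student.shape), align_corners=False)
        zs = F.grid_sample(z_student, grid, align_corners=False)
        zt = F.grid_sample(z_teacher, grid, align_corners=False)
        loss = loss + lpips_latent(zs, zt).mean()
    return loss / n_aug
```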
2405.05953 Report Frame Interpolation with Consecutive Brownian Bridge Diffusion Zonglin Lyu, Ming Li, Jianbo Jiao, Chen Chen Recent work in Video Frame Interpolation (VFI) tries to formulate VFI as a diffusion-based conditional image generation problem, synthesizing the intermediate frame given a random noise and neighboring frames. Due to the relatively high resolution of videos, Latent Diffusion Models (LDMs) are employed as the conditional generation model, where the autoencoder compresses images into latent representations for diffusion and then reconstructs images from these latent representations. Such a formulation poses a crucial challenge: VFI expects that the output is deterministically equal to the ground truth intermediate frame, but LDMs randomly generate a diverse set of different images when the model runs multiple times. The reason for the diverse generation is that the cumulative variance (variance accumulated at each step of generation) of generated latent representations in LDMs is large. This makes the sampling trajectory random, resulting in diverse rather than deterministic generations. To address this problem, we propose our unique solution: Frame Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we propose consecutive Brownian Bridge diffusion that takes a deterministic initial value as input, resulting in a much smaller cumulative variance of generated latent representations. Our experiments suggest that our method can improve together with the improvement of the autoencoder and achieve state-of-the-art performance in VFI, leaving strong potential for further enhancement. This paper introduces a novel consecutive Brownian Bridge diffusion model for Video Frame Interpolation (VFI), aiming to address the deterministic output requirement of VFI, which is not fulfilled by the random generation nature of traditional Latent Diffusion Models (LDMs). VFI requires deterministic output for a given input frame pair, while traditional LDMs generate diverse images due to high cumulative variance in the generation process, leading to difficulties in accurate intermediate frame interpolation. This work formulates VFI as a two-stage process: autoencoder and ground truth estimation. It proposes consecutive Brownian Bridge diffusion, which transits among three deterministic endpoints (previous, intermediate, next frames) to minimize cumulative variance. An autoencoder with warped feature pyramids from neighboring frames further enhances detail preservation in the generated frames. The proposed consecutive Brownian Bridge diffusion model demonstrates superior ground truth estimation compared to traditional conditional generation diffusion models. The improved autoencoder effectively reduces overlaid image artifacts commonly observed in previous LDM-based VFI methods. The method achieves state-of-the-art performance on standard VFI benchmarks, particularly excelling in motion consistency metrics like FloLPIPS. The current method utilizes a bisection-like approach for multi-frame interpolation, limiting its ability to directly interpolate arbitrary time steps. Future work could explore more sophisticated autoencoder architectures or diffusion model designs to further enhance interpolation quality. video frame interpolation, diffusion models, brownian bridge, autoencoder, conditional image generation
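The consecutive bridge builds on the standard Brownian bridge, whose variance vanishes at both pinned endpoints, which is what keeps the cumulative variance of the trajectory small. A minimal sampling sketch follows; the paper's three-endpoint "consecutive" construction and its training objective are not reproduced.

```python
import torch

def brownian_bridge_sample(x0, xT, t, T=1.0):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and xT (t=T).

    The mean interpolates the two fixed endpoints and the variance t(T-t)/T
    vanishes at both ends, unlike noise-to-image diffusion whose endpoint is
    pure Gaussian noise.
    """
    s = t / T
    mean = (1.0 - s) * x0 + s * xT
    std = torch.sqrt(t * (T - t) / T)
    return mean + std * torch.randn_like(x0)

# e.g. bridging the latent of the previous frame toward the next frame's latent
z_prev, z_next = torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32)
z_mid = brownian_bridge_sample(z_prev, z_next, t=torch.tensor(0.5))
```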
2405.05949 Report CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated Mixture-of-experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage. Auxiliary losses are used to ensure a balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks using models within each model size group, all while training exclusively on open-sourced datasets. The code and model weights for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo. CuMo enhances multimodal Large Language Models (LLMs) by incorporating co-upcycled sparsely-gated Mixture-of-Experts (MoE) blocks into the vision encoder and MLP connector. Scaling multimodal LLMs via increasing data and model size is computationally expensive. CuMo improves efficiency by enhancing visual capabilities with minimal additional parameters during inference. CuMo employs a three-stage training process: MLP connector pre-training, pre-finetuning the whole model, and visual instruction tuning with co-upcycled MoE blocks. Auxiliary losses ensure balanced expert loading. Outperforms state-of-the-art multimodal LLMs on various VQA and visual-instruction-following benchmarks. Achieves performance comparable to larger models while using a smaller LLM size. Demonstrates effectiveness of co-upcycled MoE blocks and training recipe through ablation studies. Hallucinations observed in responses, requiring further investigation for mitigation. Limited exploration of scaling vision encoders beyond the used CLIP architecture. multimodal llm, mixture-of-experts, vision-language model, co-upcycling, visual instruction tuning
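A compact sketch of co-upcycling with Top-K sparse gating as described above: each expert is initialized as a copy of the pre-trained MLP block and tokens are routed to the K highest-scoring experts. The auxiliary load-balancing loss and the exact gating details are omitted; class and variable names are assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    """Top-K sparsely-gated MoE whose experts are 'co-upcycled', i.e. each
    expert starts as a copy of an already pre-trained MLP block (sketch)."""

    def __init__(self, pretrained_mlp: nn.Module, dim: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [copy.deepcopy(pretrained_mlp) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.gate(x)                      # (tokens, E)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalise over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
moe = UpcycledMoE(mlp, dim=64)
y = moe(torch.randn(10, 64))
```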
2405.05945 Report Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community. Introduces Lumina-T2X, a unified framework based on Flow-based Large Diffusion Transformers (Flag-DiT) for generating various modalities (images, videos, 3D objects, audio) from text at arbitrary resolutions and lengths. Addresses limitations of previous models like Sora and Stable Diffusion 3 by providing a unified framework, detailed implementation instructions, and publicly available pre-trained checkpoints. Employs Flag-DiT, incorporating improvements like RoPE, RMSNorm, KQ-norm, and flow matching formulation for scalability and stability. Utilizes learnable placeholders like '[nextline]' and '[nextframe]' tokens for handling arbitrary resolutions and lengths. Flag-DiT significantly outperforms existing models on ImageNet benchmark, demonstrating faster convergence and higher sample quality with increasing model size. Lumina-T2I achieves superior visual quality and text alignment in image generation, enabling resolution extrapolation, style-consistent generation, compositional generation, and high-resolution editing in a training-free manner. Lumina-T2V, Lumina-T2MV, and Lumina-T2Speech show promising preliminary results in generating temporally and spatially consistent videos, multi-view 3D objects, and speech from text prompts, respectively. 
Current version trains each modality separately due to data imbalance and diverse latent space distributions, hindering joint learning. Limited data coverage leads to challenges in generating complex real-world details, such as human hands or intricate scenes. diffusion models, text-to-image generation, text-to-video generation, multi-modal generation, resolution extrapolation
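A toy sketch of the unified flattening described in this entry: a (frames, height, width) latent grid is serialized into one token sequence with learnable [nextline] and [nextframe] placeholders, which is what lets a single model handle arbitrary resolutions and durations. Shapes and names are illustrative.

```python
import torch

def flatten_with_placeholders(latents, nextline_tok, nextframe_tok):
    """Flatten a (T, H, W, D) spatial-temporal latent grid into one 1D token
    sequence, inserting [nextline] after each row and [nextframe] after each
    frame (a sketch of the unified representation, not the released code)."""
    T, H, W, D = latents.shape
    seq = []
    for t in range(T):
        for h in range(H):
            seq.append(latents[t, h])            # W tokens of one row
            seq.append(nextline_tok[None, :])    # learnable [nextline]
        seq.append(nextframe_tok[None, :])       # learnable [nextframe]
    return torch.cat(seq, dim=0)

D = 16
lat = torch.randn(2, 4, 4, D)                    # 2 frames of 4x4 latents
nl, nf = torch.randn(D), torch.randn(D)
tokens = flatten_with_placeholders(lat, nl, nf)
print(tokens.shape)                              # torch.Size([42, 16]) = 2*(16+4)+2 tokens
```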
2405.05858 Report Free-Moving Object Reconstruction and Pose Estimation with Virtual Camera Haixin Shi, Yinlin Hu, Daniel Koguciuk, Juan-Ting Lin, Mathieu Salzmann, David Ferstl We propose an approach for reconstructing a free-moving object from a monocular RGB video. Most existing methods either assume a scene prior, hand pose prior, or object category pose prior, or rely on local optimization with multiple sequence segments. We propose a method that allows free interaction with the object in front of a moving camera without relying on any prior, and optimizes the sequence globally without any segments. We progressively optimize the object shape and pose simultaneously based on an implicit neural representation. A key aspect of our method is a virtual camera system that reduces the search space of the optimization significantly. We evaluate our method on the standard HO3D dataset and a collection of egocentric RGB sequences captured with a head-mounted device. We demonstrate that our approach outperforms most methods significantly, and is on par with recent techniques that assume prior information. This paper presents a novel approach for reconstructing and estimating the pose of a rigid, dynamic object from a monocular RGB video, without relying on prior information like hand poses, object categories, or scene geometry. Existing methods for 3D object reconstruction often rely on restrictive assumptions such as static objects, hand-held rotations, or prior knowledge of object categories, limiting their applicability in general scenarios with free-moving objects. The method leverages a virtual camera system guided by 2D object masks to simplify the optimization process. It optimizes object shape and pose progressively with a single network, using a simplified 4-DOF pose representation and incorporating 2D matches between frames. Finally, it refines the results in the real camera coordinate system using a PnP solver. The method outperforms existing pose-free methods on the HO3D dataset and achieves comparable performance to methods relying on ground-truth poses or depth information. It effectively handles free-moving objects and generalizes well to egocentric sequences captured with a head-mounted device. The proposed virtual camera system is shown to significantly improve optimization stability and accuracy. The method struggles with objects that are heavily occluded for extended periods during capture. Reconstruction of small, texture-less objects remains challenging due to the lack of distinctive features. 3d reconstruction, pose estimation, virtual camera, implicit neural representation, monocular rgb video
2405.05846 Report Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models Zhe Ma, Xuhong Zhang, Qingming Li, Tianyu Du, Wenzhi Chen, Zonghui Wang, Shouling Ji The past few years have witnessed substantial advancement in text-guided image generation powered by diffusion models. However, it was shown that text-to-image diffusion models are vulnerable to training image memorization, raising concerns on copyright infringement and privacy invasion. In this work, we perform practical analysis of memorization in text-to-image diffusion models. Targeting a set of images to protect, we conduct quantitative analysis on them without the need to collect any prompts. Specifically, we first formally define the memorization of image and identify three necessary conditions of memorization, namely similarity, existence, and probability. We then reveal the correlation between the model's prediction error and image replication. Based on the correlation, we propose to utilize inversion techniques to verify the safety of target images against memorization and measure the extent to which they are memorized. Model developers can utilize our analysis method to discover memorized images or reliably claim safety against memorization. Extensive experiments on the Stable Diffusion, a popular open-source text-to-image diffusion model, demonstrate the effectiveness of our analysis method. This paper presents a practical, image-based method for analyzing and measuring memorization in text-to-image diffusion models, aiming to help developers identify and address potential copyright and privacy risks. Memorization in text-to-image models raises significant concerns about copyright infringement and privacy violation, as these models can potentially replicate images from their training data. The authors define three necessary conditions for memorization: similarity, existence, and probability. They leverage the model's prediction error as a metric for image replication and propose prompt and noise inversion techniques to analyze the existence and probability of memorization for target images. The model's prediction error is highly correlated with image replication, providing a reliable metric for identifying memorized images. Unconditional diffusion models trained on large-scale datasets show resilience against memorization and can serve as a baseline for measuring memorization in conditional models. The proposed method can effectively quantify the extent of memorization for a given image, enabling developers to assess and address potential risks. The hard prompt inversion algorithm, while more effective than existing methods, needs improvement for higher accuracy and applicability to a wider range of memorized images. Future work should extend the analysis to different conditional diffusion models beyond text-to-image generation and explore corresponding regularization techniques. memorization, text-to-image diffusion models, privacy, copyright, inversion techniques
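A sketch of the prediction-error signal used above, assuming a diffusers-style latent-diffusion UNet and noise scheduler (both assumptions, not the authors' code): the model's noise-prediction error on the target image's latent is averaged over a few timesteps, and unusually low error flags possible memorization.

```python
import torch

@torch.no_grad()
def prediction_error(unet, scheduler, latent, cond_emb, n_steps=10):
    """Average noise-prediction error of a latent diffusion model on one image
    latent; unusually low error is the signal for possible memorization
    (illustrative sketch assuming a diffusers-style UNet and scheduler)."""
    errs = []
    timesteps = torch.linspace(
        0, scheduler.config.num_train_timesteps - 1, n_steps).long()
    for t in timesteps:
        noise = torch.randn_like(latent)
        noisy = scheduler.add_noise(latent, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=cond_emb).sample
        errs.append(torch.mean((pred - noise) ** 2).item())
    return sum(errs) / len(errs)
```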
2405.05806 Report MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Lei Zhang, Wangmeng Zuo Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images with human identities indicated by the reference images. Although promising identity fidelity has been achieved by several tuning-free methods, they usually suffer from overfitting issues. The learned identity tends to entangle with irrelevant information, resulting in unsatisfactory text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both faithful identity fidelity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers the image generation through additional introduced cross attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of our MasterWeaver with those of the original T2I model. Additionally, a face-augmented dataset is constructed to facilitate disentangled identity learning, and further improve the editability. Extensive experiments demonstrate that our MasterWeaver can not only generate personalized images with faithful identity, but also exhibit superiority in text controllability. Our code will be publicly available at https://github.com/csyxwei/MasterWeaver. Proposes MasterWeaver, a tuning-free method for personalized text-to-image generation that balances identity fidelity and editability. Existing methods struggle to balance faithful identity preservation with flexible control over attributes and context in generated images. Uses an encoder to extract identity features from a reference image, injects these features into a Stable Diffusion model via cross-attention, and employs an editing direction loss and face-augmented dataset during training to enhance editability. Generates high-quality personalized images with faithful identity preservation in diverse scenarios. Demonstrates superior text controllability compared to state-of-the-art methods, enabling flexible editing of attributes, clothing, background, and style. Achieves competitive inference speed, generating an image in 4 seconds on a single V100 GPU. Limited ability to generate images with multiple personalized identities. Challenges in achieving precise control over attributes due to the coarse granularity of text representations. text-to-image generation, personalized image synthesis, identity preservation, text controllability, diffusion models
2405.05800 Report DragGaussian: Enabling Drag-style Manipulation on 3D Gaussian Representation Sitian Shen, Jing Xu, Yuheng Yuan, Xingyi Yang, Qiuhong Shen, Xinchao Wang User-friendly 3D object editing is a challenging task that has attracted significant attention recently. The limitations of direct 3D object editing without 2D prior knowledge have prompted increased attention towards utilizing 2D generative models for 3D editing. While existing methods like Instruct NeRF-to-NeRF offer a solution, they often lack user-friendliness, particularly due to semantic guided editing. In the realm of 3D representation, 3D Gaussian Splatting emerges as a promising approach for its efficiency and natural explicit property, facilitating precise editing tasks. Building upon these insights, we propose DragGaussian, a 3D object drag-editing framework based on 3D Gaussian Splatting, leveraging diffusion models for interactive image editing with open-vocabulary input. This framework enables users to perform drag-based editing on pre-trained 3D Gaussian object models, producing modified 2D images through multi-view consistent editing. Our contributions include the introduction of a new task, the development of DragGaussian for interactive point-based 3D editing, and comprehensive validation of its effectiveness through qualitative and quantitative experiments. DragGaussian, a novel 3D object drag-editing framework based on 3D Gaussian Splatting that leverages diffusion models for interactive editing with open-vocabulary input. Existing 3D editing methods using 2D generative models often lack user-friendliness due to their reliance on semantic guided editing. DragGaussian addresses this by enabling intuitive drag-based editing on 3D Gaussian models. DragGaussian uses a user interface for drag point selection, projects them onto multi-view 2D images, employs a fine-tuned multi-view diffusion model for consistent editing, and refines the original 3D Gaussian model with the edited 2D images. DragGaussian enables interactive point-based manipulation of 3D Gaussian objects. Multi-view consistent editing ensures coherent modifications across different viewpoints. Fine-tuning the diffusion model with multi-view LoRA enhances identity preservation and editing accuracy. Reliance on diffusion models prevents real-time editing. Constraints of the pre-trained MVDream network may limit the quality of editing results on certain datasets. 3d object editing, 3d gaussian splatting, diffusion models, multi-view consistency, drag-based editing
2405.05768 Report FastScene: Text-Driven Fast 3D Indoor Scene Generation via Panoramic Gaussian Splatting Yikun Ma, Dandan Zhan, Zhi Jin Text-driven 3D indoor scene generation holds broad applications, ranging from gaming and smart homes to AR/VR applications. Fast and high-fidelity scene generation is paramount for ensuring user-friendly experiences. However, existing methods are characterized by lengthy generation processes or necessitate the intricate manual specification of motion parameters, which introduces inconvenience for users. Furthermore, these methods often rely on narrow-field viewpoint iterative generations, compromising global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast and higher-quality 3D scene generation, while maintaining the scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since the panorama encompasses information about the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene consistency and view quality. Subsequently, we utilize Multi-View Projection (MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for scene reconstruction. Comprehensive experiments demonstrate FastScene surpasses other methods in both generation speed and quality with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a 3D scene within a mere 15 minutes, which is at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation. This paper presents FastScene, a novel framework for fast and high-quality text-driven 3D indoor scene generation that prioritizes scene consistency. Existing methods for 3D indoor scene generation are either slow, require manual specification of motion parameters, or struggle to maintain global consistency. Fast and high-fidelity scene generation is crucial for user-friendly experiences in various applications like gaming, smart homes, and AR/VR. FastScene first generates a panorama from a text prompt and estimates its depth. It then uses Coarse View Synthesis (CVS) to generate novel panoramic views with holes, which are filled using Progressive Novel View Inpainting (PNVI) on cubemap representations. Finally, Multi-View Projection (MVP) converts panoramas to perspective views for 3D Gaussian Splatting (3DGS) reconstruction. FastScene outperforms existing methods in terms of both generation speed and visual quality, while maintaining better scene consistency. FastScene can generate a 3D scene from a text prompt in just 15 minutes, at least one hour faster than state-of-the-art methods. The proposed PNVI and MVP techniques are adaptable to existing panoramic datasets, enabling high-quality novel view synthesis and 3D reconstruction from various sources. The reliance on depth estimation accuracy can affect the quality of novel view synthesis. Future work includes exploring 3D scene editing capabilities and incorporating multimodal learning. 3d scene generation, text-to-3d, novel view synthesis, panorama, 3d gaussian splatting
2405.05702 Report NGM-SLAM: Gaussian Splatting SLAM with Radiance Field Submap Mingrui Li, Jingwei Huang, Lei Sun, Aaron Xuxiang Tian, Tianchen Deng, Hongyu Wang SLAM systems based on Gaussian Splatting have garnered attention due to their capabilities for rapid real-time rendering and high-fidelity mapping. However, current Gaussian Splatting SLAM systems usually struggle with large scene representation and lack effective loop closure detection. To address these issues, we introduce NGM-SLAM, the first 3DGS based SLAM system that utilizes neural radiance field submaps for progressive scene expression, effectively integrating the strengths of neural radiance fields and 3D Gaussian Splatting. We utilize neural radiance field submaps as supervision and achieve high-quality scene expression and online loop closure adjustments through Gaussian rendering of fused submaps. Our results on multiple real-world scenes and large-scale scene datasets demonstrate that our method can achieve accurate hole filling and high-quality scene expression, supporting monocular, stereo, and RGB-D inputs, and achieving state-of-the-art scene reconstruction and tracking performance. This paper introduces NGM-SLAM, a novel dense Gaussian splatting SLAM system that leverages neural radiance field submaps for progressive scene representation, effectively addressing limitations of current 3DGS-SLAM systems in handling large scenes and loop closures. Current Gaussian Splatting SLAM systems often struggle with representing extensive scenes and lack robust loop closure detection, hindering their applicability in large-scale environments. This work aims to overcome these limitations and improve the performance of dense SLAM systems. The proposed NGM-SLAM system employs neural radiance field submaps as priors for 3D Gaussian rendering, progressively constructing the scene. It incorporates a local-to-global loop closure detection and optimization process, utilizing submaps for supervision and achieving real-time error correction during mapping. NGM-SLAM demonstrates superior performance compared to state-of-the-art NeRF/GS-based SLAM methods in terms of rendering and tracking accuracy on various datasets, including Replica, ScanNet, TUM RGB-D, and EuRoC. The system effectively addresses the issue of scene gaps by leveraging neural submaps for guidance, achieving more complete and detailed scene reconstruction compared to methods relying solely on 3DGS. NGM-SLAM exhibits robust tracking and reconstruction capabilities in large-scale scenes, effectively mitigating drift through its loop closure mechanism and enabling real-time operation even with limited computational resources. Limited real-time reconstruction ability in extremely large-scale environments like city-level scenarios due to current memory and computational constraints. Future work could explore porting the system to CUDA programming for enhanced mesh extraction and higher-quality mesh generation. slam, 3d gaussian splatting, neural radiance fields, loop closure, dense reconstruction
2405.05691 Report StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework Yiheng Huang, Hui Yang, Chuanchen Luo, Yuxi Wang, Shibiao Xu, Zhaoxiang Zhang, Man Zhang, Junran Peng Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies. The effect of the design of each component is still unclear. In addition, the iterative denoising process consumes considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. For this reason, we first conduct a comprehensive investigation into network architectures, training strategies, and inference processes. Based on this in-depth analysis, we tailor each component for efficient high-quality human motion generation. Despite the promising performance, the tailored model still suffers from foot skating, which is a ubiquitous issue in diffusion-based solutions. To eliminate footskate, we identify foot-ground contact and correct foot motions along the denoising process. By organically combining these well-designed components, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that our StableMoFusion performs favorably against current state-of-the-art methods. Project page: https://h-y1heng.github.io/StableMoFusion-page/ Presents StableMoFusion, a robust and efficient diffusion-based motion generation framework that leverages a Conv1D UNet architecture and novel training/inference strategies. Addresses limitations of existing diffusion-based motion generation methods, including lack of systematic analysis, long inference time, and foot skating issues. Conducts comprehensive analysis of network architectures, training strategies, and inference processes, incorporating efficient samplers, text caching, parallel CFG computation, low-precision inference, and a footskate cleanup mechanism. Achieves state-of-the-art results in FID and R-Precision on HumanML3D dataset. Significantly reduces inference time compared to previous methods, achieving an average of 0.5 seconds per motion. Effectively mitigates foot skating issues in generated motions. Current inference speed, while improved, does not yet meet real-time industry standards. Future work will focus on further acceleration through model scaling and reducing single-step latency. motion generation, diffusion models, text-to-motion, efficient inference, footskate cleanup
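A minimal sketch of a footskate cleanup of the kind described above (not the paper's exact module): frames where a foot is near the ground and nearly static are marked as contacts, and the foot's horizontal position is frozen across each contact. A y-up coordinate system, per-joint input, and the two thresholds are assumptions.

```python
import torch

def cleanup_footskate(foot_pos, height_thresh=0.05, vel_thresh=0.01):
    """Simple foot-skating correction (sketch).

    foot_pos : (T, 3) world-space positions of one foot joint over T frames.
    Returns the corrected trajectory and the boolean contact mask.
    """
    vel = torch.zeros_like(foot_pos)
    vel[1:] = foot_pos[1:] - foot_pos[:-1]
    contact = (foot_pos[:, 1] < height_thresh) & (vel.norm(dim=-1) < vel_thresh)

    fixed = foot_pos.clone()
    for t in range(1, foot_pos.shape[0]):
        if contact[t]:                       # pin x/z to the previous frame
            fixed[t, 0] = fixed[t - 1, 0]
            fixed[t, 2] = fixed[t - 1, 2]
    return fixed, contact
```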
2405.05663 Report RPBG: Towards Robust Neural Point-based Graphics in the Wild Qingtian Zhu, Zizhuang Wei, Zhongtian Zheng, Yifan Zhan, Zhuyu Yao, Jiawang Zhang, Kejian Wu, Yinqiang Zheng Point-based representations have recently gained popularity in novel view synthesis, for their unique advantages, e.g., intuitive geometric representation, simple manipulation, and faster convergence. However, based on our observation, these point-based neural re-rendering methods are only expected to perform well under ideal conditions and suffer from noisy, patchy points and unbounded scenes, which are challenging to handle but de facto common in real applications. To this end, we revisit one such influential method, known as Neural Point-based Graphics (NPBG), as our baseline, and propose Robust Point-based Graphics (RPBG). We analyze in depth the factors that prevent NPBG from achieving satisfactory renderings on generic datasets, and accordingly reform the pipeline to make it more robust to varying datasets in-the-wild. Inspired by the practices in image restoration, we greatly enhance the neural renderer to enable the attention-based correction of point visibility and the inpainting of incomplete rasterization, with only acceptable overheads. We also seek a simple and lightweight alternative for environment modeling and an iterative method to alleviate the problem of poor geometry. By thorough evaluation on a wide range of datasets with different shooting conditions and camera trajectories, RPBG stably outperforms the baseline by a large margin, and exhibits great robustness over state-of-the-art NeRF-based variants. Code available at https://github.com/QT-Zhu/RPBG. This paper presents RPBG (Robust Point-based Graphics), a novel method for robust neural point-based re-rendering that enhances the existing NPBG method to handle generic, in-the-wild datasets. Existing point-based neural re-rendering methods, despite their advantages like intuitive representation and faster convergence, struggle with noisy and patchy point clouds and unbounded scenes common in real-world applications. This work aims to address these limitations and enhance robustness for wider applicability. RPBG improves upon NPBG by: (1) Introducing a Downgrade-aware Convolution (DAC) module in the neural renderer to accurately determine point visibility and inpaint incomplete rasterizations. (2) Using a lightweight, trainable feature vector for environment modeling instead of a computationally expensive environment map. (3) Employing a point cloud augmentation technique using pseudo densities to refine poorly triangulated point clouds. (4) Implementing a collaborative end-to-end optimization of neural textures and renderer parameters. RPBG significantly outperforms the baseline NPBG and achieves state-of-the-art performance on various datasets with challenging scene types, including unbounded, inside-out, large-scale, and sparse-view scenes. RPBG exhibits robustness and generalizability by achieving high-quality renderings across diverse datasets using a single set of hyperparameters, eliminating the need for per-scene tuning. The method proves to be computationally efficient and scalable, handling large-scale scenes with limited memory compared to some existing point-based methods. While computationally efficient in terms of rendering, RPBG requires more storage for CNN parameters and neural textures compared to lightweight RF-based methods. 
The enhanced context exchange among points achieved by the DAC module can slightly decrease the editability of individual points in the scene. point-based graphics, novel view synthesis, neural rendering, 3d reconstruction, robustness
2405.05615 Report Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, Yunhe Wang Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: projecting the output of pre-trained vision encoders to the input space of pre-trained language models as visual prompts; and then transferring the models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm still exhibits inefficiency since it significantly increases the input length of the language models. In this paper, in contrast to integrating visual prompts into inputs, we regard visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information. Motivated by the finding that Feed-Forward Network (FFN) of language models acts as "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with the weights of FFN for visual knowledge injection. Experimental results across various VL tasks and language models reveal that MemVP significantly reduces the training time and inference latency of the finetuned VL models and surpasses the performance of previous PEFT methods. Code: https://github.com/JieShibo/MemVP This paper introduces Memory-Space Visual Prompting (MemVP), a novel approach for efficient vision-language (VL) fine-tuning that integrates visual prompts as knowledge into the weights of Feed-Forward Networks (FFNs) in language models. Existing VL fine-tuning methods often extend input length with visual prompts, leading to inefficiency. MemVP addresses this limitation by treating visual prompts as external knowledge, injecting them directly into the memory space of language models. MemVP projects image features into visual prompts, adds positional embeddings, and concatenates them with the FFN weight matrices. This enables retrieval of visual knowledge during text generation without increasing input length. MemVP outperforms previous Parameter-Efficient Fine-Tuning (PEFT) baselines on various VL benchmarks, including VQAv2, GQA, ScienceQA, and COCO Captions. MemVP significantly reduces training time and inference latency compared to input-space visual prompting methods. Visualization experiments confirm that MemVP successfully injects visual knowledge into language model memory, enabling retrieval of relevant visual information during text generation. The inference speed advantage of MemVP is less pronounced for generating long texts, as its main impact is during the generation of the first token. MemVP might inherit drawbacks of pre-trained language models, such as inherent biases, misinformation, or potential copyright violation. vision-language models, parameter-efficient fine-tuning, visual prompting, feed-forward networks, knowledge injection
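The memory-space injection described above can be illustrated with a small sketch. Assuming the usual view of an FFN as key-value memory, the snippet below appends projected image features as extra key/value entries of a toy FFN; the layer names, dimensions, and projector are illustrative assumptions, not the released MemVP code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemorySpaceVisualPromptFFN(nn.Module):
    """Sketch of an FFN whose key/value memory is extended with visual prompts.

    The frozen FFN computes act(x @ W_up) @ W_down. Treating W_up / W_down as
    keys and values, projected image features are appended as extra key-value
    entries, so visual knowledge is retrieved without lengthening the token
    sequence. Shapes and names are illustrative only.
    """

    def __init__(self, d_model=64, d_ffn=256, d_visual=32, n_visual=16):
        super().__init__()
        self.up = nn.Linear(d_model, d_ffn, bias=False)    # keys of the FFN memory
        self.down = nn.Linear(d_ffn, d_model, bias=False)  # values of the FFN memory
        # Small trainable projector from image features to the language-model space.
        self.vis_proj = nn.Linear(d_visual, 2 * d_model)
        self.pos = nn.Parameter(torch.zeros(n_visual, 2 * d_model))  # positional embedding

    def forward(self, x, image_feats):
        # x: (B, T, d_model) text hidden states; image_feats: (B, n_visual, d_visual)
        vk, vv = (self.vis_proj(image_feats) + self.pos).chunk(2, dim=-1)  # (B, n, d_model)
        h_text = F.gelu(self.up(x))                        # (B, T, d_ffn): original memory lookup
        h_vis = F.gelu(x @ vk.transpose(1, 2))             # (B, T, n): match against visual keys
        return self.down(h_text) + h_vis @ vv              # retrieve original + visual values

if __name__ == "__main__":
    ffn = MemorySpaceVisualPromptFFN()
    out = ffn(torch.randn(2, 10, 64), torch.randn(2, 16, 32))
    print(out.shape)  # torch.Size([2, 10, 64])
```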
2405.05538 Report A Survey on Personalized Content Synthesis with Diffusion Models Xulu Zhang, Xiao-Yong Wei, Wengyu Zhang, Jinlin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li Recent advancements in generative models have significantly impacted content creation, leading to the emergence of Personalized Content Synthesis (PCS). With a small set of user-provided examples, PCS aims to customize the subject of interest to specific user-defined prompts. Over the past two years, more than 150 methods have been proposed. However, existing surveys mainly focus on text-to-image generation, with few providing up-to-date summaries on PCS. This paper offers a comprehensive survey of PCS, with a particular focus on the diffusion models. Specifically, we introduce the generic frameworks of PCS research, which can be broadly classified into optimization-based and learning-based approaches. We further categorize and analyze these methodologies, discussing their strengths, limitations, and key techniques. Additionally, we delve into specialized tasks within the field, such as personalized object generation, face synthesis, and style personalization, highlighting their unique challenges and innovations. Despite encouraging progress, we also present an analysis of the challenges such as overfitting and the trade-off between subject fidelity and text alignment. Through this detailed overview and analysis, we propose future directions to advance the development of PCS. This paper presents a comprehensive survey of Personalized Content Synthesis (PCS) with diffusion models, focusing on generating customized images from user-provided references and text prompts. PCS is rapidly growing, with over 150 methods proposed in two years, highlighting its significance in content creation and the need for a consolidated overview. The paper categorizes PCS methods into optimization-based and learning-based approaches, analyzing their strengths, limitations, and key techniques. It further examines specialized tasks like personalized object and face generation, style transfer, and multi-subject composition. Optimization-based methods excel in subject fidelity but require fine-tuning for each subject, leading to high storage demands. Learning-based methods offer fast inference without fine-tuning but often struggle to capture fine-grained details and might exhibit limited generalization ability. Despite progress, PCS still faces challenges such as overfitting on limited references, balancing subject fidelity with text alignment, and the lack of standardized evaluation metrics and datasets. The overfitting problem in PCS, particularly for non-rigid subjects or semantically similar backgrounds, requires further investigation and solutions. Achieving a balance between high subject fidelity and flexible text alignment remains a challenge, demanding innovative model architectures and training strategies. generative models, diffusion models, personalized content synthesis, image generation, text-to-image synthesis
2405.05446 Report GDGS: Gradient Domain Gaussian Splatting for Sparse Representation of Radiance Fields Yuanhao Gong The 3D Gaussian splatting methods are becoming popular. However, they work directly on the signal, leading to a dense representation of the signal. Even with some techniques such as pruning or distillation, the results are still dense. In this paper, we propose to model the gradient of the original signal. The gradients are much sparser than the original signal. Therefore, the gradients use far fewer Gaussian splats, leading to more efficient storage and thus higher computational performance during both training and rendering. Thanks to the sparsity, during the view synthesis, only a small number of pixels are needed, leading to much higher computational performance ($100\sim 1000\times$ faster). The 2D image can then be recovered from the gradients via solving a Poisson equation with linear computational complexity. Several experiments are performed to confirm the sparseness of the gradients and the computational performance of the proposed method. The method can be applied to various applications, such as human body modeling and indoor environment modeling. This paper proposes GDGS, a novel gradient domain Gaussian splatting method for sparse radiance field representation. Existing 3D Gaussian splatting methods, while popular, struggle with dense signal representation, impacting storage and computational efficiency. Gradient domain processing offers a sparser representation, potentially addressing these limitations. The method involves three steps: 1) approximating the Laplacian field of the signal with Gaussian splats, 2) projecting these splats onto the image plane to obtain the 2D Laplacian field, and 3) reconstructing the image from this field by solving a Poisson equation using a U-Net architecture. GDGS achieves higher accuracy (0.6-1dB PSNR improvement) compared to original 3D Gaussian splatting. The gradient domain representation in GDGS results in significantly sparser representations (up to 100 times fewer particles). The sparsity leads to faster rendering speeds due to reduced computational demands. The PSNR improvement can be influenced by factors like image resolution, scene complexity, and lighting conditions. Future work could explore integrating techniques like importance sampling and adaptive splatting to further enhance efficiency. gaussian splatting, gradient domain, radiance fields, sparse representation, view synthesis
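Step 3 of the pipeline, recovering an image from its Laplacian field, can be illustrated without the learned U-Net solver. The sketch below uses a classical FFT-based Poisson solve under a periodic-boundary assumption; it is a stand-in for the paper's learned reconstruction, not the authors' implementation.

```python
import numpy as np

def laplacian(img):
    """5-point Laplacian with periodic boundaries (np.roll wraps around)."""
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0 * img)

def poisson_reconstruct(lap):
    """Invert the periodic 5-point Laplacian in the Fourier domain (up to a constant)."""
    H, W = lap.shape
    fy = np.fft.fftfreq(H).reshape(-1, 1)
    fx = np.fft.fftfreq(W).reshape(1, -1)
    # Eigenvalues of the discrete 5-point Laplacian under periodic boundaries.
    denom = 2.0 * np.cos(2 * np.pi * fy) + 2.0 * np.cos(2 * np.pi * fx) - 4.0
    denom[0, 0] = 1.0                      # avoid division by zero for the DC term
    F = np.fft.fft2(lap) / denom
    F[0, 0] = 0.0                          # the mean is unrecoverable; fix it to zero
    return np.real(np.fft.ifft2(F))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.standard_normal((64, 64))
    img -= img.mean()                      # reconstruction is only defined up to the mean
    rec = poisson_reconstruct(laplacian(img))
    print(np.abs(rec - img).max())         # ~1e-12: exact up to numerical error
```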
2405.05252 Report Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models Hongjie Wang, Difan Liu, Yan Kang, Yijun Li, Zhe Lin, Niraj K. Jha, Yuchen Liu Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable. To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining. Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model. Project webpage: https://atedm.github.io. This paper introduces AT-EDM, a training-free framework for accelerating diffusion models (DMs) by pruning redundant tokens in attention blocks during run-time, leveraging attention map information. DMs excel at image generation but are computationally expensive, hindering their application on resource-constrained devices. Existing efficiency methods rely on retraining, which is computationally costly and inflexible for diverse deployment settings. This work offers a training-free approach, enabling dynamic and efficient DM acceleration without retraining. The method uses a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), derived from attention maps to identify and prune redundant tokens within each denoising step. To maintain image quality, a similarity-based token recovery method utilizes attention map information to restore pruned tokens for convolution operations. Furthermore, a Denoising-Steps-Aware Pruning (DSAP) approach adjusts the pruning ratio across denoising steps based on attention map variance analysis, preserving crucial information in early steps. AT-EDM achieves comparable image quality with a 38.8% FLOPs reduction compared to the full Stable Diffusion XL (SD-XL) model. It outperforms the state-of-the-art training-free method, ToMe, in terms of both FID (image quality) and CLIP (text-image alignment) scores under various FLOPs budgets. The DSAP schedule is shown to improve image quality significantly and is generalizable to other run-time acceleration techniques. The performance of AT-EDM is inherently upper bounded by the full-sized pre-trained model. Accessing attention maps in efficiently implemented DMs might require additional computation due to fused operations in attention libraries. diffusion models, model compression, training-free acceleration, attention mechanism, text-to-image generation
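A rough illustration of attention-based token ranking: the sketch below runs a generic weighted PageRank over an attention map and keeps the top-scoring tokens. It is a simplified stand-in for the paper's G-WPR algorithm and similarity-based recovery; the function names and keep ratio are chosen purely for illustration.

```python
import torch

def weighted_pagerank_scores(attn, damping=0.85, iters=30):
    """Generic weighted PageRank over an attention map.

    attn: (N, N) attention matrix (rows are queries, columns are keys). Tokens
    that receive a lot of attention from important tokens get high scores. This
    is a simplified stand-in for the paper's G-WPR ranking.
    """
    N = attn.shape[0]
    # Column-stochastic transition matrix: token j distributes its vote along row j.
    P = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)   # row-normalize
    P = P.t()                                                    # columns now sum to 1
    r = torch.full((N,), 1.0 / N)
    for _ in range(iters):
        r = damping * (P @ r) + (1.0 - damping) / N
    return r

def prune_tokens(tokens, attn, keep_ratio=0.5):
    """Keep the highest-ranked tokens; the rest would later be recovered for conv layers."""
    scores = weighted_pagerank_scores(attn)
    k = max(1, int(tokens.shape[0] * keep_ratio))
    keep_idx = scores.topk(k).indices.sort().values
    return tokens[keep_idx], keep_idx

if __name__ == "__main__":
    N, D = 16, 8
    tokens = torch.randn(N, D)
    attn = torch.softmax(torch.randn(N, N), dim=-1)
    kept, idx = prune_tokens(tokens, attn, keep_ratio=0.25)
    print(kept.shape, idx.tolist())
```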
2405.05224 Report Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation Jonas Kohler, Albert Pumarola, Edgar Schönfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, Ali Thabet Diffusion models are a powerful generative framework, but come with expensive inference. Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime. In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps. Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction. Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient high-quality generation. Imagine Flash is a novel distillation framework for text-to-image diffusion models that enables high-fidelity image generation in just one to three steps. Diffusion models are powerful but computationally expensive. Existing acceleration methods often sacrifice quality or struggle with complex prompts in ultra-low step regimes. The framework uses three key components: (i) Backward Distillation to align training and inference, (ii) Shifted Reconstruction Loss to dynamically transfer knowledge from teacher to student, and (iii) Noise Correction for improved initial sample quality. Imagine Flash achieves comparable quality to the baseline Emu model using only three steps. It outperforms state-of-the-art distillation methods (Step Distillation, LCM, ADD) in FID and CLIP scores. Human evaluations show a clear preference for Imagine Flash-generated images over ADD and Lightning. Human evaluation, while extensive, is subjective and may vary with different prompts and annotators. Like other text-to-image models, there's a risk of generating biased or offensive content despite efforts to ensure fairness and safety. generative ai, efficient diffusion, image synthesis, text-to-image, distillation
2405.05216 Report FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models Jinglin Xu, Yijie Guo, Yuxin Peng The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods, they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named FinePOSE. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024. This paper proposes FinePOSE, a novel fine-grained prompt-driven denoiser based on a diffusion model for 3D human pose estimation. Existing 3D HPE methods struggle with depth ambiguity, human body complexity, and generalizing to diverse actions. FinePOSE addresses these challenges by leveraging accessible texts and natural human knowledge to guide pose estimation. FinePOSE uses three core blocks: 1) Fine-grained Part-aware Prompt learning (FPP) to construct prompts capturing action class, body part movements, and kinematic information, 2) Fine-grained Prompt-pose Communication (FPC) to enhance denoising by injecting prompt embedding into noisy poses, and 3) Prompt-driven Timestamp Stylization (PTS) for adaptive adjustment at each denoising step using prompt embedding and noise level. FinePOSE achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP datasets for 3D human pose estimation. Fine-grained part-aware prompt learning significantly improves denoising quality and estimation accuracy. The proposed method shows promising results on multi-human pose estimation using a post-integration strategy. FinePOSE is not specifically designed for multi-person scenarios. The diffusion model-based approach is computationally expensive. 3d human pose estimation, diffusion models, prompt learning, denoising, computer vision
2405.05173 Report A Survey on Occupancy Perception for Autonomous Driving: The Information Fusion Perspective Huaiyuan Xu, Junliang Chen, Shiyu Meng, Yi Wang, Lap-Pui Chau 3D occupancy perception technology aims to observe and understand dense 3D environments for autonomous vehicles. Owing to its comprehensive perception capability, this technology is emerging as a trend in autonomous driving perception systems, and is attracting significant attention from both industry and academia. Similar to traditional bird's-eye view (BEV) perception, 3D occupancy perception has the nature of multi-source input and the necessity for information fusion. However, the difference is that it captures vertical structures that are ignored by 2D BEV. In this survey, we review the most recent works on 3D occupancy perception, and provide in-depth analyses of methodologies with various input modalities. Specifically, we summarize general network pipelines, highlight information fusion techniques, and discuss effective network training. We evaluate and analyze the occupancy perception performance of the state-of-the-art on the most popular datasets. Furthermore, challenges and future research directions are discussed. We hope this paper will inspire the community and encourage more research work on 3D occupancy perception. A comprehensive list of studies in this survey is publicly available in an active repository that continuously collects the latest work: https://github.com/HuaiyuanXu/3D-Occupancy-Perception. This paper presents a comprehensive survey of recent advancements in 3D occupancy perception for autonomous driving, focusing on the crucial role of information fusion. 3D occupancy perception provides a dense, 3D understanding of the environment, surpassing traditional BEV perception by capturing height information. It facilitates a range of downstream applications in autonomous driving, like object detection, tracking, and motion planning. The survey categorizes occupancy perception methods based on input modalities: LiDAR-centric, vision-centric, and multi-modal. It dissects core methodological issues including network pipelines, spatial and temporal information fusion techniques, and training strategies (strong, weak, semi, and self-supervised learning). LiDAR-centric methods currently achieve higher accuracy than vision-centric approaches due to precise depth information from LiDAR. Vision-centric occupancy perception is rapidly advancing, driven by the cost-effectiveness of cameras and advancements in deep learning. Multi-modal occupancy perception shows promising potential but requires further research to fully leverage the benefits of fusing different data modalities. Current occupancy methods struggle to achieve real-time performance for deployment on autonomous driving systems. Robustness and generalization of occupancy perception models in complex, real-world scenarios remain open challenges. autonomous driving, occupancy perception, information fusion, lidar, computer vision
2405.05027 Report StyleMamba : State Space Model for Efficient Text-driven Image Style Transfer Zijia Wang, Zhi-Song Liu We present StyleMamba, an efficient image style transfer framework that translates text prompts into corresponding visual styles while preserving the content integrity of the original images. Existing text-guided stylization requires hundreds of training iterations and takes a lot of computing resources. To speed up the process, we propose a conditional State Space Model for Efficient Text-driven Image Style Transfer, dubbed StyleMamba, that sequentially aligns the image features to the target text prompts. To enhance the local and global style consistency between text and image, we propose masked and second-order directional losses to optimize the stylization direction to significantly reduce the training iterations by 5 times and the inference time by 3 times. Extensive experiments and qualitative evaluation confirm the robust and superior stylization performance of our methods compared to the existing baselines. This paper presents StyleMamba, an efficient text-driven image style transfer framework that leverages a conditional State Space Model within an AutoEncoder architecture to rapidly translate text prompts into corresponding visual styles while preserving content integrity. Existing text-guided stylization methods are computationally expensive, requiring hundreds of training iterations and significant GPU resources. StyleMamba addresses this limitation by significantly speeding up the process through an innovative framework and novel loss functions. StyleMamba utilizes a pretrained VAE for encoding and decoding, a Style Fusion Module with AdaLN and Mamba for efficient style fusion, and a SigLIP Module for enhanced text-image alignment. It introduces masked and second-order directional losses to expedite training convergence and improve style fidelity. StyleMamba demonstrates superior performance over state-of-the-art methods in terms of stylization quality, content preservation, and computational efficiency, achieving significant speedups in both training and inference. Ablation studies confirm the effectiveness of the proposed Style Fusion Module, the use of SigLIP for text-image alignment, and the impact of the novel loss functions on stylization quality and training speed. StyleMamba exhibits strong generalization capabilities, extending its applications to diverse creative domains, including multiple style transfer, product design, painting assistance, UI design, cinematic style transformation, and fashion design. While StyleMamba excels in many areas, it currently exhibits limitations in understanding content-guided or less commonly used text prompts, particularly in scenarios involving face editing or object manipulation. Future research will focus on addressing these limitations by improving the model's handling of diverse facial features and expanding its comprehension of novel and abstract concepts for enhanced style transfer capabilities. text-driven image style transfer, state space model, autoencoder, masked directional loss, second-order directional loss
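The masked and second-order directional losses build on the familiar CLIP-style directional loss; as a hedged illustration, the sketch below shows only the first-order version, which aligns the image-edit direction with the text direction in embedding space. The encoders that produce these features (e.g. a CLIP or SigLIP image/text encoder) are assumed to exist elsewhere; random tensors stand in for them here.

```python
import torch
import torch.nn.functional as F

def directional_loss(img_feat_src, img_feat_styl, txt_feat_src, txt_feat_tgt):
    """CLIP-style directional loss.

    Aligns the direction the image moves in embedding space (source -> stylized)
    with the direction the text moves (source prompt -> style prompt). Only the
    loss is sketched; the feature extractors are assumed to be provided.
    """
    d_img = F.normalize(img_feat_styl - img_feat_src, dim=-1)
    d_txt = F.normalize(txt_feat_tgt - txt_feat_src, dim=-1)
    return 1.0 - F.cosine_similarity(d_img, d_txt, dim=-1).mean()

if __name__ == "__main__":
    B, D = 4, 512
    src_img = torch.randn(B, D)
    styl_img = torch.randn(B, D, requires_grad=True)
    src_txt, tgt_txt = torch.randn(B, D), torch.randn(B, D)
    loss = directional_loss(src_img, styl_img, src_txt, tgt_txt)
    loss.backward()       # gradients flow into the stylized image features
    print(float(loss))
```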
2405.05010 Report ${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields Ning Wang, Lefei Zhang, Angel X Chang Neural fields (NeRF) have emerged as a promising approach for representing continuous 3D scenes. Nevertheless, the lack of semantic encoding in NeRFs poses a significant challenge for scene decomposition. To address this challenge, we present a single model, Multi-Modal Decomposition NeRF (${M^2D}$NeRF), that is capable of both text-based and visual patch-based edits. Specifically, we use multi-modal feature distillation to integrate teacher features from pretrained visual and language models into 3D semantic feature volumes, thereby facilitating consistent 3D editing. To enforce consistency between the visual and language features in our 3D feature volumes, we introduce a multi-modal similarity constraint. We also introduce a patch-based joint contrastive loss that helps to encourage object-regions to coalesce in the 3D feature space, resulting in more precise boundaries. Experiments on various real-world scenes show superior performance in 3D scene decomposition tasks compared to prior NeRF-based methods. This paper proposes Multi-Modal Decomposition NeRF (${M^2D}$NeRF), a novel NeRF-based method that uses multi-modal feature distillation to enable both text-based and visual patch-based 3D scene decomposition. NeRF-based 3D scene decomposition often lacks object-level awareness and struggles with semantic ambiguity at object boundaries. Existing methods either rely on expensive 3D annotations or have difficulty generalizing to real-world scenes. This paper leverages the power of pretrained foundation models to enable more accurate and flexible decomposition for real-world scenes. The ${M^2D}$NeRF model extends the NeRF model with visual and language feature branches, trained via multi-modal feature distillation using DINO and CLIP-LSeg as teacher models, respectively. It further introduces a multi-modal similarity constraint and a patch-based joint contrastive loss to encourage consistency and distinct boundaries between objects. Outperforms existing distillation-based scene decomposition methods (DFF and N3F) in both quantitative and qualitative evaluation on the LLFF dataset. Achieves comparable segmentation performance to annotation-based NeRF-SOS. Supports both image patch and text queries for flexible object extraction and editing. The density-based representations can lead to noise in the decomposition. Lacks 3D inpainting capabilities, potentially struggling with scenes containing occlusions or missing parts. neural radiance fields, 3d scene decomposition, multi-modal learning, feature distillation, contrastive learning
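The multi-modal feature distillation can be pictured as a per-ray regression of rendered features onto teacher features. The sketch below is an illustrative loss only: it assumes the visual and language features share a dimension (in practice DINO and CLIP-LSeg features differ), and the cosine term merely stands in for the paper's multi-modal similarity constraint.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(rendered_vis, rendered_lang,
                              teacher_vis, teacher_lang, w_sim=0.1):
    """Distill 2D teacher features into rendered 3D feature fields.

    rendered_*: (R, D) features volume-rendered along R rays from the visual and
    language branches; teacher_*: (R, D) DINO / CLIP-LSeg features sampled at the
    corresponding pixels. The cosine term loosely couples the two modalities as a
    stand-in for a multi-modal similarity constraint.
    """
    l_vis = F.mse_loss(rendered_vis, teacher_vis)
    l_lang = F.mse_loss(rendered_lang, teacher_lang)
    sim_rendered = F.cosine_similarity(rendered_vis, rendered_lang, dim=-1)
    sim_teacher = F.cosine_similarity(teacher_vis, teacher_lang, dim=-1)
    l_sim = F.mse_loss(sim_rendered, sim_teacher)
    return l_vis + l_lang + w_sim * l_sim

if __name__ == "__main__":
    R, D = 1024, 64
    loss = feature_distillation_loss(torch.randn(R, D, requires_grad=True),
                                     torch.randn(R, D, requires_grad=True),
                                     torch.randn(R, D), torch.randn(R, D))
    print(float(loss))
```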
2405.04834 Report FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities. The paper introduces FlexEControl, a novel approach for multimodal control in text-to-image generation that enhances efficiency without compromising the flexibility or controllability of existing methods. Existing methods for incorporating structural conditions in text-to-image generation often require extensive training, leading to inefficiencies in model development and deployment. FlexEControl addresses this limitation by enabling efficient training while maintaining controllability and flexibility. FlexEControl builds upon the architecture of Uni-ControlNet, incorporating a multi-scale condition injection strategy with learnable convolutional layers. It leverages pretrained Stable Diffusion weights and employs efficient training techniques. FlexEControl achieves comparable or superior performance compared to state-of-the-art baselines like T2I-Adapter, PHM, Uni-ControlNet, and LoRA across various structural conditions including edge maps, sketches, pose information, depth maps, and segmentation maps. The method demonstrates a significant reduction in training time and computational resources, enhancing efficiency without sacrificing performance. FlexEControl excels in generating images that adhere to both textual prompts and structural conditions, showcasing its efficacy in multimodal control for text-to-image generation. The paper acknowledges the potential limitations of the selected structural condition extraction methods, which may impact the overall performance. Future work aims to explore more advanced architectures and training techniques to further enhance the efficiency and controllability of the approach. text-to-image generation, multimodal control, efficient training, stable diffusion, structural conditions
2405.04682 Report TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models often produce single-scene video clips that depict an entity performing a particular action (e.g., `a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real-world (e.g., `a red panda climbing a tree' followed by `the red panda sleeps on the top of the tree'). To generate multi-scene videos from the pretrained T2V model, we introduce Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video with the representations of the first scene description (e.g., `a red panda climbing a tree') and second scene description (e.g., `the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and be visually consistent (e.g., entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence using human evaluation. The project website is https://talc-mst2v.github.io/. The paper proposes Time-Aligned Captions (TALC), a framework for generating multi-scene videos from text using pre-trained text-to-video diffusion models by conditioning parts of the video on corresponding scene descriptions. Most existing text-to-video models struggle to generate coherent multi-scene videos, limiting their applicability to real-world scenarios where such videos are common. TALC modifies the text conditioning mechanism of diffusion models to align visual features of specific video segments with embeddings of corresponding scene descriptions. The paper also introduces a method to create a multi-scene video-text dataset using Gemini-Pro-Vision for fine-tuning. TALC, without fine-tuning, outperforms baselines like merging captions or videos in terms of visual consistency and text adherence. Fine-tuning the model with TALC and a multi-scene dataset further improves performance, particularly text adherence, as measured by automatic and human evaluation. The proposed approach maintains visual quality comparable to the base model, unlike fine-tuning with naively merged captions. The performance of multi-scene video generation decreases as the number of scenes increases. The reliance on scene detection for multi-scene data generation could introduce errors if the detection is inaccurate. text-to-video generation, diffusion models, multi-scene video, video generation, text conditioning
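One way to picture the time-aligned conditioning: each generated frame cross-attends only to the embedding of the scene description it belongs to. The sketch below assembles such per-frame text context and masks; the padding policy, shapes, and function name are assumptions for illustration, not the TALC implementation.

```python
import torch

def time_aligned_text_context(scene_embs, frames_per_scene):
    """Build per-frame text conditioning from per-scene caption embeddings.

    scene_embs: list of S tensors, each (L_s, D), the token embeddings of one scene
    description. frames_per_scene: list of S ints. Each generated frame cross-attends
    only to the embedding of the scene it belongs to, which is the core idea of
    time-aligned captions.
    """
    D = scene_embs[0].shape[-1]
    L_max = max(e.shape[0] for e in scene_embs)
    ctx, mask = [], []
    for emb, n_frames in zip(scene_embs, frames_per_scene):
        pad = torch.zeros(L_max - emb.shape[0], D)
        padded = torch.cat([emb, pad], dim=0)                       # (L_max, D)
        m = torch.cat([torch.ones(emb.shape[0]),
                       torch.zeros(L_max - emb.shape[0])])
        ctx.append(padded.unsqueeze(0).expand(n_frames, -1, -1))    # one copy per frame
        mask.append(m.unsqueeze(0).expand(n_frames, -1))
    # (F_total, L_max, D) text context and (F_total, L_max) attention mask,
    # ready to be fed to per-frame cross-attention in the T2V denoiser.
    return torch.cat(ctx, dim=0), torch.cat(mask, dim=0)

if __name__ == "__main__":
    scene1 = torch.randn(7, 64)    # "a red panda climbing a tree"
    scene2 = torch.randn(9, 64)    # "the red panda sleeps on the top of the tree"
    ctx, mask = time_aligned_text_context([scene1, scene2], frames_per_scene=[8, 8])
    print(ctx.shape, mask.shape)   # torch.Size([16, 9, 64]) torch.Size([16, 9])
```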
2405.04533 Report ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning Jing Lin, Yao Feng, Weiyang Liu, Michael J. Black Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. Each of these methods works in isolation instead of synergistically. Here we address this problem and build a language-driven human understanding system -- ChatHuman, which combines and integrates the skills of many different methods. To do so, we finetune a Large Language Model (LLM) to select and use a wide variety of existing tools in response to user inputs. In doing so, ChatHuman is able to combine information from multiple tools to solve problems more accurately than the individual tools themselves and to leverage tool output to improve its ability to reason about humans. The novel features of ChatHuman include leveraging academic publications to guide the application of 3D human-related tools, employing a retrieval-augmented generation model to generate in-context-learning examples for handling new tools, and discriminating and integrating tool results to enhance 3D human understanding. Our experiments show that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful, system for 3D human reasoning. ChatHuman is a multi-modal Large Language Model (LLM) specialized for 3D human understanding. It leverages a wide range of existing 3D human analysis tools to perform tasks like pose estimation, shape measurement, and contact reasoning. Existing 3D human analysis methods often work in isolation. ChatHuman integrates these disparate tools into a single system, enabling more accurate and comprehensive 3D human reasoning. ChatHuman uses a paper-based Retrieval-Augmented Generation (RAG) mechanism to understand tool functions by reading relevant academic papers. It is fine-tuned on a dataset of instruction-following data constructed with GPT-4V, learning to select, use, discriminate, and integrate tool results. ChatHuman outperforms existing LLM-based methods in tool selection and usage accuracy, especially for tools unseen during training. It achieves state-of-the-art performance on various 3D human understanding tasks, including pose estimation, shape measurement, and human-object interaction detection. The model demonstrates the ability to combine tool outputs with its general knowledge to solve complex reasoning tasks, like reasoning-based pose estimation and speculative pose generation. ChatHuman's performance depends on the clarity of user requests and the capabilities of existing tools. It currently primarily focuses on text and image modalities, with limited exploration of video and motion analysis. 3d human understanding, large language models, tool use, retrieval-augmented generation, multi-modal learning
2405.04496 Report Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing Yi Zuo, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Shuyuan Yang, Yuwei Guo Existing diffusion-based video editing methods have achieved impressive results in motion editing. Most of the existing methods focus on the motion alignment between the edited video and the reference video. However, these methods do not constrain the background and object content of the video to remain unchanged, which makes it possible for users to generate unexpected videos. In this paper, we propose a one-shot video motion editing method called Edit-Your-Motion that requires only a single text-video pair for training. Specifically, we design the Detailed Prompt-Guided Learning Strategy (DPL) to decouple spatio-temporal features in space-time diffusion models. DPL separates learning object content and motion into two training stages. In the first training stage, we focus on learning the spatial features (the features of object content) and breaking down the temporal relationships in the video frames by shuffling them. We further propose Recurrent-Causal Attention (RC-Attn) to learn the consistent content features of the object from unordered video frames. In the second training stage, we restore the temporal relationship in video frames to learn the temporal feature (the features of the background and object's motion). We also adopt the Noise Constraint Loss to smooth out inter-frame differences. Finally, in the inference stage, we inject the content features of the source object into the editing branch through a two-branch structure (editing branch and reconstruction branch). With Edit-Your-Motion, users can edit the motion of objects in the source video to generate more exciting and diverse videos. Comprehensive qualitative experiments, quantitative experiments and user preference studies demonstrate that Edit-Your-Motion performs better than other methods. Proposes Edit-Your-Motion, a one-shot video motion editing method that decouples spatio-temporal features in space-time diffusion models for accurate motion editing while preserving object content and background. Existing video motion editing methods struggle to maintain content and background consistency due to entangled spatial and temporal features in the diffusion models. Introduces Detailed Prompt-Guided Learning Strategy (DPL) with two training stages: (1) learning spatial features from shuffled, background-masked frames using Recurrent-Causal Attention, and (2) learning temporal features from ordered frames using Temporal Attention and Noise Constraint Loss. Employs a two-branch structure during inference to inject spatial features from the reconstruction branch into the editing branch. Achieves accurate motion alignment with the reference video while preserving the source video's object content and background. Outperforms state-of-the-art methods in both qualitative and quantitative evaluations, including CLIP similarity and LPIPS metrics. Demonstrates superior performance in user studies, with participants preferring Edit-Your-Motion for text alignment, content alignment, and motion alignment. Two-stage training demands considerable computational resources. Further exploration is needed to enable video motion editing with limited computational power. video motion editing, space-time diffusion model, detailed prompt-guided learning, recurrent-causal attention, noise constraint loss
2405.04404 Report Vision Mamba: A Comprehensive Survey and Taxonomy Xiao Liu, Chenxu Zhang, Lei Zhang State Space Model (SSM) is a mathematical model used to describe and analyze the behavior of dynamic systems. This model has witnessed numerous applications in several fields, including control theory, signal processing, economics and machine learning. In the field of deep learning, state space models are used to process sequence data, such as time series analysis, natural language processing (NLP) and video understanding. By mapping sequence data to state space, long-term dependencies in the data can be better captured. In particular, modern SSMs have shown strong representational capabilities in NLP, especially in long sequence modeling, while maintaining linear time complexity. Notably, based on the latest state-space models, Mamba merges time-varying parameters into SSMs and formulates a hardware-aware algorithm for efficient training and inference. Given its impressive efficiency and strong long-range dependency modeling capability, Mamba is expected to become a new AI architecture that may outperform Transformer. Recently, a number of works have attempted to study the potential of Mamba in various fields, such as general vision, multi-modal, medical image analysis and remote sensing image analysis, by extending Mamba from the natural language domain to the visual domain. To fully understand Mamba in the visual domain, we conduct a comprehensive survey and present a taxonomy study. This survey focuses on Mamba's application to a variety of visual tasks and data types, and discusses its predecessors, recent advances and far-reaching impact on a wide range of domains. Since Mamba is now on an upward trend, please notify us if you have new findings, and new progress on Mamba will be included in this survey in a timely manner and updated on the Mamba project at https://github.com/lx6c78/Vision-Mamba-A-Comprehensive-Survey-and-Taxonomy. This paper presents a comprehensive survey of Vision Mamba, a recent advancement in deep learning that leverages State Space Models (SSMs) for visual tasks, surpassing traditional CNN and Transformer architectures. Vision Mamba is gaining increasing attention due to its superior performance in visual tasks, particularly its ability to efficiently process long sequences and handle high-resolution images, making it a potential game-changer in the field. The paper systematically categorizes Vision Mamba variants based on their applications, such as general vision, multi-modal learning, and vertical domains like remote sensing and medical image analysis. It provides a detailed taxonomy, principles, and technical details of each variant. Vision Mamba models exhibit remarkable computational efficiency and effectiveness in various tasks, including image classification, object detection, semantic segmentation, video analysis, and image restoration. They excel in handling high-resolution inputs and complex data dependencies, achieving superior performance with lower computational costs compared to CNNs and Transformers. The survey highlights the advancements in scanning mechanisms and synergistic hybrid architectures that contribute to Vision Mamba's success. Further research is needed to design more sophisticated scanning mechanisms to optimize Vision Mamba's performance for specific visual tasks, especially for capturing intricate spatial relationships. 
Exploring the combination of Vision Mamba with other architectures like Transformers, while addressing their inherent differences in sequence modeling, holds potential for further performance improvement. state space model, vision mamba, computer vision, deep learning, multi-modal learning, remote sensing, medical image analysis
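For readers unfamiliar with state space models, the recurrence underlying Mamba-style blocks is compact enough to sketch. The snippet below shows a plain discretized SSM scan; Mamba additionally makes the parameters input-dependent (selective) and uses a hardware-aware parallel scan, neither of which is reproduced here.

```python
import torch

def ssm_scan(x, A_bar, B_bar, C):
    """Minimal discretized state-space scan: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.

    x: (T, d_in) input sequence, A_bar: (N, N), B_bar: (N, d_in), C: (d_out, N).
    This sequential loop only illustrates the underlying linear recurrence that
    lets SSMs carry long-range context with linear time complexity in T.
    """
    T = x.shape[0]
    h = torch.zeros(A_bar.shape[0])
    ys = []
    for t in range(T):
        h = A_bar @ h + B_bar @ x[t]      # state update carries long-range context
        ys.append(C @ h)                  # readout
    return torch.stack(ys)                # (T, d_out)

if __name__ == "__main__":
    T, d_in, d_out, N = 32, 4, 4, 16
    x = torch.randn(T, d_in)
    A_bar = 0.9 * torch.eye(N)            # stable toy dynamics
    B_bar = torch.randn(N, d_in) * 0.1
    C = torch.randn(d_out, N) * 0.1
    print(ssm_scan(x, A_bar, B_bar, C).shape)   # torch.Size([32, 4])
```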
2405.04356 Report Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation Jihyun Kim, Changjae Oh, Hoseok Do, Soohyun Kim, Kwanghoon Sohn We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of Generative Adversarial networks (GANs) and diffusion models (DMs) by employing the multi-modal features in the DM into the latent space of the pre-trained GANs. We present a simple mapping and a style modulation network to link two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion, the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations into the generated image. Our proposed network produces realistic 2D, multi-view, and stylized face images, which align well with inputs. We validate our method by using pre-trained 2D and 3D GANs, and our results outperform existing methods. Our project page is available at https://github.com/1211sh/Diffusion-driven_GAN-Inversion/. This paper introduces a novel multi-modal face image generation method that leverages the strengths of both diffusion models (DMs) and Generative Adversarial Networks (GANs). The proposed approach uses a pre-trained DM as an encoder to extract multi-modal features from text prompts and visual inputs (e.g., semantic masks, scribbles) and then maps them into the latent space of a pre-trained GAN for high-quality face image synthesis. Existing methods for multi-modal face image generation struggle to effectively combine text and visual inputs, often leading to inconsistencies between the generated image and the input conditions. This work addresses these limitations by introducing a novel framework that effectively bridges the gap between DMs and GANs, enabling more accurate and controllable face image generation. The proposed method utilizes a mapping network to connect the latent spaces of the pre-trained DM and GAN. An attention-based style modulation network refines the mapped latent code by leveraging multi-scale features and cross-attention maps from the DM decoder, capturing fine-grained details from the input text and visual conditions. The model is trained across multiple denoising steps to effectively capture the evolving semantic representations in the DM. The method generates high-quality 2D and 3D-aware face images that are consistent with both text prompts and visual inputs, outperforming existing GAN-based and DM-based methods in terms of visual quality and semantic accuracy. The proposed approach demonstrates superior performance in preserving the identity of the input image while modifying facial attributes based on the given text prompt, as evidenced by quantitative metrics like ID similarity. Ablation studies validate the effectiveness of each component, highlighting the importance of the mapping network, the attention-based style modulation network, and the multi-step training strategy. The method currently faces limitations in transferring significantly distinct styles from artistic domains to the photo-realistic domain of GANs. Future work will explore mapping diffusion features related to pose into the GAN latent space to enhance 3D-aware face style transfer. 
multi-modal image generation, face image synthesis, diffusion models, gan inversion, attention mechanisms
2405.04312 Report Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory when generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limited to 1024*1024. In this work, we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096*4096 images. The project URL is https://github.com/THUDM/Inf-DiT. Inf-DiT, an infinite-resolution diffusion model capable of upsampling images of various shapes and resolutions with memory efficiency, especially for ultra-high-resolution images. Existing image diffusion models struggle to generate ultra-high-resolution images due to quadratic memory increase, limiting their application in various fields. The paper proposes a Unidirectional Block Attention (UniBA) algorithm to reduce memory consumption from O(N^2) to O(N) by dividing the image into blocks and processing them sequentially, while maintaining global consistency. This allows for generating parts of the image in parallel based on memory restrictions. Additionally, global and local consistency techniques are employed using CLIP image embedding and nearby LR cross-attention. Inf-DiT achieves state-of-the-art performance in ultra-high-resolution image generation on the HPDv2 dataset, outperforming baselines in FID and FIDcrop metrics. It excels in classic super-resolution tasks on the DIV2K dataset, surpassing other models in perceptual and fidelity metrics. Human evaluation confirms Inf-DiT’s superiority in detail authenticity, global coherence, and consistency with low-resolution input. Correcting inaccuracies from earlier upsampling stages in iterative upsampling needs further exploration. The model's performance with different block sizes and their impact on memory and generation quality require further investigation. diffusion models, ultra-high-resolution generation, super-resolution, unidirectional block attention, memory efficiency
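The memory saving comes from restricting which blocks attend to which. As a hedged sketch, the snippet below computes attention for the block currently being generated against its own tokens plus a cache of previously generated blocks; the block layout, cache policy, and names are simplifications of the paper's UniBA mechanism, not its implementation.

```python
import torch

def block_attention(q_block, kv_cache, k_block, v_block):
    """Attention for one image block under a unidirectional block dependency.

    q_block/k_block/v_block: (T, D) tokens of the block currently being generated.
    kv_cache: list of (K, V) pairs from already-generated blocks (e.g. the blocks
    above and to the left). The current block attends to itself plus the cache,
    so peak memory depends on the cache size rather than on the whole image.
    """
    ks = torch.cat([k for k, _ in kv_cache] + [k_block], dim=0)
    vs = torch.cat([v for _, v in kv_cache] + [v_block], dim=0)
    attn = torch.softmax(q_block @ ks.t() / ks.shape[-1] ** 0.5, dim=-1)
    return attn @ vs

if __name__ == "__main__":
    T, D = 64, 32                               # tokens per block, channel dim
    cache = [(torch.randn(T, D), torch.randn(T, D)) for _ in range(3)]  # 3 previous blocks
    q, k, v = (torch.randn(T, D) for _ in range(3))
    out = block_attention(q, cache, k, v)
    print(out.shape)                            # torch.Size([64, 32])
```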
2405.04233 Report Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results. Vidu is a high-performance text-to-video generator that produces 1080p videos up to 16 seconds long in a single generation, using a U-ViT backbone for scalability and long sequence modeling. Breaks duration limitations of previous video generation models that primarily relied on U-Net backbones and focused on shorter durations. Employs a video autoencoder for dimensionality reduction, and a U-ViT model for noise prediction, trained on a vast dataset of text-video pairs automatically annotated using a high-performance video captioner. Generates coherent and dynamic videos with 3D consistency, cuts, transitions, camera movements, lighting effects, and emotional portrayal. Exhibits imaginative ability, generating scenes beyond real-world scenarios. Shows promising results in controllable video generation tasks like canny-to-video, video prediction, and subject-driven generation. Occasional flaws in details and interactions between subjects. Limited exploration of controllable generation at higher resolutions. text-to-video generation, diffusion models, u-vit, video synthesis, controllable generation
2405.04007 Report SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario data collected from the internet, which captures the intricacies of user intentions for promoting the practical application of image editing in the real world. (3) High-precision multi-turn editing data annotated by humans, which involves multiple rounds of edits for simulating iterative editing processes. The combination of these diverse data sources makes SEED-Data-Edit a comprehensive and versatile dataset for training language-guided image editing model. We fine-tune a pretrained Multimodal Large Language Model (MLLM) that unifies comprehension and generation with SEED-Data-Edit. The instruction tuned model demonstrates promising results, indicating the potential and effectiveness of SEED-Data-Edit in advancing the field of instructional image editing. The datasets are released in https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit. Introduces SEED-Data-Edit, a hybrid dataset for instruction-guided image editing, combining automated, real-world, and multi-turn data. Addresses the lack of high-quality, large-scale datasets for training models in the challenging field of instruction-guided image editing. Combines three data sources: 1) Automated pipeline generating 'remove' and 'add' edits and style/object changes, 2) Real-world editing requests from photography websites, 3) Human-annotated multi-turn edits simulating iterative editing. SEED-Data-Edit contains 3.7M image pairs and 21K multi-turn sequences (up to 5 rounds). Fine-tuned MLLM model SEED-X-Edit on the dataset shows promising results in following editing instructions. SEED-X-Edit outperforms baseline models, demonstrating the dataset's potential in advancing instructional image editing. Current model training utilizes multi-turn data in a single-turn way. Future work will explore true multi-turn image editing. image editing, instruction-guided, dataset, multimodal, large language model
2405.03958 Report Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model Joo Young Choi, Jaesung R. Park, Inkyu Park, Jaewoong Cho, Albert No, Ernest K. Ryu Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and the class or caption embedding input corresponding to the desired conditional generation. Such conditioning involves scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality. For example, a drop-in addition of LoRA conditioning to EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseline of 1.97/1.79. This paper introduces a novel method for conditioning attention layers in diffusion models using Low-Rank Adaptation (LoRA), improving image generation quality. Current state-of-the-art diffusion models lack direct conditioning on attention layers, potentially hindering performance optimization. This method addresses this gap by incorporating conditioning directly into attention mechanisms. The authors implement various LoRA conditioning methods including TimeLoRA, ClassLoRA for discrete-time settings, and UC-LoRA for continuous SNR settings. They evaluate these methods on popular diffusion model architectures like IDDPM and EDM trained on CIFAR-10, FFHQ64, and ImageNet datasets. Adding LoRA conditioning to attention layers consistently improves FID scores across different models and datasets, demonstrating its effectiveness. LoRA conditioning alone, even without conventional scale-and-shift conditioning on convolutional layers, achieves comparable FID scores, highlighting its capability. The method exhibits robustness in extrapolating conditioning information, showing potential for broader applications beyond class conditioning. The paper acknowledges limited exploration of optimal LoRA rank and the number of bases due to computational constraints. Further research is needed to investigate the full potential of LoRA conditioning, particularly in large-scale diffusion models and text-to-image generation. diffusion models, low-rank adaptation (lora), attention mechanism, image generation, conditioning methods
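The drop-in conditioning can be sketched as a class-dependent low-rank update added to a frozen linear projection (e.g. the attention qkv projections). The snippet below is an illustrative ClassLoRA-style layer; the exact basis construction, time conditioning (TimeLoRA / UC-LoRA), and initialization in the paper may differ.

```python
import torch
import torch.nn as nn

class ClassConditionedLoRALinear(nn.Module):
    """Frozen linear projection plus a class-conditioned low-rank (LoRA) update.

    y = W x + B_c (A_c x), where the rank-r factors (A_c, B_c) are selected by the
    class label c. The same pattern can be applied to the q/k/v projections of the
    U-Net's attention layers; time conditioning can be handled analogously.
    """

    def __init__(self, d_in, d_out, num_classes, rank=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)            # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(num_classes, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_classes, d_out, rank))  # zero-init: no-op at start

    def forward(self, x, class_idx):
        # x: (B, T, d_in), class_idx: (B,) long tensor of class labels.
        A = self.A[class_idx]                              # (B, r, d_in)
        B = self.B[class_idx]                              # (B, d_out, r)
        delta = torch.einsum("bor,bri,bti->bto", B, A, x)  # low-rank conditioned update
        return self.base(x) + delta

if __name__ == "__main__":
    layer = ClassConditionedLoRALinear(d_in=64, d_out=192, num_classes=10, rank=4)
    x = torch.randn(2, 16, 64)                             # e.g. tokens entering a qkv projection
    y = layer(x, torch.tensor([3, 7]))
    print(y.shape)                                         # torch.Size([2, 16, 192])
```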
2405.03894 Report MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View Emmanuelle Bourigault, Pauline Bourigault Generating consistent multiple views for 3D reconstruction tasks remains a challenge for existing image-to-3D diffusion models. Generally, incorporating 3D representations into the diffusion model decreases its speed as well as its generalizability and quality. This paper proposes a general framework to generate consistent multi-view images from a single image by leveraging a scene representation transformer and a view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one image input, our model is able to generate 3D meshes surpassing baseline methods in evaluation metrics, including PSNR, SSIM and LPIPS. This paper presents MVDiff, a novel multi-view diffusion model for consistent image generation and 3D reconstruction using epipolar geometry constraints and multi-view attention within a transformer-based architecture. Existing image-to-3D diffusion models struggle with generating consistent multiple views, limiting their use in tasks requiring accurate 3D understanding. MVDiff addresses this by enhancing consistency and enabling high-quality 3D reconstruction from limited input. MVDiff leverages a scene representation transformer (SRT) to learn a latent 3D representation from input views. It then employs a view-conditioned latent diffusion model guided by epipolar geometry and multi-view attention to generate consistent novel views. These views are used for 3D reconstruction via techniques like NeuS. MVDiff achieves superior novel view synthesis performance on the GSO dataset compared to baselines like Zero123-XL, exhibiting significant improvements in PSNR, SSIM, and LPIPS. The model demonstrates strong 3D reconstruction capabilities, outperforming methods like One-2-3-45 and SyncDreamer in Chamfer Distance and Volume IoU. Ablation studies confirm the importance of both epipolar and multi-view attention mechanisms for achieving consistent and high-fidelity results. The model's computational cost, particularly during inference, presents a limitation. Future work could explore efficiency improvements. While MVDiff shows promising results, the generation of implausible meshes remains a challenge. Expanding the training dataset and refining data curation could alleviate this. multi-view synthesis, 3d reconstruction, diffusion models, epipolar geometry, scene representation transformer
2405.03689 Report Pose Priors from Language Models Sanjay Subramanian, Evonne Ng, Lea Müller, Dan Klein, Shiry Ginosar, Trevor Darrell We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization. Despite its simplicity, our method produces surprisingly compelling pose reconstructions of people in close contact, correctly capturing the semantics of the social and physical interactions. We demonstrate that our method rivals more complex state-of-the-art approaches that require expensive human annotation of contact points and training specialized models. Moreover, unlike previous approaches, our method provides a unified framework for resolving self-contact and person-to-person contact. This paper presents ProsePose, a zero-shot pose optimization method that leverages large multimodal models (LMMs) to improve 3D human pose estimation by enforcing accurate physical contact constraints. Accurately capturing physical contact (self-contact and person-to-person) in 3D pose estimation is crucial for understanding human behavior and social interactions, but existing methods often struggle to do so accurately without expensive contact annotations. ProsePose uses an LMM to generate natural language descriptions of contact points from an image. These descriptions are then converted into mathematical constraints, and a loss function based on these constraints is used to optimize the 3D pose estimates from a pose regressor. ProsePose produces more accurate 3D pose reconstructions than previous zero-shot methods on multiple datasets of one- and two-person interactions. The method accurately captures semantically relevant contact points, improving both joint error and the percentage of correct contact points (PCC). This work demonstrates that LMMs have implicit knowledge of human pose and can be used as effective priors for 3D pose estimation. The method's performance depends on the accuracy and consistency of the LMM's outputs, as LMM hallucination of contact points can occur. Future work could explore using multiple LMMs or developing more robust methods for handling LMM uncertainty and potential biases. 3d pose estimation, contact inference, large multimodal models, language priors, zero-shot learning
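The conversion from language to losses might look roughly like the toy sketch below, which pulls LMM-named joint pairs of two people together; the joint-name mapping, margin, and function names are hypothetical illustrations, not ProsePose's actual constraint set.

```python
# Illustrative sketch only: turning language-derived contact pairs into a loss
# that pulls the named 3D joints of two people together. Indices are hypothetical.
import torch

JOINT_INDEX = {"left_hand": 20, "right_hand": 21, "left_shoulder": 16}  # assumed mapping


def contact_loss(joints_a: torch.Tensor, joints_b: torch.Tensor,
                 contacts, margin: float = 0.02) -> torch.Tensor:
    """joints_*: (J, 3) 3D joints of each person; contacts: pairs of joint names."""
    losses = []
    for name_a, name_b in contacts:
        d = torch.norm(joints_a[JOINT_INDEX[name_a]] - joints_b[JOINT_INDEX[name_b]])
        losses.append(torch.clamp(d - margin, min=0.0))  # penalize distance beyond a small margin
    return torch.stack(losses).mean()


# e.g. the LMM says "person A's left hand touches person B's right hand"
loss = contact_loss(torch.randn(24, 3), torch.randn(24, 3),
                    [("left_hand", "right_hand")])
```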
2405.03685 Report Language-Image Models with 3D Understanding Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: as multi-turn question-answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted such as 2D box or a set of candidate 3D boxes from specialists. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively. Cube-LLM also shows competitive results in general MLLM benchmarks such as refCOCO for 2D grounding with (87.0) average score, as well as visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM. This work introduces Cube-LLM, a multi-modal large language model (MLLM) capable of reasoning in both 2D and 3D for image understanding, by leveraging a new large-scale pretraining dataset and a unified training framework. Extending the perceptual capabilities of MLLMs from 2D image coordinates to 3D view coordinates enables them to perceive and reason about visual input closer to how humans perceive the world, which is crucial for applications like autonomous driving. The authors create LV3D, a large-scale 2D and 3D pretraining dataset by unifying existing datasets and formulating tasks as multi-turn question-answering. They decompose 3D labels into simpler components (point, depth, size, orientation), enabling versatile input/output formats and inducing 2D to 3D generalization. They also introduce visual chain-of-thought prompting and specialist model prompting to improve reasoning. Cube-LLM significantly outperforms existing methods on 3D grounded reasoning tasks, achieving 21.3 points higher AP_BEV on the Talk2Car dataset. Cube-LLM demonstrates strong performance in complex reasoning about driving scenarios, improving the overall score by 17.7 points on the DriveLM dataset. Cube-LLM achieves state-of-the-art results in 2D referring expression comprehension (87.0 average score on refCOCO) and maintains competitive performance in standard MLLM benchmarks (VQAv2, GQA), showing that 3D reasoning is an expansion, not a trade-off. Cube-LLM currently does not employ resampling methods to reduce the number of vision tokens, limiting its input resolution. Cube-LLM only supports single frame input, lacking the ability to reason about the dynamics of the environment from videos. 
multi-modal large language models, 3d scene understanding, foundation models, autonomous driving, visual grounding
2405.03682 Report An Empty Room is All We Want: Automatic Defurnishing of Indoor Panoramas Mira Slavcheva, Dave Gausebeck, Kevin Chen, David Buchhofer, Azwad Sabik, Chen Ma, Sachal Dhillon, Olaf Brandt, Alan Dolhasz We propose a pipeline that leverages Stable Diffusion to improve inpainting results in the context of defurnishing -- the removal of furniture items from indoor panorama images. Specifically, we illustrate how increased context, domain-specific model fine-tuning, and improved image blending can produce high-fidelity inpaints that are geometrically plausible without needing to rely on room layout estimation. We demonstrate qualitative and quantitative improvements over other furniture removal techniques. This paper presents a pipeline for defurnishing indoor panorama images, leveraging Stable Diffusion for improved inpainting results. Defurnishing is crucial for digital twins in real estate, enabling personalized layouts, interior design experimentation, and property evaluation. The pipeline involves furniture segmentation, context maximization via rolling and padding, robust unfurnished space inpainting using a fine-tuned Stable Diffusion model trained on a dataset of unfurnished panoramas with synthetic furniture and shadows, superresolution, and a custom blending strategy. The fine-tuned Stable Diffusion model effectively reduces hallucinations of furniture in empty spaces. The method produces high-fidelity inpaints that are geometrically plausible without requiring room layout estimation. Quantitative and qualitative evaluations demonstrate superior performance compared to existing techniques like LaMa and LGPN-Net. The method may occasionally exhibit structural alterations or lingering hallucinations. The reliance on synthetic data for training may lead to domain shift issues, impacting the quality of results on real-world images. image inpainting, stable diffusion, defurnishing, panorama images, digital twins
2405.03673 Report MemoryMamba: Memory-Augmented State Space Model for Defect Recognition Qianning Wang, He Hu, Yucheng Zhou As automation advances in manufacturing, the demand for precise and sophisticated defect detection technologies grows. Existing vision models for defect recognition methods are insufficient for handling the complexities and variations of defects in contemporary manufacturing settings. These models especially struggle in scenarios involving limited or imbalanced defect data. In this work, we introduce MemoryMamba, a novel memory-augmented state space model (SSM), designed to overcome the limitations of existing defect recognition models. MemoryMamba integrates the state space model with the memory augmentation mechanism, enabling the system to maintain and retrieve essential defect-specific information in training. Its architecture is designed to capture dependencies and intricate defect characteristics, which are crucial for effective defect detection. In the experiments, MemoryMamba was evaluated across four industrial datasets with diverse defect types and complexities. The model consistently outperformed other methods, demonstrating its capability to adapt to various defect recognition scenarios. This paper introduces MemoryMamba, a novel memory-augmented state space model (SSM) for defect recognition, designed to overcome limitations of existing models in handling complexities and variations of defects, especially with limited or imbalanced data. Accurate defect recognition is crucial in manufacturing for quality control, production efficiency, cost reduction, and product reliability. Existing methods struggle with limited or imbalanced defect data, common in industrial settings. MemoryMamba integrates SSMs with memory augmentation, enabling it to retain and retrieve defect-specific information. It uses coarse- and fine-grained memory networks with a fusion module, optimized by contrastive learning and mutual information maximization. MemoryMamba consistently outperforms existing models (ResNet, DeiT, Swin, Vmamba) in accuracy, precision, recall, and F1 score across four industrial datasets. Ablation studies confirm the essential role of coarse-grained memory networks, fine-grained memory networks, and the fusion module in achieving superior performance. The choice of similarity metric in the fusion module and memory size for both memory networks significantly impacts the model's effectiveness, with cosine similarity and specific sizes showing better results depending on the dataset. The optimal memory size is dataset-dependent and requires careful tuning. The model's performance with other memory augmentation techniques or optimization strategies is yet to be explored. defect recognition, state space models, memory augmentation, computer vision, manufacturing
2405.03659 Report A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose Kaiwen Jiang, Yang Fu, Mukund Varma T, Yash Belhe, Xiaolong Wang, Hao Su, Ravi Ramamoorthi Novel view synthesis from a sparse set of input images is a challenging problem of great practical interest, especially when camera poses are absent or inaccurate. Direct optimization of camera poses and usage of estimated depths in neural radiance field algorithms usually do not produce good results because of the coupling between poses and depths, and inaccuracies in monocular depth estimation. In this paper, we leverage the recent 3D Gaussian splatting method to develop a novel construct-and-optimize method for sparse view synthesis without camera poses. Specifically, we construct a solution progressively by using monocular depth and projecting pixels back into the 3D world. During construction, we optimize the solution by detecting 2D correspondences between training views and the corresponding rendered images. We develop a unified differentiable pipeline for camera registration and adjustment of both camera poses and depths, followed by back-projection. We also introduce a novel notion of an expected surface in Gaussian splatting, which is critical to our optimization. These steps enable a coarse solution, which can then be low-pass filtered and refined using standard optimization methods. We demonstrate results on the Tanks and Temples and Static Hikes datasets with as few as three widely-spaced views, showing significantly better quality than competing methods, including those with approximate camera pose information. Moreover, our results improve with more views and outperform previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset. This paper introduces a novel construct-and-optimize approach for sparse view synthesis using 3D Gaussian splatting, eliminating the need for known camera poses. Existing NeRF-based methods struggle with sparse view synthesis, especially when camera poses are unknown or inaccurate. This work addresses this challenge by constructing a solution based on monocular depth and optimizing it using correspondences. The method constructs a coarse solution by progressively registering and adjusting camera poses and depths via a differentiable pipeline that leverages 2D correspondences. It introduces a novel concept of an expected surface in Gaussian splatting for accurate correspondence matching. This coarse solution is then refined using standard optimization. Achieves state-of-the-art results on Tanks and Temples and Static Hikes datasets with as few as three views. Significantly outperforms competing methods, including those using approximate camera pose information. Performance improves with more views, outperforming previous methods even when using half the dataset. Constructing the coarse solution depends on the scale-consistent assumption of estimated monocular depth, which doesn't always hold for complex scenes. Assumes overlapping between consecutive frames, limiting its applicability to unordered image collections. view synthesis, 3d gaussian splatting, camera optimization, sparse view, correspondence matching
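The construction step described above (back-projecting pixels with monocular depth) can be illustrated with a small pinhole-camera sketch, assuming known intrinsics; the correspondence-driven pose/depth adjustment and Gaussian-splatting refinement of the actual pipeline are not reproduced here.

```python
# Minimal back-projection sketch under assumed pinhole intrinsics.
import numpy as np


def backproject(depth: np.ndarray, K: np.ndarray, c2w: np.ndarray) -> np.ndarray:
    """depth: (H, W) metric depth, K: (3, 3) intrinsics, c2w: (4, 4) camera-to-world."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)      # homogeneous pixel coords
    rays = (np.linalg.inv(K) @ pix.T).T                                  # camera-space directions
    pts_cam = rays * depth.reshape(-1, 1)                                # scale by monocular depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], 1)
    return (c2w @ pts_h.T).T[:, :3]                                      # world-space points


pts = backproject(np.ones((4, 4)), np.eye(3), np.eye(4))
print(pts.shape)  # (16, 3)
```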
2405.03486 Report UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images Yiting Qu, Xinyue Shen, Yixin Wu, Michael Backes, Savvas Zannettou, Yang Zhang Image safety classifiers play an important role in identifying and mitigating the spread of unsafe images online (e.g., images including violence, hateful rhetoric, etc.). At the same time, with the advent of text-to-image models and increasing concerns about the safety of AI models, developers are increasingly relying on image safety classifiers to safeguard their models. Yet, the performance of current image safety classifiers remains unknown for real-world and AI-generated images. To bridge this research gap, in this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe based on a set of 11 unsafe categories of images (sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers that are powered by general-purpose visual language models. Our assessment indicates that existing image safety classifiers are not comprehensive and effective enough in mitigating the multifaceted problem of unsafe images. Also, we find that classifiers trained only on real-world images tend to have degraded performance when applied to AI-generated images. Motivated by these findings, we design and implement a comprehensive image moderation tool called PerspectiveVision, which effectively identifies 11 categories of real-world and AI-generated unsafe images. The best PerspectiveVision model achieves an overall F1-Score of 0.810 on six evaluation datasets, which is comparable with closed-source and expensive state-of-the-art models like GPT-4V. UnsafeBench and PerspectiveVision can aid the research community in better understanding the landscape of image safety classification in the era of generative AI. This paper introduces UnsafeBench, a benchmarking framework for evaluating image safety classifiers on both real-world and AI-generated images. Image safety classifiers are crucial for mitigating the spread of harmful content online, but their performance on diverse and AI-generated imagery remains underexplored. The authors curate a dataset of 10K real-world and AI-generated images, labeled across 11 unsafe categories. They evaluate the effectiveness and robustness of 5 conventional and 3 VLM-based image safety classifiers. Additionally, they develop PerspectiveVision, a comprehensive image moderation tool. Existing image safety classifiers show imbalanced performance across unsafe categories and struggle with AI-generated images. Classifiers trained on real-world images experience performance degradation on AI-generated images, likely due to distinct characteristics in the latter, such as artistic representations and grid layouts. PerspectiveVision, the proposed image moderation tool, achieves comparable performance to GPT-4V in identifying unsafe images, showcasing its potential as a benchmark tool. The UnsafeBench dataset exhibits bias towards unsafe content and limited diversity in AI-generated images, potentially affecting the generalizability of findings. The generalizability of PerspectiveVision is evaluated on a limited range of unsafe categories due to a lack of labeled datasets. 
image safety, content moderation, ai-generated images, benchmarking framework, visual language models
2405.03436 Report DBDH: A Dual-Branch Dual-Head Neural Network for Invisible Embedded Regions Localization Chengxin Zhao, Hefei Ling, Sijing Xie, Nan Sun, Zongyi Li, Yuxuan Shi, Jiazhong Chen Embedding invisible hyperlinks or hidden codes in images to replace QR codes has become a hot topic recently. This technology requires first localizing the embedded region in the captured photos before decoding. Existing methods that train models to find the invisible embedded region struggle to obtain accurate localization results, leading to degraded decoding accuracy. This limitation is primarily because the CNN network is sensitive to low-frequency signals, while the embedded signal is typically in the high-frequency form. Based on this, this paper proposes a Dual-Branch Dual-Head (DBDH) neural network tailored for the precise localization of invisible embedded regions. Specifically, DBDH uses a low-level texture branch containing 62 high-pass filters to capture the high-frequency signals induced by embedding. A high-level context branch is used to extract discriminative features between the embedded and normal regions. DBDH employs a detection head to directly detect the four vertices of the embedding region. In addition, we introduce an extra segmentation head to segment the mask of the embedding region during training. The segmentation head provides pixel-level supervision for model learning, facilitating better learning of the embedded signals. Based on two state-of-the-art invisible offline-to-online messaging methods, we construct two datasets and augmentation strategies for training and testing localization models. Extensive experiments demonstrate the superior performance of the proposed DBDH over existing methods. This paper proposes DBDH, a dual-branch dual-head neural network for accurate localization of invisible embedded regions in images. Accurate localization of embedded regions is crucial for decoding messages in invisible offline-to-online messaging systems, but existing methods struggle to accurately locate these regions due to the high-frequency nature of embedded signals. DBDH uses a low-level texture branch with high-pass filters to capture high-frequency embedded signals and a high-level context branch to extract discriminative features. It employs a vertex detection head for localization and a segmentation head during training for region-wise supervision. DBDH outperforms existing methods like StegaStamp and Invisible Markers in localization accuracy. The use of high-pass filters in the texture branch is shown to be effective for capturing embedded signals. The addition of the segmentation head during training improves localization performance. The model is evaluated on datasets based on only two specific offline-to-online messaging schemes. Further research could explore the generalization of DBDH to other embedding methods and real-world scenarios. offline-to-online messaging, invisible embedded regions localization, high-pass filter, segmentation, keypoint detection
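As a rough illustration of the texture branch, the sketch below applies one fixed high-pass (Laplacian) kernel per channel to expose the high-frequency residual; DBDH's actual bank of 62 high-pass filters is not reproduced.

```python
# Sketch of a texture branch driven by a fixed high-pass kernel; the 62 specific
# filters used by DBDH are not reproduced here — a Laplacian kernel stands in.
import torch
import torch.nn as nn


class HighPassBranch(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        lap = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        weight = lap.expand(in_channels, 1, 3, 3).clone()       # one kernel per channel
        self.filt = nn.Conv2d(in_channels, in_channels, 3, padding=1,
                              groups=in_channels, bias=False)
        self.filt.weight.data.copy_(weight)
        self.filt.weight.requires_grad_(False)                  # keep the kernels fixed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.filt(x)                                     # high-frequency residual


hf = HighPassBranch()(torch.randn(1, 3, 64, 64))
print(hf.shape)  # torch.Size([1, 3, 64, 64])
```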
2405.03349 Report Retinexmamba: Retinex-based Mamba for Low-light Image Enhancement Jiesong Bai, Yuhao Yin, Qiyuan He, Yuanxian Li, Xiaofeng Zhang In the field of low-light image enhancement, both traditional Retinex methods and advanced deep learning techniques such as Retinexformer have shown distinct advantages and limitations. Traditional Retinex methods, designed to mimic the human eye's perception of brightness and color, decompose images into illumination and reflection components but struggle with noise management and detail preservation under low light conditions. Retinexformer enhances illumination estimation through traditional self-attention mechanisms, but faces challenges with insufficient interpretability and suboptimal enhancement effects. To overcome these limitations, this paper introduces the RetinexMamba architecture. RetinexMamba not only captures the physical intuitiveness of traditional Retinex methods but also integrates the deep learning framework of Retinexformer, leveraging the computational efficiency of State Space Models (SSMs) to enhance processing speed. This architecture features innovative illumination estimators and damage restorer mechanisms that maintain image quality during enhancement. Moreover, RetinexMamba replaces the IG-MSA (Illumination-Guided Multi-Head Attention) in Retinexformer with a Fused-Attention mechanism, improving the model's interpretability. Experimental evaluations on the LOL dataset show that RetinexMamba outperforms existing deep learning approaches based on Retinex theory in both quantitative and qualitative metrics, confirming its effectiveness and superiority in enhancing low-light images. This paper presents RetinexMamba, a novel architecture for low-light image enhancement leveraging Retinex theory and State Space Models (SSMs). Traditional Retinex methods and deep learning techniques, while effective, have limitations in noise management, detail preservation, interpretability, and computational efficiency in low-light image enhancement. RetinexMamba combines an Illumination Estimator (inspired by traditional Retinex) with a Damage Restorer based on Illumination Fusion Visual Mamba (IFVM). IFVM utilizes Illumination Fusion State Space Model (IFSSM) featuring 2D Selective Scanning (SS2D) for linear computational efficiency and Illumination Fusion Attention (IFA) for improved interpretability. RetinexMamba outperforms existing deep learning methods based on Retinex theory on the LOL dataset in quantitative metrics like PSNR and RMSE. Qualitative results show RetinexMamba effectively controls exposure, reduces color distortion, and minimizes noise compared to other SOTA algorithms. Ablation studies demonstrate the benefits of individual components like SS2D and IFA in improving performance. Despite reduced complexity in SS2D, the overall parameter count is increased, leading to higher computational resource consumption. Future work will focus on reducing the total number of parameters while preserving computational efficiency. retinex, low-light enhancement, fused-attention, retinexformer, state space model
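For context, the classical Retinex prior that the illumination estimator builds on decomposes an image as I = R * L. The toy sketch below uses the per-pixel channel maximum as a crude illumination map; RetinexMamba replaces this hand-crafted prior with a learned estimator.

```python
# A classical Retinex-style decomposition I = R * L, with illumination approximated
# by the per-pixel channel maximum (a common prior, not the paper's learned module).
import numpy as np


def retinex_decompose(img: np.ndarray, eps: float = 1e-4):
    """img: (H, W, 3) in [0, 1]. Returns (reflectance, illumination)."""
    illum = img.max(axis=-1, keepdims=True)          # (H, W, 1) rough illumination map
    reflect = img / np.clip(illum, eps, None)        # reflectance with colors preserved
    return reflect, illum


img = np.random.rand(8, 8, 3)
R, L = retinex_decompose(img)
assert np.allclose(R * L, img, atol=1e-3)            # decomposition reconstructs the input
```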
2405.03318 Report Enhancing DETRs Variants through Improved Content Query and Similar Query Aggregation Yingying Zhang, Chuangji Shi, Xin Guo, Jiangwei Lao, Jian Wang, Jiaotuan Wang, Jingdong Chen The design of the query is crucial for the performance of DETR and its variants. Each query consists of two components: a content part and a positional one. Traditionally, the content query is initialized with a zero or learnable embedding, lacking essential content information and resulting in sub-optimal performance. In this paper, we introduce a novel plug-and-play module, Self-Adaptive Content Query (SACQ), to address this limitation. The SACQ module utilizes features from the transformer encoder to generate content queries via self-attention pooling. This allows candidate queries to adapt to the input image, resulting in a more comprehensive content prior and better focus on target objects. However, this improved concentration poses a challenge for the training process that utilizes the Hungarian matching, which selects only a single candidate and suppresses other similar ones. To overcome this, we propose a query aggregation strategy to cooperate with SACQ. It merges similar predicted candidates from different queries, easing the optimization. Our extensive experiments on the COCO dataset demonstrate the effectiveness of our proposed approaches across six different DETR's variants with multiple configurations, achieving an average improvement of over 1.0 AP. This paper introduces a novel plug-and-play module named Self-Adaptive Content Query (SACQ) for improving object detection in DETR and its variants by optimizing the content aspect of object queries. The content query, crucial for DETR's performance, is often initialized with limited information, leading to sub-optimal results. This paper addresses this limitation to enhance object detection accuracy. The SACQ module utilizes self-attention pooling to generate content queries from transformer encoder features. It uses global pooling for initialization and local pooling for refinement. A Query Aggregation (QA) strategy is also proposed to merge similar predictions, further boosting performance. SACQ consistently improves performance across six different DETR variants with an average gain of over 1.0 AP on the COCO dataset. The method shows effectiveness in both iterative bounding box refinement and two-stage Deformable-DETR settings. Visualization of attention maps confirms that SACQ accurately focuses on target objects, demonstrating its effectiveness in content query enhancement. The performance gain on the state-of-the-art DINO method is not significant, suggesting further research on joint optimization of content query and matching strategies. The influence of different thresholds in query aggregation and its interaction with SACQ requires further investigation. object detection, detr, transformer, content query, self-attention
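One simple way to realize attention-based pooling of encoder features into image-adaptive content queries is sketched below with learnable seed queries; the actual SACQ module derives its pooling from the encoder features themselves, with global pooling for initialization and local pooling for refinement, so this is an assumption-laden stand-in.

```python
# Rough sketch of attention pooling: N learnable seeds attend over encoder tokens
# to produce image-adaptive content queries (not the exact SACQ formulation).
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    def __init__(self, dim: int, num_queries: int = 300, heads: int = 8):
        super().__init__()
        self.seeds = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        # memory: (batch, tokens, dim) transformer-encoder features
        q = self.seeds.unsqueeze(0).expand(memory.size(0), -1, -1)
        pooled, _ = self.attn(q, memory, memory)       # queries pool content from the image
        return pooled                                  # (batch, num_queries, dim)


queries = AttentionPooling(256)(torch.randn(2, 1024, 256))
print(queries.shape)  # torch.Size([2, 300, 256])
```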
2405.03243 Report Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data Leonhard Hennicke, Christian Medeiros Adriano, Holger Giese, Jan Mathias Koehler, Lukas Schott Generative foundation models like Stable Diffusion comprise a diverse spectrum of knowledge in computer vision with the potential for transfer learning, e.g., via generating data to train student models for downstream tasks. This could circumvent the necessity of collecting labeled real-world data, thereby presenting a form of data-free knowledge distillation. However, the resultant student models show a significant drop in accuracy compared to models trained on real data. We investigate possible causes for this drop and focus on the role of the different layers of the student model. By training these layers using either real or synthetic data, we reveal that the drop mainly stems from the model's final layers. Further, we briefly investigate other factors, such as differences in data-normalization between synthetic and real, the impact of data augmentations, texture vs. shape learning, and assuming oracle prompts. While we find that some of those factors can have an impact, they are not sufficient to close the gap towards real data. Building upon our insights that mainly later layers are responsible for the drop, we investigate the data-efficiency of fine-tuning a synthetically trained model with real data applied to only those last layers. Our results suggest an improved trade-off between the amount of real training data used and the model's accuracy. Our findings contribute to the understanding of the gap between synthetic and real data and indicate solutions to mitigate the scarcity of labeled real data. The study investigates transfer learning capabilities between models trained on real and synthetic ImageNet-100 datasets, particularly focusing on freezing initial layers pre-trained on one dataset and training remaining layers on the other. Understanding how knowledge transfers between real and synthetic datasets is crucial for leveraging synthetic data's potential in training models for real-world applications, especially when real data is scarce or expensive. The authors systematically experiment with freezing varying numbers of initial layers in a ResNet-like model. They pre-train models on either real or synthetic ImageNet-100, then train the remaining layers on the other dataset. Performance is evaluated on real ImageNet-100 validation data. Transferring knowledge from real to synthetic data is less effective than vice-versa. Freezing a significant number of initial layers pre-trained on synthetic data shows comparable results to models trained entirely on real data. Models with initial layers pre-trained on synthetic data exhibit better resilience to reductions in the amount of real training data. The study focuses solely on ImageNet-100; generalizability to other datasets needs further investigation. Exploring the impact of different synthetic data generation techniques on transfer learning could be beneficial. transfer learning, synthetic data, imagenet, deep learning, computer vision
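The layer-freezing protocol can be sketched as below; which blocks are frozen and the synthetic-pretrained checkpoint are assumptions chosen for illustration, not the paper's exact split.

```python
# Sketch of the freezing protocol: keep early blocks of a (hypothetically
# synthetic-pretrained) ResNet fixed and fine-tune only the later blocks on real data.
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(num_classes=100)          # stand-in for a synthetic-data-pretrained model
frozen = [model.conv1, model.bn1, model.layer1, model.layer2]
for module in frozen:
    for p in module.parameters():
        p.requires_grad = False            # early layers stay at their pretrained values

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "parameters will be fine-tuned on real data")
```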
2405.03150 Report Video Diffusion Models: A Survey Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, Helge Ritter Diffusion generative models have recently become a robust technique for producing and modifying coherent, high-quality video. This survey offers a systematic overview of critical elements of diffusion models for video generation, covering applications, architectural choices, and the modeling of temporal dynamics. Recent advancements in the field are summarized and grouped into development trends. The survey concludes with an overview of remaining challenges and an outlook on the future of the field. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models This paper presents a survey of video diffusion models, focusing on applications, architectural choices, temporal dynamic modeling, and training approaches. Video diffusion models have the potential to revolutionize video generation, editing, and simulation, making this survey timely and relevant. The paper systematically categorizes existing work on video diffusion models by application, architectural choices for temporal modeling, training strategies, and benchmarks. It summarizes notable papers in each category and discusses their contributions. The survey identifies key architectural trends such as the use of UNets, Vision Transformers, cascaded, and latent diffusion models for video generation. It highlights various methods for modeling temporal dynamics, including spatio-temporal attention mechanisms, temporal upsampling techniques, and structure preservation. The paper discusses ongoing challenges including training data limitations, computational costs, and modeling long-term dependencies, while providing directions for future research. The rapid evolution of video diffusion models may lead to some discussed approaches becoming quickly outdated. The survey primarily focuses on technical aspects and might benefit from a deeper discussion of ethical considerations surrounding generative video models. video diffusion models, generative models, video generation, video editing, deep learning
2405.03121 Report AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding Tao Liu, Feilong Chen, Shuai Fan, Chenpeng Du, Qi Chen, Xie Chen, Kai Yu The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait. Unlike existing models that primarily focus on verbal cues such as lip synchronization and fail to capture the complex dynamics of facial expressions and nonverbal cues, AniTalker employs a universal motion representation. This innovative representation effectively captures a wide range of facial dynamics, including subtle expressions and head movements. AniTalker enhances motion depiction through two self-supervised learning strategies: the first involves reconstructing target video frames from source frames within the same identity to learn subtle motion representations, and the second develops an identity encoder using metric learning while actively minimizing mutual information between the identity and motion encoders. This approach ensures that the motion representation is dynamic and devoid of identity-specific details, significantly reducing the need for labeled data. Additionally, the integration of a diffusion model with a variance adapter allows for the generation of diverse and controllable facial animations. This method not only demonstrates AniTalker's capability to create detailed and realistic facial movements but also underscores its potential in crafting dynamic avatars for real-world applications. Synthetic results can be viewed at https://github.com/X-LANCE/AniTalker. AniTalker is a novel framework that generates realistic and diverse talking face animations from a single portrait image by decoupling identity and motion. Existing methods often fail to capture the complex dynamics of facial expressions and nonverbal cues, limiting their ability to create truly lifelike avatars. The framework utilizes a self-supervised learning approach with a universal motion encoder, metric learning for identity recognition, mutual information minimization for disentanglement, and a diffusion model with a variance adapter for generating diverse and controllable facial animations. AniTalker outperforms existing methods in generating realistic and expressive talking face animations, as evidenced by both quantitative metrics (PSNR, SSIM, LPIPS, CSIM) and qualitative assessments. The framework demonstrates strong identity preservation capabilities, effectively separating motion from appearance even in cross-driven scenarios where the source and target identities differ. The motion representation learned by AniTalker exhibits strong generalization ability, enabling animation of diverse facial structures including cartoons and sculptures. The current rendering network generates frames individually, which can lead to temporal inconsistencies, especially in complex backgrounds. Extreme head poses can lead to blurring artifacts due to limitations of the warping technique. talking face, self-supervised learning, motion encoding, disentanglement, diffusion models
2405.03025 Report Matten: Video Generation with Mamba-Attention Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie, Yujie Zhong, Lin Ma In this paper, we introduce Matten, a cutting-edge latent diffusion model with Mamba-Attention architecture for video generation. With minimal computational cost, Matten employs spatial-temporal attention for local video content modeling and bidirectional Mamba for global video content modeling. Our comprehensive experimental evaluation demonstrates that Matten has competitive performance with the current Transformer-based and GAN-based models in benchmark performance, achieving superior FVD scores and efficiency. Additionally, we observe a direct positive correlation between the complexity of our designed model and the improvement in video quality, indicating the excellent scalability of Matten. This paper introduces Matten, a novel latent diffusion model for video generation that leverages the Mamba-Attention architecture for efficient and high-quality video synthesis. The development of efficient and effective video generation models is crucial due to the increasing demand for high-quality video content in various applications. Existing methods often suffer from high computational costs or limitations in capturing complex spatio-temporal dynamics in videos. The study explored four variants of the Matten model, each employing different combinations of Mamba and attention mechanisms for capturing local and global spatio-temporal relationships in videos. The models were trained and evaluated on four benchmark datasets: FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. Matten achieves competitive FVD scores compared to state-of-the-art video generation models, demonstrating its effectiveness in generating high-quality videos. The study found that combining global Mamba scans for temporal modeling with attention mechanisms for local spatio-temporal modeling yielded the best performance. Matten exhibits good scalability, with larger model sizes leading to improved video generation quality. The lack of pre-trained Mamba-based image generation models necessitates training Matten from scratch, potentially limiting its initial performance. Further research can explore incorporating pre-trained Mamba models and advanced techniques like distillation to enhance Matten's efficiency and performance. video generation, diffusion models, mamba, attention mechanism, spatio-temporal modeling
2405.03008 Report DVMSR: Distillated Vision Mamba for Efficient Super-Resolution Xiaoyan Lei, Wenlong Zhang, Weifeng Cao Efficient Image Super-Resolution (SR) aims to accelerate SR network inference by minimizing computational complexity and network parameters while preserving performance. Existing state-of-the-art Efficient Image Super-Resolution methods are based on convolutional neural networks. Few attempts have been made with Mamba to harness its long-range modeling capability and efficient computational complexity, which have shown impressive performance on high-level vision tasks. In this paper, we propose DVMSR, a novel lightweight Image SR network that incorporates Vision Mamba and a distillation strategy. The network of DVMSR consists of three modules: feature extraction convolution, multiple stacked Residual State Space Blocks (RSSBs), and a reconstruction module. Specifically, the deep feature extraction module is composed of several residual state space blocks (RSSB), each of which has several Vision Mamba Modules (ViMM) together with a residual connection. To achieve efficiency improvement while maintaining comparable performance, we apply a distillation strategy to the Vision Mamba network for superior performance. Specifically, we leverage the rich representation knowledge of the teacher network as additional supervision for the output of lightweight student networks. Extensive experiments have demonstrated that our proposed DVMSR can outperform state-of-the-art efficient SR methods in terms of model parameters while maintaining the performance of both PSNR and SSIM. The source code is available at https://github.com/nathan66666/DVMSR.git This paper introduces DVMSR, a lightweight image super-resolution network leveraging Vision Mamba and a knowledge distillation strategy to achieve efficient inference and maintain high performance. Efficient image super-resolution aims to improve image quality with minimal computational cost and parameter usage, which is crucial for applications on resource-constrained devices. This work explores the potential of Mamba networks for efficient SR. DVMSR consists of feature extraction, multiple Residual State Space Blocks (RSSBs) with Vision Mamba Modules, and a reconstruction module. A distillation strategy is employed where a larger, pre-trained Mamba network guides the learning of the smaller DVMSR. DVMSR outperforms state-of-the-art efficient SR methods in terms of parameter count while achieving comparable or even better PSNR and SSIM scores. The use of Vision Mamba modules enables long-range dependency modeling, leading to improved image details in the reconstruction process. The distillation strategy effectively transfers knowledge from the teacher network to DVMSR, further enhancing its performance. The current study primarily focuses on the final model architecture without extensive exploration of parameter optimization. Further research is needed to investigate the balance point between teacher and student model performance in the distillation process. image super-resolution, efficient deep learning, vision mamba, state space models, knowledge distillation
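A hedged sketch of the distillation objective (supervised L1 plus matching the teacher's output and intermediate features) is given below; the exact feature taps and loss weights used by DVMSR are assumptions.

```python
# Minimal sketch of an output-plus-feature distillation objective for SR; the
# weighting and chosen feature layers are assumptions, not DVMSR's exact recipe.
import torch
import torch.nn.functional as F


def distillation_loss(student_sr, teacher_sr, hr, student_feat, teacher_feat,
                      w_distill: float = 0.5):
    rec = F.l1_loss(student_sr, hr)                            # supervised reconstruction
    out_kd = F.l1_loss(student_sr, teacher_sr.detach())        # match the teacher's output
    feat_kd = F.l1_loss(student_feat, teacher_feat.detach())   # match the teacher's features
    return rec + w_distill * (out_kd + feat_kd)


loss = distillation_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256),
                         torch.rand(1, 3, 256, 256),
                         torch.rand(1, 60, 64, 64), torch.rand(1, 60, 64, 64))
```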
2405.02982 Report Paintings and Drawings Aesthetics Assessment with Rich Attributes for Various Artistic Categories Xin Jin, Qianqian Qiao, Yi Lu, Shan Gao, Heng Huang, Guangdong Li Image aesthetic evaluation is a highly prominent research domain in the field of computer vision. In recent years, there has been a proliferation of datasets and corresponding evaluation methodologies for assessing the aesthetic quality of photographic works, leading to the establishment of a relatively mature research environment. However, in contrast to the extensive research in photographic aesthetics, the field of aesthetic evaluation for paintings and Drawings has seen limited attention until the introduction of the BAID dataset in March 2023. This dataset solely comprises overall scores for high-quality artistic images. Our research marks the pioneering introduction of a multi-attribute, multi-category dataset specifically tailored to the field of painting: Aesthetics of Paintings and Drawings Dataset (APDD). The construction of APDD received active participation from 28 professional artists worldwide, along with dozens of students specializing in the field of art. This dataset encompasses 24 distinct artistic categories and 10 different aesthetic attributes. Each image in APDD has been evaluated by six professionally trained experts in the field of art, including assessments for both total aesthetic scores and aesthetic attribute scores. The final APDD dataset comprises a total of 4985 images, with an annotation count exceeding 31100 entries. Concurrently, we propose an innovative approach: Art Assessment Network for Specific Painting Styles (AANSPS), designed for the assessment of aesthetic attributes in mixed-attribute art datasets. Through this research, our goal is to catalyze advancements in the field of aesthetic evaluation for paintings and drawings, while enriching the available resources and methodologies for its further development and application. This paper introduces APDD, the first multi-attribute, multi-category dataset for aesthetic evaluation of paintings and drawings, addressing the limitations of existing datasets that primarily focus on photographic images or lack attribute annotations. The development of aesthetic evaluation models for paintings and drawings is hampered by the lack of comprehensive datasets that consider diverse artistic categories, styles, and aesthetic attributes. APDD fills this gap and enables research on more nuanced and interpretable aesthetic assessment. A team of 28 professional artists and students constructed APDD by: 1) Defining 24 artistic categories based on painting type, style, and subject matter; 2) Identifying 10 relevant aesthetic attributes; 3) Collecting 4,985 images from various sources; 4) Developing a scoring system and criteria; 5) Annotating images for overall aesthetic score and attribute scores with at least 6 evaluations per image. APDD is the first multi-attribute, multi-category dataset for painting aesthetic evaluation, encompassing 24 categories, 10 attributes, and over 31,100 annotations. The paper proposes AANSPS, a novel network for assessing both total and attribute-specific aesthetic scores in paintings, outperforming existing methods on APDD. The research provides a clear framework for considering aesthetic components in paintings, classifying artistic categories, and defining scoring criteria for aesthetic attributes. 
The current categorization and attributes in APDD, while extensive, are not exhaustive and can be expanded in future work to encompass the full breadth of painting styles and aesthetic qualities. Future work will focus on increasing the size of APDD, adding annotations for more attributes, and incorporating detailed language comments to enhance score interpretability. computer vision, computational aesthetics, image aesthetic assessment, painting dataset, deep learning
2405.02945 Report Invertible Residual Rescaling Models Jinmin Li, Tao Dai, Yaohua Zha, Yilu Luo, Longfei Lu, Bin Chen, Zhi Wang, Shu-Tao Xia, Jingyun Zhang Invertible Rescaling Networks (IRNs) and their variants have witnessed remarkable achievements in various image processing tasks like image rescaling. However, we observe that IRNs with deeper networks are difficult to train, thus hindering the representational ability of IRNs. To address this issue, we propose Invertible Residual Rescaling Models (IRRM) for image rescaling by learning a bijection between a high-resolution image and its low-resolution counterpart with a specific distribution. Specifically, we propose IRRM to build a deep network, which contains several Residual Downscaling Modules (RDMs) with long skip connections. Each RDM consists of several Invertible Residual Blocks (IRBs) with short connections. In this way, RDM allows rich low-frequency information to be bypassed by skip connections and forces models to focus on extracting high-frequency information from the image. Extensive experiments show that our IRRM performs significantly better than other state-of-the-art methods with much fewer parameters and complexity. Particularly, our IRRM has respectively PSNR gains of at least 0.3 dB over HCFlow and IRN in the x4 rescaling while only using 60% parameters and 50% FLOPs. The code will be available at https://github.com/THU-Kingmin/IRRM. This paper proposes Invertible Residual Rescaling Models (IRRM) for highly accurate image rescaling by learning a bijection between a high-resolution image and its low-resolution counterpart with a specific distribution. Existing IRNs face training difficulties with deep networks, hindering their representational ability, and previous methods struggle with high-frequency information recovery. IRRM employs Residual Downscaling Modules (RDMs) with long skip connections to facilitate training and focus on high-frequency information. Each RDM comprises Invertible Residual Blocks (IRBs) with short skip connections to enhance non-linear representation. IRRM significantly outperforms state-of-the-art methods in PSNR and SSIM with fewer parameters and complexity. The residual connections in IRRM enhance model extensibility, enabling stable training with deeper networks. IRRM exhibits robustness to variations in the sampled latent variable 'z', ensuring accurate detail preservation. The paper primarily focuses on image rescaling and may not directly generalize to other image processing tasks. Further investigation into alternative loss functions or network architectures within IRRM could potentially yield additional performance improvements. image rescaling, invertible neural networks, residual learning, deep learning, image processing
2405.02859 Report MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior Honghua Chen, Chen Change Loy, Xingang Pan Despite the emergence of successful NeRF inpainting methods built upon explicit RGB and depth 2D inpainting supervisions, these methods are inherently constrained by the capabilities of their underlying 2D inpainters. This is due to two key reasons: (i) independently inpainting constituent images results in view-inconsistent imagery, and (ii) 2D inpainters struggle to ensure high-quality geometry completion and alignment with inpainted RGB images. To overcome these limitations, we propose a novel approach called MVIP-NeRF that harnesses the potential of diffusion priors for NeRF inpainting, addressing both appearance and geometry aspects. MVIP-NeRF performs joint inpainting across multiple views to reach a consistent solution, which is achieved via an iterative optimization process based on Score Distillation Sampling (SDS). Apart from recovering the rendered RGB images, we also extract normal maps as a geometric representation and define a normal SDS loss that motivates accurate geometry inpainting and alignment with the appearance. Additionally, we formulate a multi-view SDS score function to distill generative priors simultaneously from different view images, ensuring consistent visual completion when dealing with large view variations. Our experimental results show better appearance and geometry recovery than previous NeRF inpainting methods. Presents MVIP-NeRF, a novel method for multiview-consistent inpainting on NeRF scenes using diffusion priors for appearance and geometry completion. Existing NeRF inpainting methods depend on explicit 2D inpainting, leading to inconsistencies and inaccurate geometry. MVIP-NeRF overcomes these limitations by leveraging diffusion priors for joint multiview inpainting. Uses a masked NeRF training scheme with an appearance SDS loss for RGB images and a normal SDS loss for geometry, both guided by diffusion priors. Introduces multi-view score distillation for consistency in large view variations. Achieves better appearance and geometry recovery compared to existing NeRF inpainting techniques on two real-world datasets. Demonstrates the effectiveness of appearance and geometry diffusion priors over using explicit 2D inpainting results. Shows the benefit of multi-view score distillation in improving consistency for scenes with large view changes. Efficiency is impacted by the iterative detail recovery process using diffusion priors. Requires tuning of hyper-parameters related to diffusion priors (e.g., CFGs). nerf, inpainting, diffusion models, score distillation sampling, multiview consistency
2405.02844 Report SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion Ziyun Qian, Zeyu Xiao, Zhenyi Wu, Dingkang Yang, Mingcheng Li, Shunli Wang, Shuaibing Wang, Dongliang Kou, Lihua Zhang Motion style transfer is a significant research direction in multimedia applications. It enables the rapid switching of different styles of the same motion for virtual digital humans, thus vastly increasing the diversity and realism of movements. It is widely applied in multimedia scenarios such as movies, games, and the Metaverse. However, most of the current work in this field adopts the GAN, which may lead to instability and convergence issues, making the final generated motion sequence somewhat chaotic and unable to reflect a highly realistic and natural style. To address these problems, we consider style motion as a condition and propose the Style Motion Conditioned Diffusion (SMCD) framework for the first time, which can more comprehensively learn the style features of motion. Moreover, we apply Mamba model for the first time in the motion style transfer field, introducing the Motion Style Mamba (MSM) module to handle longer motion sequences. Thirdly, aiming at the SMCD framework, we propose Diffusion-based Content Consistency Loss and Content Consistency Loss to assist the overall framework's training. Finally, we conduct extensive experiments. The results reveal that our method surpasses state-of-the-art methods in both qualitative and quantitative comparisons, capable of generating more realistic motion sequences. This paper introduces the Style Motion Conditioned Diffusion (SMCD) framework, a novel approach for motion style transfer that utilizes diffusion models with style motion as a condition, aiming to enhance the realism and naturalness of generated motion sequences. Existing GAN-based methods for motion style transfer suffer from instability and convergence issues, hindering the generation of high-fidelity motion sequences. The SMCD framework addresses these limitations by leveraging the stability and convergence benefits of diffusion models. The SMCD framework utilizes a diffusion model with style motion as a condition to learn motion features and variations comprehensively. It also incorporates a Motion Style Mamba (MSM) module, inspired by the Mamba model, to capture temporal information and preserve long-term dependencies within motion sequences. Additionally, Diffusion-based Content Consistency Loss and Diffusion-based Style Consistency Loss functions are introduced to constrain the content and style of generated motions. SMCD generates more realistic motion sequences compared to state-of-the-art methods, as demonstrated by visual comparisons and quantitative metrics like FID, KID, and Diversity. The framework exhibits strong generalizability, effectively transferring styles to unseen motion categories. Ablation studies confirm the importance of each component within the SMCD framework, including the MSM module and the proposed loss functions, in achieving superior performance. The definition of 'style' in motion style transfer remains an open question and requires further exploration. Future research can focus on expanding the application of diffusion-based methods in motion style transfer to further enhance performance. motion style transfer, diffusion models, motion generation, mamba model, multimedia applications
2405.02843 Report Residual-Conditioned Optimal Transport: Towards Structure-Preserving Unpaired and Paired Image Restoration Xiaole Tang, Xin Hu, Xiang Gu, Jian Sun Deep learning-based image restoration methods generally struggle with faithfully preserving the structures of the original image. In this work, we propose a novel Residual-Conditioned Optimal Transport (RCOT) approach, which models image restoration as an optimal transport (OT) problem for both unpaired and paired settings, introducing the transport residual as a unique degradation-specific cue for both the transport cost and the transport map. Specifically, we first formalize a Fourier residual-guided OT objective by incorporating the degradation-specific information of the residual into the transport cost. We further design the transport map as a two-pass RCOT map that comprises a base model and a refinement process, in which the transport residual is computed by the base model in the first pass and then encoded as a degradation-specific embedding to condition the second-pass restoration. By duality, the RCOT problem is transformed into a minimax optimization problem, which can be solved by adversarially training neural networks. Extensive experiments on multiple restoration tasks show that RCOT achieves competitive performance in terms of both distortion measures and perceptual quality, restoring images with more faithful structures as compared with state-of-the-art methods. This paper proposes Residual-Conditioned Optimal Transport (RCOT), modeling image restoration as an Optimal Transport problem. RCOT introduces a "transport residual" that captures degradation-specific information, improving structure preservation in restored images. Current deep learning-based image restoration methods struggle to balance removing distortions and preserving original image structures. This new approach aims to address this challenge by incorporating degradation-specific knowledge. The method leverages a two-pass process: 1) A base model generates an initial restored image and calculates the "transport residual." 2) The residual is encoded as an embedding, conditioning a second restoration pass for structure preservation. This is framed as a minimax optimization problem, solved by adversarially training neural networks. RCOT achieves competitive performance on benchmark datasets for denoising, deraining, dehazing, and super-resolution. The method excels in preserving structural details compared to existing techniques. Ablation studies confirm the contribution of the residual conditioning and the Fourier residual-guided OT objective. The handcrafted priors used to characterize the Fourier residual may not be optimal for all degradation types. Future work aims to explore automatic and adaptive learning of these priors and extend RCOT to an all-in-one restoration framework. image restoration, optimal transport, structure preservation, deep learning, residual learning
2405.02793 Report ImageInWords: Unlocking Hyper-Detailed Image Descriptions Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets. The paper introduces ImageInWords (IIW), a novel human-in-the-loop framework for creating hyper-detailed, hallucination-free image descriptions, along with a new dataset produced using this method. Existing image description datasets are limited by short, noisy web-scraped captions, hindering the development of vision-language models capable of generating rich, accurate descriptions. IIW combines human annotations with machine-generated seeds in a sequential process. First, object-level descriptions are generated and refined. Then, these are used to create a detailed image description, iteratively improved by multiple annotators. Human evaluation shows IIW descriptions are significantly preferred over those from existing datasets (DCI, DOCCI) and GPT-4V outputs. Models fine-tuned on IIW generate higher-quality descriptions, enabling better image reconstruction with T2I models. IIW descriptions improve compositional reasoning accuracy on ARO, SVO-Probes, and Winoground, demonstrating their richness and detail. The seeded, sequential nature of the framework may introduce biases or inefficiencies depending on initial annotation quality. Human side-by-side evaluations, while comprehensive, were limited to hundreds of samples due to their cost and complexity. image description, vision-language models, dataset, human-in-the-loop, compositional reasoning
2405.02730 Report U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; but meanwhile, the abandonment of U-Net by DiTs and their following improvements is worth rethinking. To this end, we conduct a simple toy experiment by comparing a U-Net architectured DiT with an isotropic one. It turns out that the U-Net architecture only gain a slight advantage amid the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention and bring further improvements despite a considerable amount of reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) in the paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models. The proposed U-DiT could outperform DiT-XL/2 with only 1/6 of its computation cost. Codes are available at https://github.com/YuchuanTian/U-DiT. The paper proposes U-shaped Diffusion Transformers (U-DiTs) for latent-space image generation by leveraging the U-Net architecture with downsampled self-attention to reduce redundancy and enhance performance. Recent Diffusion Transformer (DiT) models utilize isotropic architectures, neglecting the potential benefits of the U-Net structure commonly employed in diffusion models. The authors first investigate a naive U-Net-style DiT (DiT-UNet), then introduce token downsampling for self-attention to improve efficiency. They scale up this approach, incorporating cosine similarity attention, RoPE2D, depthwise convolution in FFN, and re-parametrization. U-DiTs significantly outperform isotropic DiTs, achieving comparable or superior performance with reduced computational costs. U-DiT-B surpasses DiT-XL/2 in FID score with only 1/6th of its computation cost. U-DiTs demonstrate consistent performance improvements with extended training steps (up to 1 million). Further exploration of larger U-DiT models and extended training iterations is limited by computational resources. Future work may involve exploring the application of U-DiTs in other generative tasks beyond image synthesis. diffusion models, vision transformers, u-net, image generation, latent space
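To make the token-downsampling idea concrete, here is one simplified way to run self-attention on a downsampled token grid and upsample the result afterward. This is a sketch under the assumption of a 2x average-pool downsampling; the actual U-DiT block arranges the downsampled query-key-value tokens differently and adds further components (cosine attention, RoPE2D, depthwise FFN).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    """Self-attention over a 2x-downsampled token grid, then upsampled back;
    a simplified stand-in for the downsampled-token attention described above."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, h*w, C) tokens arranged on an h x w grid (h, w divisible by 2)
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        down = F.avg_pool2d(grid, kernel_size=2)             # (B, C, h/2, w/2)
        tokens = down.flatten(2).transpose(1, 2)             # 4x fewer tokens
        attended, _ = self.attn(tokens, tokens, tokens)      # cheaper global attention
        up = attended.transpose(1, 2).reshape(b, c, h // 2, w // 2)
        up = F.interpolate(up, size=(h, w), mode="nearest")  # restore resolution
        return self.proj(up.flatten(2).transpose(1, 2))
```

Since backbone features are low-frequency-dominated, attending over the pooled grid discards little useful information while cutting the attention cost roughly by a factor of 16.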
2405.02700 Report Towards a Scalable Identification of Novel Modes in Generative Models Jingwei Zhang, Mohammad Jalali, Cheuk Ting Li, Farzan Farnia An interpretable comparison of generative models requires the identification of sample types produced more frequently by each of the involved models. While several quantitative scores have been proposed in the literature to rank different generative models, such score-based evaluations do not reveal the nuanced differences between the generative models in capturing various sample types. In this work, we propose a method called Fourier-based Identification of Novel Clusters (FINC) to identify modes produced by a generative model with a higher frequency in comparison to a reference distribution. FINC provides a scalable stochastic algorithm based on random Fourier features to estimate the eigenspace of kernel covariance matrices of two generative models and utilize the principal eigendirections to detect the sample types present more dominantly in each model. We demonstrate the application of the FINC method to standard computer vision datasets and generative model frameworks. Our numerical results suggest the scalability and efficiency of the developed Fourier-based method in highlighting the sample types captured with different frequencies by widely-used generative models. The paper proposes FINC, a scalable algorithm for identifying and clustering novel sample types generated by a generative model with a higher frequency compared to a reference distribution. This addresses the limitations of score-based generative model evaluations, which fail to provide nuanced comparisons of how differently models capture various sample types. FINC uses random Fourier features to approximate the eigenspace of kernel covariance matrices of two generative models. It leverages principal eigendirections to detect dominant sample types in each model. FINC effectively identifies novel modes between real datasets (e.g., AFHQ vs. ImageNet-dogs) and between generative models (e.g., LDM vs. others). Theoretical analysis shows FINC's scalability, requiring a logarithmic number of Fourier features relative to the data size. Empirical evaluation demonstrates FINC's efficiency and accuracy on large-scale image datasets like ImageNet. The paper focuses on Gaussian kernels, exploring other kernels is left for future work. Extending the framework to compare more than two generative models simultaneously is an interesting research direction. generative models, differential clustering, random fourier features, novelty detection, scalability
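A small NumPy sketch of the core mechanism: approximate the Gaussian kernel with shared random Fourier features, form the covariance matrices of the two sample sets in that feature space, and take the top eigendirections of their difference. Function names and the `rho` weighting are illustrative, not the paper's exact formulation.

```python
import numpy as np

def rff_features(x: np.ndarray, num_features: int = 2000, sigma: float = 1.0, seed: int = 0):
    """Random Fourier features approximating a Gaussian (RBF) kernel; the same seed
    must be used for both sample sets so they share one feature map."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / sigma, size=(x.shape[1], num_features))
    b = rng.uniform(0.0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(x @ w + b)

def novel_mode_directions(test_embs: np.ndarray, ref_embs: np.ndarray, rho: float = 1.0, top_k: int = 5):
    """Top eigendirections of the difference of kernel covariance matrices,
    estimated in the shared random-feature space."""
    phi_t = rff_features(test_embs)
    phi_r = rff_features(ref_embs)
    cov_t = phi_t.T @ phi_t / len(phi_t)
    cov_r = phi_r.T @ phi_r / len(phi_r)
    evals, evecs = np.linalg.eigh(cov_t - rho * cov_r)
    order = np.argsort(evals)[::-1][:top_k]  # most positive = over-represented in the test model
    return evals[order], evecs[:, order]

# Samples whose feature vectors score highest along a returned direction
# (rff_features(x) @ v) form a candidate novel mode of the test model.
```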
2405.02696 Report DiffuseTrace: A Transparent and Flexible Watermarking Scheme for Latent Diffusion Model Liangqi Lei, Keke Gai, Jing Yu, Liehuang Zhu Latent Diffusion Models (LDMs) enable a wide range of applications but raise ethical concerns regarding illegal utilization. Adding watermarks to generative model outputs is a vital technique employed for copyright tracking and mitigating potential risks associated with AI-generated content. However, post-hoc watermarking techniques are susceptible to evasion. Existing watermarking methods for LDMs can only embed fixed messages. Watermark message alteration requires model retraining. The stability of the watermark is influenced by model updates and iterations. Furthermore, the current reconstruction-based watermark removal techniques utilizing variational autoencoders (VAE) and diffusion models have the capability to remove a significant portion of watermarks. Therefore, we propose a novel technique called DiffuseTrace. The goal is to embed invisible watermarks in all generated images for future detection semantically. The method establishes a unified representation of the initial latent variables and the watermark information through training an encoder-decoder model. The watermark information is embedded into the initial latent variables through the encoder and integrated into the sampling process. The watermark information is extracted by reversing the diffusion process and utilizing the decoder. DiffuseTrace does not rely on fine-tuning of the diffusion model components. The watermark is embedded into the image space semantically without compromising image quality. The encoder-decoder can be utilized as a plug-in in arbitrary diffusion models. We validate through experiments the effectiveness and flexibility of DiffuseTrace. DiffuseTrace holds an unprecedented advantage in combating the latest attacks based on variational autoencoders and Diffusion Models. DiffuseTrace is a plug-in multi-bit watermarking module for latent diffusion models that protects copyright and enables semantic tracing of generated images. The illicit use of text-to-image models necessitates robust watermarking techniques for copyright protection, user tracing, and mitigating harmful content. DiffuseTrace embeds watermarks into the initial latent variables of the model, subtly influencing the sampling phase without post-processing. It utilizes an encoder-decoder model for watermark embedding and extraction, ensuring semantic consistency and image quality. DiffuseTrace exhibits superior robustness against various image processing techniques and state-of-the-art watermark removal attacks, including VAE and diffusion-based methods. It maintains high image quality and semantic consistency, as evidenced by NIQE, PIQE, and CLIP score metrics. DiffuseTrace is flexible, allowing for watermark message modification without retraining or fine-tuning the model, and generalizable across different diffusion model versions. The paper acknowledges the potential for bit errors at the edges of watermark regions due to diffusion inversion and image processing. Future work may explore further optimization of error correction techniques to address this limitation. latent diffusion model, model watermarking, copyright protection, image generation, deep learning

2405.02608 Report UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model Shuai Yuan, Lei Luo, Zhuo Hui, Can Pu, Xiaoyu Xiang, Rakesh Ranjan, Denis Demandolx Traditional unsupervised optical flow methods are vulnerable to occlusions and motion boundaries due to lack of object-level information. Therefore, we propose UnSAMFlow, an unsupervised flow network that also leverages object information from the latest foundation model Segment Anything Model (SAM). We first include a self-supervised semantic augmentation module tailored to SAM masks. We also analyze the poor gradient landscapes of traditional smoothness losses and propose a new smoothness definition based on homography instead. A simple yet effective mask feature module has also been added to further aggregate features on the object level. With all these adaptations, our method produces clear optical flow estimation with sharp boundaries around objects, which outperforms state-of-the-art methods on both KITTI and Sintel datasets. Our method also generalizes well across domains and runs very efficiently. This paper introduces UnSAMFlow, a novel unsupervised optical flow network that leverages object information from the Segment Anything Model (SAM) to enhance flow estimation accuracy, particularly around object boundaries and occlusion regions. Traditional unsupervised optical flow methods struggle with occlusions and motion boundaries due to their reliance on low-level information and lack of object-level understanding. This paper addresses this by integrating SAM, a powerful image segmentation model, into the flow estimation process. The paper proposes three key adaptations: (1) a self-supervised semantic augmentation module utilizing SAM masks, (2) a regional smoothness loss based on homography to enforce smooth motion within SAM segments, (3) a mask feature module to aggregate features from the same SAM mask for robustness. UnSAMFlow outperforms state-of-the-art unsupervised methods on both KITTI and Sintel benchmarks, achieving 7.83% test error on KITTI-2015. The method produces clear optical flow estimations with sharp boundaries around objects, demonstrating the effectiveness of incorporating object-level information. UnSAMFlow generalizes well across different datasets and runs efficiently. The performance of UnSAMFlow is dependent on the accuracy of SAM masks, which can be affected by factors like lighting conditions and motion blur. The lack of semantic class information in SAM outputs presents a limitation, suggesting an area for future improvement. unsupervised optical flow, segment anything model (sam), semantic augmentation, homography smoothness loss, mask feature module
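The homography-based smoothness idea can be illustrated with a small loss that fits a parametric motion model to the predicted flow inside each SAM mask and penalizes the residual. For brevity this sketch fits an affine model rather than a full homography, so it is a simplification of the paper's definition; the function name and tensor layout are assumptions.

```python
import torch

def parametric_smoothness_loss(flow: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """flow: (2, H, W) predicted optical flow; masks: (K, H, W) SAM masks (0/1).
    Fits an affine motion model per mask and penalizes deviation from it."""
    _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=flow.dtype),
                            torch.arange(w, dtype=flow.dtype), indexing="ij")
    coords = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # (H, W, 3)
    total = flow.new_tensor(0.0)
    for m in masks:
        idx = m.bool()
        if idx.sum() < 8:          # too few pixels to fit a model
            continue
        A = coords[idx]                       # (N, 3) pixel coordinates
        uv = flow.permute(1, 2, 0)[idx]       # (N, 2) flow vectors in this mask
        # Least-squares affine fit; detached so gradients only penalize the residual.
        theta = torch.linalg.lstsq(A.detach(), uv.detach()).solution  # (3, 2)
        total = total + (A @ theta - uv).pow(2).mean()
    return total / max(len(masks), 1)
```

Compared with the usual edge-aware first-order smoothness, a parametric fit tolerates smoothly varying motion inside an object (e.g., a rotating surface) while still penalizing noisy flow within a segment.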
2405.02568 Report ActiveNeuS: Active 3D Reconstruction using Neural Implicit Surface Uncertainty Hyunseo Kim, Hyeonseo Yang, Taekyung Kim, YoonSung Kim, Jin-Hwa Kim, Byoung-Tak Zhang Active learning in 3D scene reconstruction has been widely studied, as selecting informative training views is critical for the reconstruction. Recently, Neural Radiance Fields (NeRF) variants have shown performance increases in active 3D reconstruction using image rendering or geometric uncertainty. However, the simultaneous consideration of both uncertainties in selecting informative views remains unexplored, while utilizing different types of uncertainty can reduce the bias that arises in the early training stage with sparse inputs. In this paper, we propose ActiveNeuS, which evaluates candidate views considering both uncertainties. ActiveNeuS provides a way to accumulate image rendering uncertainty while avoiding the bias that the estimated densities can introduce. ActiveNeuS computes the neural implicit surface uncertainty, providing the color uncertainty along with the surface information. It efficiently handles the bias by using the surface information and a grid, enabling the fast selection of diverse viewpoints. Our method outperforms previous works on popular datasets, Blender and DTU, showing that the views selected by ActiveNeuS significantly improve performance. Proposes ActiveNeuS, an active 3D reconstruction framework that improves next-best view selection by considering both geometric and image rendering uncertainty using a novel acquisition function. Existing methods for active 3D reconstruction using neural implicit representations typically consider only one type of uncertainty (color or density), leading to biased uncertainty integration and suboptimal view selection. Introduces 'neural implicit surface uncertainty' to measure color prediction confidence. Leverages a surface grid and uncertainty grid to efficiently integrate color entropy, prioritizing views with high uncertainty in regions of incomplete reconstruction. Outperforms previous methods in image rendering and mesh reconstruction on Blender and DTU datasets. Selects more diverse viewpoints, leading to better coverage of the scene. Significantly faster next-best view selection compared to ActiveNeRF. Does not address combining uncertainties from different networks (e.g., NeRF and NeuS) for scenes with backgrounds. Future work includes applying ActiveNeuS to robotic active 3D reconstruction. active learning, 3d reconstruction, neural implicit surface, uncertainty estimation, next-best view
2405.02386 Report Rip-NeRF: Anti-aliasing Radiance Fields with Ripmap-Encoded Platonic Solids Junchen Liu, Wenbo Hu, Zhuo Yang, Jianteng Chen, Guoliang Wang, Xiaoxue Chen, Yantong Cai, Huan-ang Gao, Hao Zhao Despite significant advancements in Neural Radiance Fields (NeRFs), the renderings may still suffer from aliasing and blurring artifacts, since it remains a fundamental challenge to effectively and efficiently characterize anisotropic areas induced by the cone-casting procedure. This paper introduces a Ripmap-Encoded Platonic Solid representation to precisely and efficiently featurize 3D anisotropic areas, achieving high-fidelity anti-aliasing renderings. Central to our approach are two key components: Platonic Solid Projection and Ripmap encoding. The Platonic Solid Projection factorizes the 3D space onto the unparalleled faces of a certain Platonic solid, such that the anisotropic 3D areas can be projected onto planes with distinguishable characterization. Meanwhile, each face of the Platonic solid is encoded by the Ripmap encoding, which is constructed by anisotropically pre-filtering a learnable feature grid, to enable featurzing the projected anisotropic areas both precisely and efficiently by the anisotropic area-sampling. Extensive experiments on both well-established synthetic datasets and a newly captured real-world dataset demonstrate that our Rip-NeRF attains state-of-the-art rendering quality, particularly excelling in the fine details of repetitive structures and textures, while maintaining relatively swift training times. This paper presents Rip-NeRF, a novel method employing Ripmap-encoded Platonic solids for anti-aliasing in neural radiance fields. Existing NeRF methods struggle to effectively characterize anisotropic areas, leading to aliasing and blurring artifacts. This method aims to provide high-fidelity, anti-aliased renderings by accurately representing these areas. The method uses Platonic Solid Projection to factorize 3D space onto 2D planes, then employs Ripmap Encoding, an anisotropic area-sampling technique, to accurately featurize projected anisotropic areas on these planes. Rip-NeRF achieves state-of-the-art rendering quality on both synthetic and real-world datasets. It excels in rendering fine details in challenging areas, like specular highlights and repetitive structures. The method offers a flexible trade-off between quality and efficiency through the choice of Platonic solids. The representation faces challenges for unbounded scenes due to self-occlusion and space warping. Future work could explore more advanced 3D to 2D mapping functions to address limitations in unbounded scenes. neural radiance fields, anti-aliasing, anisotropic area-sampling, platonic solid projection, ripmap encoding
2405.02280 Report DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos Wen-Hsuan Chu, Lei Ke, Katerina Fragkiadaki View-predictive generative models provide strong priors for lifting object-centric images and videos into 3D and 4D through rendering and score distillation objectives. A question then remains: what about lifting complete multi-object dynamic scenes? There are two challenges in this direction: First, rendering error gradients are often insufficient to recover fast object motion, and second, view predictive generative models work much better for objects than whole scenes, so, score distillation objectives cannot currently be applied at the scene level directly. We present DreamScene4D, the first approach to generate 3D dynamic scenes of multiple objects from monocular videos via 360-degree novel view synthesis. Our key insight is a "decompose-recompose" approach that factorizes the video scene into the background and object tracks, while also factorizing object motion into 3 components: object-centric deformation, object-to-world-frame transformation, and camera motion. Such decomposition permits rendering error gradients and object view-predictive models to recover object 3D completions and deformations while bounding box tracks guide the large object movements in the scene. We show extensive results on challenging DAVIS, Kubric, and self-captured videos with quantitative comparisons and a user preference study. Besides 4D scene generation, DreamScene4D obtains accurate 2D persistent point track by projecting the inferred 3D trajectories to 2D. We will release our code and hope our work will stimulate more research on fine-grained 4D understanding from videos. DreamScene4D is the first video-to-4D scene generation approach to produce realistic 4D scene representations from complex multi-object videos with large object motion. Existing video-to-4D methods struggle with multi-object scenes exhibiting fast motion, limiting their real-world applicability for tasks like robot perception and augmented reality. DreamScene4D employs a "decompose-recompose" strategy: 1) decompose the video into background and object tracks, 2) lift objects to 3D using Gaussian Splatting and Score Distillation Sampling, 3) factorize and optimize object motion (object-centric, object-to-world, camera), 4) recompose the scene using monocular depth guidance. Significantly outperforms state-of-the-art methods in generating 4D scenes from challenging DAVIS and Kubric videos, as well as self-captured videos with fast motion. Shows superior performance in user preference studies, highlighting its ability to generate more realistic and consistent 4D representations. Achieves accurate 2D persistent point tracking by projecting inferred 3D trajectories, even surpassing methods specifically trained for point tracking. SDS prior struggles with videos captured from steep elevation angles. Scene composition can be suboptimal if rendered and estimated depths misalign. Future work includes exploring data-driven approaches for video-to-4D generation to overcome limitations. video-to-4d, 4d scene generation, novel view synthesis, gaussian splatting, score distillation sampling
2405.02246 Report What matters when building vision-language models? Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training. This paper investigates critical design choices in Vision-Language Models (VLMs) through extensive experiments and introduces Idefics2, an efficient 8B parameter VLM achieving state-of-the-art performance in its size category. Unsupported design decisions in VLMs hinder progress by obscuring performance drivers. This work aims to address this by systematically comparing design choices and their impact on performance, efficiency, and training stability. The authors conduct ablations on various VLM components including pre-trained models, architectures, and data. They analyze the impact of these choices on performance across benchmarks like VQAv2, TextVQA, OKVQA, and COCO. The quality of the language model backbone has a greater impact than the vision backbone on the final VLM performance. While cross-attention architectures excel with frozen backbones, fully autoregressive architectures outperform them when backbones are trained (using techniques like LoRA for stability). Reducing visual tokens with learned pooling and adapting pre-trained vision encoders to preserve aspect ratio/resolution improve efficiency without sacrificing performance. The lack of a large, well-trained, open-source vision encoder is identified as a limitation in the field. Future work includes investigating more nuanced evaluation metrics for open-ended visual question answering tasks to better reflect model capabilities. vision-language models, multimodal learning, model efficiency, benchmarking, open-source models
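The learned-pooling finding can be illustrated with a perceiver-style resampler that compresses many vision-encoder tokens into a small fixed set of learned queries via cross-attention. This is a minimal sketch in the spirit of the pooling the paper ablates, not the Idefics2 implementation; the class name and hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

class LearnedVisualPooling(nn.Module):
    """Pool N vision tokens into a fixed, smaller set of learned query tokens."""

    def __init__(self, dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, C) from the vision encoder, with N often in the hundreds
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return pooled + self.ffn(pooled)   # (B, num_queries, C) tokens passed to the LLM
```

Feeding 64 pooled tokens instead of hundreds of patch tokens shortens the LLM's sequence substantially, which is where most of the efficiency gain comes from.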
2405.02066 Report WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights Youngdong Jang, Dong In Lee, MinHyuk Jang, Jong Wook Kim, Feng Yang, Sangpil Kim The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or explicit NeRF representations. In this work, we introduce an innovative watermarking method that can be employed in both representations of NeRF. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail, we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore, we adopt a deferred back-propagation technique and introduce a combination with the patch-wise loss to improve rendering quality and bit accuracy with minimum trade-offs. We evaluate our method in three different aspects: capacity, invisibility, and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training speed over the compared state-of-the-art methods. This paper introduces a novel watermarking method applicable to both implicit and explicit Neural Radiance Fields (NeRF) representations. Protecting the copyright of NeRF-based 3D representations is crucial with their increasing use in various applications, and existing methods are limited to a single representation type. The method fine-tunes a pre-trained NeRF model to embed binary messages in the rendering process, utilizing discrete wavelet transform in the NeRF space and a deferred back-propagation technique with patch-wise loss. The method achieves state-of-the-art performance in bit accuracy, exceeding previous methods, especially for longer message lengths. It maintains a good balance between watermark invisibility and reconstruction quality, evidenced by high PSNR, SSIM, and low LPIPS scores. The method exhibits robustness against various image distortions, including cropping, brightness changes, and JPEG compression. Training the watermark decoder is time-consuming. The current implementation only allows for a single message per model, requiring retraining for each new message. neural radiance fields, nerf, watermarking, copyright protection, discrete wavelet transform
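A toy sketch of the training signal described above: take the low-frequency subband of a rendered patch with a one-level Haar transform, decode the hidden bits from it, and combine a message loss with a patch-wise reconstruction loss. The `decoder` network, loss weights, and the use of a plain average-pool Haar approximation are assumptions for illustration; the deferred back-propagation machinery is omitted.

```python
import torch
import torch.nn.functional as F

def haar_ll(img: torch.Tensor) -> torch.Tensor:
    """One-level Haar low-low subband (up to a constant factor) of a (B, C, H, W) image."""
    return F.avg_pool2d(img, kernel_size=2) * 2.0

def watermark_loss(rendered_patch, message_bits, decoder, target_patch,
                   w_msg: float = 1.0, w_recon: float = 1.0):
    """message_bits: (B, num_bits) float tensor of 0/1; decoder: hypothetical bit extractor."""
    logits = decoder(haar_ll(rendered_patch))                          # predict bits from LL band
    msg_loss = F.binary_cross_entropy_with_logits(logits, message_bits)
    recon_loss = F.mse_loss(rendered_patch, target_patch)              # keep rendering faithful
    return w_msg * msg_loss + w_recon * recon_loss
```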
2405.02005 Report HoloGS: Instant Depth-based 3D Gaussian Splatting with Microsoft HoloLens 2 Miriam Jäger, Theodor Kapler, Michael Feßenbecker, Felix Birkelbach, Markus Hillemann, Boris Jutzi In the fields of photogrammetry, computer vision and computer graphics, the task of neural 3D scene reconstruction has led to the exploration of various techniques. Among these, 3D Gaussian Splatting stands out for its explicit representation of scenes using 3D Gaussians, making it appealing for tasks like 3D point cloud extraction and surface reconstruction. Motivated by its potential, we address the domain of 3D scene reconstruction, aiming to leverage the capabilities of the Microsoft HoloLens 2 for instant 3D Gaussian Splatting. We present HoloGS, a novel workflow utilizing HoloLens sensor data, which bypasses the need for pre-processing steps like Structure from Motion by instantly accessing the required input data i.e. the images, camera poses and the point cloud from depth sensing. We provide comprehensive investigations, including the training process and the rendering quality, assessed through the Peak Signal-to-Noise Ratio, and the geometric 3D accuracy of the densified point cloud from Gaussian centers, measured by Chamfer Distance. We evaluate our approach on two self-captured scenes: An outdoor scene of a cultural heritage statue and an indoor scene of a fine-structured plant. Our results show that the HoloLens data, including RGB images, corresponding camera poses, and depth sensing based point clouds to initialize the Gaussians, are suitable as input for 3D Gaussian Splatting. This paper introduces HoloGS, a novel workflow for instant 3D scene reconstruction using 3D Gaussian Splatting with data directly acquired from Microsoft HoloLens 2 sensors, eliminating the need for pre-processing steps like Structure from Motion. This work is important because it leverages the real-time capabilities of HoloLens 2 for 3D Gaussian Splatting, potentially enabling instant 3D scene reconstruction and point cloud extraction without time-consuming pre-processing. HoloGS utilizes HoloLens 2 sensor data, including RGB images, camera poses, and depth maps, to initialize and optimize 3D Gaussians for scene representation. The authors evaluate their approach by analyzing rendering quality, PSNR, and geometric accuracy of the densified point cloud extracted from Gaussian centers. HoloGS with internal HoloLens data leads to relatively smooth convergence of 3D Gaussian Splatting, enabling the rendering of novel views that reasonably reflect the scene's geometry and appearance. The quality of results obtained using internal HoloLens data is lower compared to using pre-processed SfM data, indicating potential inaccuracies in HoloLens camera poses. Densified point cloud extraction from Gaussian centers provides a promising avenue for refilling sparse input point clouds, but requires further post-processing to address limitations like floater artifacts and non-uniform point density on low-textured surfaces. The accuracy of HoloGS is limited by the precision of HoloLens camera poses, leading to blurriness and artifacts in rendering and point cloud extraction. Extracting the densified point cloud solely from Gaussian centers has limitations, such as floater artifacts and non-uniform point density on homogeneous surfaces, necessitating further post-processing and refinement. 3d gaussian splatting, microsoft hololens 2, depth sensor, point cloud, 3d reconstruction
2405.01825 Report Improving Concept Alignment in Vision-Language Concept Bottleneck Models Nithish Muthuchamy Selvaraj, Xiaobao Guo, Bingquan Shen, Adams Wai-Kin Kong, Alex Kot Concept Bottleneck Models (CBM) map the input image to a high-level human-understandable concept space and then make class predictions based on these concepts. Recent approaches automate the construction of CBM by prompting Large Language Models (LLM) to generate text concepts and then use Vision Language Models (VLM) to obtain concept scores to train a CBM. However, it is desired to build CBMs with concepts defined by human experts instead of LLM generated concepts to make them more trustworthy. In this work, we take a closer inspection on the faithfulness of VLM concept scores for such expert-defined concepts in domains like fine-grain bird species classification and animal classification. Our investigations reveal that frozen VLMs, like CLIP, struggle to correctly associate a concept to the corresponding visual input despite achieving a high classification performance. To address this, we propose a novel Contrastive Semi-Supervised (CSS) learning method which uses a few labeled concept examples to improve concept alignment (activate truthful visual concepts) in CLIP model. Extensive experiments on three benchmark datasets show that our approach substantially increases the concept accuracy and classification accuracy, yet requires only a fraction of the human-annotated concept labels. To further improve the classification performance, we also introduce a new class-level intervention procedure for fine-grain classification problems that identifies the confounding classes and intervenes their concept space to reduce errors. This paper investigates the faithfulness of Vision-Language Concept Bottleneck Models (VL-CBM) for expert-defined concepts and proposes a Contrastive Semi-Supervised (CSS) learning approach to improve their concept alignment using a limited number of concept labels. While VL-CBMs offer interpretability by leveraging human-understandable concepts, existing methods often exhibit poor concept alignment, hindering their reliability and faithfulness. This work addresses this issue to ensure that VL-CBMs accurately associate visual concepts with the corresponding image regions. The authors introduce a CSS learning method that combines contrastive learning in the concept space with semi-supervised learning from a few labeled concept examples. This encourages consistent concept scores within classes and discriminates between classes while aligning predictions with ground truth. CSS substantially increases concept accuracy on CUB (+39.1%), RIVAL (+18.63%), and AwA2 (+31.11%) datasets using only a small percentage of human-annotated concept labels. CSS enhances classification accuracy, surpassing black-box models on CUB and approaching their performance on other datasets. A proposed class-level intervention procedure effectively reduces errors for confounding classes in fine-grain classification, further improving overall performance. VL-CBMs may struggle with ineffable concepts that are difficult to express in language. The assumption that all salient concepts are known beforehand may not hold for all tasks, limiting their applicability in such cases. concept bottleneck model, interpretability, semi-supervised learning, vision-language models, concept alignment
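A compact sketch of a contrastive semi-supervised objective of the kind described above: a task loss on the CBM class head, a supervised concept loss on the few labeled examples, and a class-level contrastive term that pulls same-class concept-score vectors together. Loss weights, the temperature, and the exact contrastive form are illustrative assumptions rather than the paper's precise objective.

```python
import torch
import torch.nn.functional as F

def css_loss(concept_scores, class_labels, concept_labels, labeled_mask, cbm_logits,
             temperature: float = 0.1, w_sup: float = 1.0, w_con: float = 0.5):
    """concept_scores: (B, K) VLM concept scores; concept_labels: (B, K) 0/1 annotations,
    valid only where labeled_mask is True; cbm_logits: (B, num_classes)."""
    task = F.cross_entropy(cbm_logits, class_labels)

    sup = (F.binary_cross_entropy_with_logits(concept_scores[labeled_mask],
                                              concept_labels[labeled_mask].float())
           if labeled_mask.any() else concept_scores.new_tensor(0.0))

    # Supervised-contrastive term on concept-score vectors: same class = positive pair.
    z = F.normalize(concept_scores, dim=-1)
    sim = z @ z.t() / temperature
    same = (class_labels[:, None] == class_labels[None, :]).float()
    same.fill_diagonal_(0)
    off_diag = 1 - torch.eye(len(z), device=z.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(off_diag == 0, -1e9), dim=1, keepdim=True)
    contrast = -(same * log_prob).sum(1) / same.sum(1).clamp(min=1)

    return task + w_sup * sup + w_con * contrast.mean()
```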
2405.01536 Report Customizing Text-to-Image Models with a Single Image Pair Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan Zhu Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair. This paper presents Pair Customization, a method for customizing text-to-image models using a single image pair to learn stylistic differences. Existing model customization methods struggle to disentangle style from content when trained on single images, often leading to overfitting. The method employs joint optimization of separate style and content LoRA weights, enforcing orthogonality for better disentanglement. It also introduces style guidance during inference for enhanced stylization control and content preservation. Pair Customization successfully learns stylistic differences from a single image pair and applies them to new content while preserving structure. Quantitative evaluation demonstrates superior performance in style similarity and structure preservation compared to baselines. Human preference studies confirm user preference for Pair Customization over existing methods. The method may struggle to transfer styles to categories significantly different from the training pair. Reliance on test-time optimization can be computationally demanding, suggesting future exploration of encoder-based approaches for efficiency. text-to-image synthesis, model customization, style transfer, diffusion models, low-rank adaptation
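The separation into style and content LoRA weight spaces with an orthogonality constraint can be sketched as a linear layer carrying two independent low-rank adapters plus a penalty on the overlap of their input subspaces. The class name, rank, and the specific penalty below are illustrative assumptions; the paper's placement of adapters and loss weighting may differ.

```python
import torch
import torch.nn as nn

class StyleContentLoRA(nn.Module):
    """A base linear layer with separate content and style LoRA adapters."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        d_out, d_in = base.weight.shape
        self.content_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.content_B = nn.Parameter(torch.zeros(d_out, rank))
        self.style_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.style_B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x, use_style: bool = True, style_scale: float = 1.0):
        out = self.base(x) + x @ self.content_A.t() @ self.content_B.t()
        if use_style:  # style branch applied only when reproducing the stylized image
            out = out + style_scale * (x @ self.style_A.t() @ self.style_B.t())
        return out

    def orthogonality_penalty(self) -> torch.Tensor:
        # Push the content and style low-rank input subspaces toward orthogonality.
        return (self.content_A @ self.style_A.t()).pow(2).sum()
```

At inference, `style_scale` gives a knob for how strongly the learned stylistic difference is applied to new content.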
2405.01533 Report OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez The advances in multimodal large language models (MLLMs) have led to growing interests in LLM-based autonomous driving agents to leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes. This paper presents OmniDrive, a holistic framework for autonomous driving that leverages large language models (LLMs) for enhanced 3D perception, reasoning, and planning. Existing LLM-based driving systems struggle with 3D spatial understanding and often rely on limited open-loop benchmarks. OmniDrive addresses these limitations, aiming to improve decision-making and planning in complex driving scenarios. The authors introduce a novel 3D MLLM architecture (OmniDrive-Agent) that uses sparse queries to process high-resolution multi-view video input, enabling efficient 3D perception. They also develop a comprehensive benchmark (OmniDrive-nuScenes) with visual question-answering tasks for evaluating 3D reasoning, counterfactual reasoning, and planning. OmniDrive-Agent exhibits strong 3D reasoning capabilities, surpassing previous methods in counterfactual reasoning and open-loop planning tasks. The use of sparse queries and a Q-Former-styled design allows for efficient processing of multi-view video data, addressing limitations of prior LLM architectures. The proposed OmniDrive-nuScenes benchmark offers valuable insights into the capabilities and limitations of LLM-based autonomous driving systems. The method's effectiveness needs further validation on larger datasets like nuPlan. Current counterfactual reasoning simulations don't incorporate reactions from other agents, requiring a more sophisticated closed-loop setup. autonomous driving, large language models, 3d perception, counterfactual reasoning, planning
2405.01496 Report LocInv: Localization-aware Inversion for Text-Guided Image Editing Chuanming Tang, Kai Wang, Fei Yang, Joost van de Weijer Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Based on the T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we are compelling the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at https://github.com/wangkai930418/DPL This paper introduces Localization-aware Inversion (LocInv), a method enhancing text-guided image editing in text-to-image diffusion models by refining cross-attention maps using segmentation maps or bounding boxes as localization priors. Existing text-guided image editing methods often struggle with unintended modifications outside the target area due to inaccuracies in cross-attention maps, especially in complex multi-object images. This method addresses this 'cross-attention leakage' issue. LocInv utilizes dynamic prompt learning, updating token representations of objects at each denoising step. It optimizes similarity and overlapping losses to align cross-attention maps with provided localization priors. Additionally, it introduces an adjective binding loss to improve attribute editing by strengthening the connection between adjectives and their corresponding nouns. LocInv significantly improves the accuracy of cross-attention maps, leading to more precise and controlled image editing. The method excels in local editing tasks, including Word-Swap (replacing an object) and Attribute-Edit (modifying object attributes), outperforming existing methods in both qualitative and quantitative evaluations. It effectively preserves the background and maintains semantic similarity between the original and edited objects, particularly in complex multi-object images. The method's reliance on the size of cross-attention maps (smaller maps offer better semantic information but limit fine-grained control) poses a limitation. The frozen Stable Diffusion model’s inherent limitations in editing capabilities and high-frequency detail reconstruction might affect the quality of editing results. image editing, text-to-image synthesis, diffusion models, cross-attention, localization priors
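The similarity and overlapping losses can be illustrated with a small function that compares a noun token's cross-attention map against its segmentation or bounding-box prior; the token embedding is then updated by gradient descent on these losses at each denoising step. The exact loss forms below are illustrative, not necessarily the paper's.

```python
import torch

def localization_losses(attn_map: torch.Tensor, prior_mask: torch.Tensor):
    """attn_map: (H, W) cross-attention map for one noun token;
    prior_mask: (H, W) binary localization prior (segmentation or box)."""
    a = attn_map / (attn_map.sum() + 1e-8)   # normalize attention into a distribution
    m = prior_mask.float()

    # Similarity loss: cosine similarity between attention and the prior region.
    sim = (a * m).sum() / (a.norm() * m.norm() + 1e-8)
    sim_loss = 1.0 - sim

    # Overlapping loss: attention mass leaking outside the prior region.
    overlap_loss = (a * (1.0 - m)).sum()
    return sim_loss, overlap_loss
```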
2405.01413 Report MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Yixue Hao, Long Hu, Min Chen Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLM requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12 increase on GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800. We are the first to explore the efficient 3D-LLM, offering new insights to the community. Code and weights are available at https://github.com/TangYuan96/MiniGPT-3D. Presents MiniGPT-3D, an efficient 3D-LLM that leverages 2D priors from 2D-LLMs to align 3D point clouds with LLMs, achieving state-of-the-art results with significantly reduced training costs. Training 3D-LLMs is computationally expensive, hindering research and applications. This work introduces an efficient approach using 2D-LLMs as priors to bridge the modality gap between 3D point clouds and LLMs. Introduces a four-stage training strategy: (1) Align point cloud encoder with 2D-LLM, (2) Transfer 2D knowledge to 3D, (3) Enhance 3D-language understanding with challenging tasks, (4) Utilize Mixture of Query Experts for adaptive feature aggregation. Employs parameter-efficient fine-tuning methods and an efficient LLM backbone. Achieves state-of-the-art performance on generative 3D object classification, outperforming baselines by significant margins on ModelNet40 and Objaverse datasets. Sets new state-of-the-art in 3D object captioning, demonstrating superior detail comprehension and accuracy compared to existing methods. Exhibits strong generalization ability, robustness to prompt variations, and comprehensive understanding of 3D objects, enabling detailed captioning and open-ended dialogue. Limited to object-level understanding, not applicable to large-scale point clouds. Focuses on static 3D objects, lacking the ability to recognize actions in dynamic scenarios. multimodal large language models, 3d point cloud understanding, efficiently multimedia alignment, mixture of experts, 2d priors
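A sketch of the mixture-of-query-experts idea: several learnable query sets are mixed by a lightweight gate conditioned on the point-cloud feature, and the mixed queries attend to point features before being handed to the LLM. Sizes, the pooling used for gating, and the single cross-attention layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MixtureOfQueryExperts(nn.Module):
    """Gate over several learnable query sets, then cross-attend to point features."""

    def __init__(self, dim: int, num_experts: int = 4, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(num_experts, num_queries, dim) * 0.02)
        self.gate = nn.Linear(dim, num_experts)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (B, N, C) features from the point-cloud encoder
        weights = self.gate(point_feats.mean(dim=1)).softmax(dim=-1)   # (B, E) per-sample mix
        queries = torch.einsum("be,eqc->bqc", weights, self.experts)   # (B, Q, C) mixed queries
        out, _ = self.cross_attn(queries, point_feats, point_feats)
        return out   # (B, Q, C) tokens projected onward toward the LLM
```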
2405.01356 Report Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang In subject-driven text-to-image synthesis, the synthesis process tends to be heavily influenced by the reference images provided by users, often overlooking crucial attributes detailed in the text prompt. In this work, we propose Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance, one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally, we demonstrate its applicability in second-order customization methods, where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements, as evidenced by our evaluations and user studies. This paper introduces Subject-Agnostic Guidance (SAG), a simple yet effective method to address the "content ignorance" issue in subject-driven text-to-image synthesis, where crucial text prompt attributes are often overlooked due to the strong influence of reference subject images. The dominance of subject information in existing methods hinders the generation of diverse and text-aligned outputs, limiting the flexibility and creative potential of subject-driven synthesis. SAG constructs a subject-agnostic embedding from user inputs and employs a dual classifier-free guidance (DCFG) strategy. This approach leverages the subject-agnostic embedding, especially in early generation stages, to prioritize text-guided content and structure before incorporating subject details. SAG significantly improves text alignment without sacrificing subject fidelity, as demonstrated through qualitative and quantitative comparisons with existing methods like DreamBooth, Textual Inversion, and ELITE. The effectiveness of SAG is validated across different subject-driven synthesis approaches, including optimization-based, encoder-based, and second-order customization methods. User studies consistently show a strong preference for SAG-generated outputs, indicating its ability to achieve a desirable balance between content and subject consistency. The quality of outputs generated with SAG is inherently limited by the capabilities of the underlying text-to-image generation model, which might struggle with uncommon content. Future work could explore incorporating a more robust synthesis network to further enhance the quality and diversity of outputs. text-to-image synthesis, subject-driven generation, content ignorance, classifier-free guidance, subject-agnostic embedding
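The dual classifier-free guidance can be written as two nested guidance terms: one from the unconditional prediction toward the subject-agnostic text condition, and one from the subject-agnostic toward the subject-specific condition. The sketch below assumes the three noise predictions are computed separately per step; the guidance weights are illustrative, and the paper additionally emphasizes the subject-agnostic branch in early timesteps.

```python
import torch

def dual_cfg(eps_uncond: torch.Tensor, eps_agnostic: torch.Tensor, eps_subject: torch.Tensor,
             w_text: float = 7.5, w_subject: float = 2.0) -> torch.Tensor:
    """Combine denoiser outputs under no condition, the subject-agnostic prompt,
    and the subject-specific prompt into one guided noise prediction."""
    return (eps_uncond
            + w_text * (eps_agnostic - eps_uncond)       # content/structure from the text prompt
            + w_subject * (eps_subject - eps_agnostic))  # subject details layered on top
```

Annealing `w_subject` from small to large over the denoising trajectory is one simple way to realize the early emphasis on subject-agnostic content described above.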
2405.01008 Report On Mechanistic Knowledge Localization in Text-to-Image Generative Models Samyadeep Basu, Keivan Rezaei, Priyatham Kattakinda, Ryan Rossi, Cherry Zhao, Vlad Morariu, Varun Manjunatha, Soheil Feizi Identifying layers within text-to-image models which control visual attributes can facilitate efficient model editing through closed-form updates. Recent work, leveraging causal tracing show that early Stable-Diffusion variants confine knowledge primarily to the first layer of the CLIP text-encoder, while it diffuses throughout the UNet. Extending this framework, we observe that for recent models (e.g., SD-XL, DeepFloyd), causal tracing fails in pinpointing localized knowledge, highlighting challenges in model editing. To address this issue, we introduce the concept of Mechanistic Localization in text-to-image models, where knowledge about various visual attributes (e.g., "style", "objects", "facts") can be mechanistically localized to a small fraction of layers in the UNet, thus facilitating efficient model editing. We localize knowledge using our method LocoGen which measures the direct effect of intermediate layers to output generation by performing interventions in the cross-attention layers of the UNet. We then employ LocoEdit, a fast closed-form editing method across popular open-source text-to-image models (including the latest SD-XL) and explore the possibilities of neuron-level model editing. Using Mechanistic Localization, our work offers a better view of successes and failures in localization-based text-to-image model editing. Code will be available at https://github.com/samyadeepbasu/LocoGen. This paper introduces LocoGen, a method for identifying localized control regions for visual attributes in text-to-image models, and explores efficient model editing using LocoEdit. Existing methods, like causal tracing, are not generalizable to newer text-to-image models, limiting the ability to interpret and edit these models effectively. LocoGen identifies controlling layers in the UNet by measuring the effect of altered text embeddings on specific visual attributes. LocoEdit performs closed-form weight updates in identified layers for model editing. LocoGen successfully identifies unique control points for visual attributes across various text-to-image models. LocoEdit enables efficient and interpretable model editing by updating specific layers in the UNet. The paper demonstrates the potential for neuron-level model editing by selectively dropping out neurons in identified layers. Closed-form edits using LocoEdit are not effective for DeepFloyd, likely due to the use of a bi-directional T5 text-encoder. Neuron-level editing, while promising, requires further investigation to address the trade-off between style removal and image quality. text-to-image generation, model interpretability, model editing, knowledge localization, cross-attention
2405.00998 Report Part-aware Shape Generation with Latent 3D Diffusion of Neural Voxel Fields Yuhang Huang, SHilong Zou, Xinwang Liu, Kai Xu This paper presents a novel latent 3D diffusion model for the generation of neural voxel fields, aiming to achieve accurate part-aware structures. Compared to existing methods, there are two key designs to ensure high-quality and accurate part-aware generation. On one hand, we introduce a latent 3D diffusion process for neural voxel fields, enabling generation at significantly higher resolutions that can accurately capture rich textural and geometric details. On the other hand, a part-aware shape decoder is introduced to integrate the part codes into the neural voxel fields, guiding the accurate part decomposition and producing high-quality rendering results. Through extensive experimentation and comparisons with state-of-the-art methods, we evaluate our approach across four different classes of data. The results demonstrate the superior generative capabilities of our proposed method in part-aware shape generation, outperforming existing state-of-the-art methods. This paper introduces a novel latent 3D diffusion model for generating neural voxel fields with accurate part-aware structures. Generating part-aware 3D shapes is important for downstream tasks such as editing, mix-and-match modeling, and segmentation learning. Existing methods are often part-oblivious or have limitations in generative ability and rendering quality. The method uses a latent 3D diffusion process on a compressed latent space for high-resolution generation and a part-aware shape decoder that integrates part codes into the neural voxel field to guide accurate part decomposition. Achieves higher resolution (96^3) than previous diffusion-based methods for neural fields, capturing richer details. Outperforms state-of-the-art methods in terms of FID metric across four different object classes (Chair, Table, Airplane, Car). Exhibits superior qualitative results, demonstrating accurate part-aware generation and high-quality rendering. Collecting 2D semantic part maps for supervision can be challenging. Future work includes exploring pseudo-label based part-aware generation to reduce reliance on labeled data. shape generation, diffusion model, part-aware generation, neural voxel fields, 3d deep learning
2405.00954 Report X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation Yiwei Ma, Zhekai Lin, Jiayi Ji, Yijun Fan, Xiaoshuai Sun, Rongrong Ji Recent advancements in automatic 3D avatar generation guided by text have made significant progress. However, existing methods have limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential Geometry->Texture->Animation paradigm, simplifying optimization through step-by-step generation. To tackle oversaturation, we introduce Adaptive Variational Parameter (AVP), representing avatars as an adaptive distribution during training. Additionally, we present Avatar-aware Score Distillation Sampling (ASDS), a novel technique that incorporates avatar-aware noise into rendered images for improved generation quality during optimization. Extensive evaluations confirm the superiority of X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous project page: https://xmu-xiaoma666.github.io/Projects/X-Oscar/. This paper presents X-Oscar, a novel progressive framework for generating high-quality, animatable 3D avatars from text prompts. Existing methods for text-guided 3D avatar generation often suffer from limitations like oversaturation and low-quality output, hindering their applicability in various domains like gaming and animation. X-Oscar leverages the SMPL-X body model and adopts a sequential "Geometry→Texture→Animation" optimization strategy. It introduces two novel components: (1) Adaptive Variational Parameter (AVP), which represents avatars as adaptive distributions to mitigate oversaturation, and (2) Avatar-aware Score Distillation Sampling (ASDS), which incorporates geometry- and appearance-aware noise for improved quality. X-Oscar effectively addresses oversaturation in avatar generation, resulting in visually appealing and realistic outputs. The progressive modeling paradigm with separate optimization stages for geometry, texture, and animation significantly enhances the quality of generated avatars. Extensive evaluations, including user studies and comparisons with state-of-the-art methods, demonstrate X-Oscar's superiority in generating high-quality, animatable avatars consistent with text prompts. The reliance on the SMPL-X model might limit the diversity of generatable body shapes. Exploring higher-resolution textures and more complex animation sequences could further enhance avatar realism. 3d avatar generation, text-guided synthesis, score distillation sampling, oversaturation mitigation, progressive modeling
2405.00942 Report LLaVA Finds Free Lunch: Teaching Human Behavior Improves Content Understanding Abilities Of LLMs Somesh Singh, Harini S I, Yaman K Singla, Veeky Baths, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy Communication is defined as "Who says what to whom with what effect." A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Even after carrying signals about the message, the behavior data is often ignored while training large language models. We show that training LLMs on receiver behavior can actually help improve their content-understanding abilities. Specifically, we show that training LLMs to predict the receiver behavior of likes and comments improves the LLM's performance on a wide variety of downstream content understanding tasks. We show this performance increase over 40 video and image understanding tasks over 23 benchmark datasets across both 0-shot and fine-tuning settings, outperforming many supervised baselines. Moreover, since receiver behavior, such as likes and comments, is collected by default on the internet and does not need any human annotations to be useful, the performance improvement we get after training on this data is essentially free-lunch. We release the receiver behavior cleaned comments and likes of 750k images and videos collected from multiple platforms along with our instruction-tuning data. This paper investigates whether training large language models (LLMs) on receiver behavior (e.g., likes, comments) can enhance their content understanding abilities. Behavior data, though often discarded, implicitly carries rich signals about the content it interacts with. Leveraging this readily available resource could lead to significant improvements in content understanding tasks across various domains. The authors collected a large-scale dataset (BLIFT) of images and videos from Reddit and YouTube, along with their corresponding comments and likes. They then fine-tuned LLaMA-Vid, a large vision and language model, on BLIFT to predict user behavior given the content. Ablation studies were conducted to compare the impact of different behavioral data types (perception vs. action) and sources. Training on receiver behavior (Behavior-LLaVA) consistently outperformed the base LLaMA-Vid and a data-augmented variant (Ad-LLaVA) across 40 tasks and 23 benchmark datasets. The improvements were particularly pronounced for high-level understanding tasks like emotion recognition and persuasion strategy classification. Action-level behavior (comments, likes) proved more effective than perception-level behavior (saliency) for enhancing content understanding, likely due to its availability at scale. The study primarily focused on comments and likes as behavioral signals, limiting the exploration of other rich action-level data. Future work could delve deeper into the relationship between specific behavioral patterns and different aspects of content understanding. large language models, content understanding, behavior modeling, digital analytics, vision and language
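Since likes and comments are collected by default on platforms, turning them into training signal mostly amounts to reformatting scraped records into instruction-tuning examples. The sketch below shows one plausible formatting; the field names and prompt wording are hypothetical and do not reflect the released BLIFT schema.

```python
def to_behavior_instruction(sample: dict) -> dict:
    """Format a scraped media record plus its receiver behavior into an
    instruction-tuning example (hypothetical schema, for illustration only)."""
    prompt = (
        "You are shown a piece of media content.\n"
        "Predict how viewers will react: estimate the like count and write a likely top comment."
    )
    target = (
        f"Estimated likes: {sample['likes']}\n"
        f"Likely comment: {sample['top_comment']}"
    )
    return {"media_path": sample["media_path"], "instruction": prompt, "response": target}

# Example with a made-up record:
# to_behavior_instruction({"media_path": "vid_001.mp4", "likes": 12400,
#                          "top_comment": "The drone shot at 0:42 is unreal."})
```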
2405.00915 Report EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion Guangyao Zhai, Evin Pınar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, Benjamin Busam We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enables collaborative information exchange, enhancing controllable and consistent generation aware of global constraints. This is achieved through an information echo scheme in both shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced. EchoScene, an interactive and controllable generative model for synthesizing 3D indoor scenes from scene graphs using a dual-branch diffusion model. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene employs a dual-branch diffusion model with an information echo scheme. It associates each node with a denoising process in both shape and layout branches, enabling collaborative information exchange through an information exchange unit using graph convolution. EchoScene outperforms previous methods in generation fidelity, achieving lower FID, FID CLIP, and KID scores. It demonstrates superior robustness in handling graph manipulation, accurately reflecting changes in node addition and relation adjustments. The method effectively maintains inter-object consistency, generating shapes and layouts that adhere to global scene graph constraints. The model's reliance on a limited dataset may restrict the diversity of generated scenes. Exploration of alternative information exchange mechanisms within the echo scheme could further enhance generation quality. scene graph, diffusion model, 3d scene generation, controllable generation, information exchange
2405.00900 Report LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes Shanlin Sun, Bingbing Zhuang, Ziyu Jiang, Buyu Liu, Xiaohui Xie, Manmohan Chandraker Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow a better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which is fused with the implicit grid-based representation for radiance decoding, thereby supplying stronger geometric information offered by explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing densified Lidar points by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes. This paper presents a novel framework leveraging Lidar data to enhance the quality of Neural Radiance Fields (NeRFs) for street scenes, particularly addressing challenges posed by sparse and collinear camera trajectories in autonomous driving scenarios. Photorealistic simulation for applications like autonomous driving requires high-quality NeRFs, but street scenes with limited camera viewpoints and low-texture environments pose significant difficulties. Existing methods struggle to produce satisfactory results, necessitating improved techniques. The proposed framework fuses Lidar-derived geometric features with the implicit grid-based representation of NeRFs. It introduces a robust occlusion-aware depth supervision scheme using densified Lidar points and generates augmented training views from Lidar projections to address view sparsity. The method achieves state-of-the-art performance on the Pandaset benchmark, outperforming existing NeRF techniques in terms of visual fidelity and accuracy. The robust depth supervision scheme effectively utilizes dense Lidar data while mitigating errors caused by occlusions, leading to improved geometry reconstruction. Lidar encoding and augmented view supervision further enhance the rendering of fine details and improve performance in extrapolation scenarios, particularly for regions sparsely captured in the original data. The current framework focuses on static backgrounds and does not handle dynamic objects. Future work could explore extending the insights of Lidar integration to model dynamic elements in street scenes. neural radiance fields, lidar, autonomous driving, novel view synthesis, depth supervision
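As a rough illustration of the "robust occlusion-aware depth supervision" idea in this entry, the sketch below compares rendered NeRF depth against Lidar-derived depth along each ray and drops rays whose Lidar point falls well behind the rendered surface. The masking criterion, the `margin` threshold, and the Huber penalty are simplifying assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def robust_depth_loss(rendered_depth, lidar_depth, valid, margin=0.5):
    """Hedged sketch of occlusion-aware depth supervision for a street-scene NeRF.

    rendered_depth: (N,) expected ray-termination depth from volume rendering
    lidar_depth:    (N,) depth of accumulated/densified Lidar points projected to rays
    valid:          (N,) bool mask for rays that have a Lidar return
    margin:         rays whose Lidar point lies far behind the rendered surface are
                    treated as occluded and excluded (simplified criterion).
    """
    occluded = lidar_depth > rendered_depth + margin
    keep = valid & ~occluded
    if keep.sum() == 0:
        return rendered_depth.sum() * 0.0  # keep the graph, contribute nothing
    return F.huber_loss(rendered_depth[keep], lidar_depth[keep])
```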
2405.00794 Report Coherent 3D Portrait Video Reconstruction via Triplane Fusion Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Josef Spjut, Henry Fuchs, Shalini De Mello, Koki Nagano Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a personalized 3D prior, but fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearances. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets. This paper presents a novel triplane fusion method for reconstructing coherent and high-fidelity 3D portrait videos from monocular RGB videos, aiming to improve the realism of 3D telepresence systems. Existing single-image 3D reconstruction methods suffer from temporal inconsistency and struggle to maintain stable identity across frames. On the other hand, 3D self-reenactment methods, while temporally consistent, fail to faithfully reconstruct the dynamic appearance of users in real-time, such as expressions and lighting. The proposed method leverages a pre-trained LP3D model to construct a personal triplane prior from a frontal reference image. For each input frame, a raw triplane is extracted using LP3D and then fused with the prior. This fusion process involves a Triplane Undistorter to remove view-dependent distortions and a Triplane Fuser to combine the undistorted triplane with the prior while preserving dynamic appearances. The method successfully captures authentic dynamic appearances (e.g., facial expressions, lighting) while producing temporally consistent 3D videos. Trained solely on synthetic data generated from an expression-conditioned 3D GAN, the approach achieves state-of-the-art 3D reconstruction accuracy and temporal consistency on both in-studio and in-the-wild datasets. A new multi-view evaluation protocol is introduced to assess a method's robustness to input viewpoint variations and consistency across generated novel views. Fusing side views with significantly different expressions compared to the reference view can result in blurry reconstructions due to triplane alignment ambiguity. The current implementation relies on a single reference image; incorporating multiple reference images with varying expressions and head poses could further enhance performance. 3d portrait video reconstruction, neural rendering, triplane representation, temporal consistency, single-view reconstruction
2405.00791 Report Obtaining Favorable Layouts for Multiple Object Generation Barak Battash, Amit Rozner, Lior Wolf, Ofir Lindenbaum Large-scale text-to-image models that can generate high-quality and diverse images based on textual prompts have shown remarkable success. These models aim ultimately to create complex scenes, and addressing the challenge of multi-subject generation is a critical step towards this goal. However, the existing state-of-the-art diffusion models face difficulty when generating images that involve multiple subjects. When presented with a prompt containing more than one subject, these models may omit some subjects or merge them together. To address this challenge, we propose a novel approach based on a guiding principle. We allow the diffusion model to initially propose a layout, and then we rearrange the layout grid. This is achieved by enforcing cross-attention maps (XAMs) to adhere to proposed masks and by migrating pixels from latent maps to new locations determined by us. We introduce new loss terms aimed at reducing XAM entropy for clearer spatial definition of subjects, reduce the overlap between XAMs, and ensure that XAMs align with their respective masks. We contrast our approach with several alternative methods and show that it more faithfully captures the desired concepts across a variety of text prompts. This paper proposes a novel approach to address the difficulty of existing diffusion models in generating images with multiple distinct subjects, focusing on preventing subject omission and merging in complex scene generation. Generating images with multiple subjects is a critical challenge for text-to-image models as it's essential for creating complex and realistic scenes based on user prompts. The proposed method uses a three-phase approach: 1) Excite and distinguish: Encourages distinct spatial representation for each subject's token in early diffusion steps. 2) Rearrange the generation grid: Extracts and optimizes subject masks to minimize overlap and rearranges the latent space accordingly. 3) Follow the masks: Guides subsequent diffusion steps to adhere to the optimized subject masks. The method significantly outperforms existing state-of-the-art models in generating images with multiple subjects, showing reduced subject omission and blending. Quantitative evaluations using Llava1.5, Qwen-VL-Chat, and BLIP2 demonstrate substantial improvements across various metrics, especially with increasing subject numbers. The approach effectively combines with attribute binding techniques, further enhancing the overall quality and correctness of generated images. The method increases inference time due to the multi-step optimization process. Forcing a specific layout can sometimes result in unnatural object arrangements or slightly reduced image quality, necessitating further improvements in mask generation and optimization strategies. text-to-image synthesis, diffusion models, multi-subject generation, cross-attention maps, layout optimization
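The three loss ideas named in this entry (lower XAM entropy, less XAM overlap, XAM-to-mask alignment) can be written down compactly. The sketch below is a generic formulation under assumed tensor shapes; the paper's exact normalization, weighting, and optimization schedule are not claimed here.

```python
import torch

def xam_losses(xams, masks, eps=1e-8):
    """Sketch of the three loss terms described above (entropy, overlap, mask fit).

    xams:  (S, H, W) cross-attention maps, one per subject token (hypothetical shapes)
    masks: (S, H, W) binary subject masks proposed/optimized for the layout
    """
    p = xams / (xams.sum(dim=(1, 2), keepdim=True) + eps)         # normalize per subject
    entropy = -(p * (p + eps).log()).sum(dim=(1, 2)).mean()        # sharper spatial maps
    overlap = torch.zeros((), dtype=p.dtype, device=p.device)
    S = p.shape[0]
    for i in range(S):
        for j in range(i + 1, S):
            overlap = overlap + (p[i] * p[j]).sum()                # keep subjects apart
    mask_fit = ((1.0 - masks) * p).sum(dim=(1, 2)).mean()          # stay inside the mask
    return entropy, overlap, mask_fit
```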
2405.00760 Report Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models Xiaoshi Wu, Yiming Hao, Manyuan Zhang, Keqiang Sun, Zhaoyang Huang, Guanglu Song, Yu Liu, Hongsheng Li Optimizing a text-to-image diffusion model with a given reward function is an important but underexplored research area. In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-to-image diffusion model and back-propagates through the iterative sampling process to the input noise. We find that training earlier steps in the sampling process is crucial for low-level rewards, and deep supervision can be achieved efficiently and effectively by stopping the gradient of the denoising network input. DRTune is extensively evaluated on various reward models. It consistently outperforms other algorithms, particularly for low-level control signals, where all shallow supervision methods fail. Additionally, we fine-tune Stable Diffusion XL 1.0 (SDXL 1.0) model via DRTune to optimize Human Preference Score v2.1, resulting in the Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances image quality compared to SDXL 1.0 and reaches comparable quality compared with Midjourney v5.2. The paper presents Deep Reward Tuning (DRTune), an algorithm for efficiently and effectively optimizing text-to-image diffusion models using differentiable rewards, particularly focusing on deep supervision for low-level rewards like symmetry. Optimizing diffusion models with rewards is crucial for controlling image generation beyond traditional training datasets, but existing methods struggle with deep supervision of the iterative sampling process. DRTune employs two key strategies: 1) stopping gradients of the denoising network input to alleviate gradient explosion and accelerate convergence, and 2) training on a subset of equally spaced sampling steps to improve efficiency. DRTune consistently outperforms baselines like ReFL, DRaFT, and AlignProp on various rewards, including aesthetic score, CLIPScore, and human preference. It successfully optimizes low-level rewards like symmetry, which other methods fail to achieve due to limitations in deep supervision. Fine-tuning Stable Diffusion XL 1.0 with DRTune and HPS v2.1 results in Favorable Diffusion XL 1.0 (FDXL 1.0), exhibiting superior visual quality compared to the base model and comparable quality to Midjourney v5.2. Reward hacking is a potential issue, necessitating strategies like regularization to prevent image quality degradation while optimizing for specific metrics. The paper acknowledges the potential negative social impact of advanced generative models, including the risk of generating misleading content and perpetuating biases. diffusion models, text-to-image generation, reward learning, deep supervision, stable diffusion
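The two mechanisms highlighted in this entry, stopping the gradient at the denoiser input and supervising only a subset of equally spaced sampling steps, can be sketched as a loop over sampler steps. The update rule below is a simplified Euler-style step with a stand-in `unet` callable, not the paper's exact sampler or training recipe.

```python
import torch

def drtune_style_sample(unet, latents, sigmas, train_steps):
    """Minimal sketch of deep reward supervision with input stop-gradients.

    unet:        stand-in eps-predictor, unet(x, sigma) -> predicted noise
    latents:     initial noise latents (requires_grad through the reward later)
    sigmas:      per-step noise levels of the sampler
    train_steps: indices of the steps whose network outputs keep gradients
    """
    for i, sigma in enumerate(sigmas):
        x_in = latents.detach()            # stop gradient at the denoiser input
        eps = unet(x_in, sigma)
        if i not in train_steps:
            eps = eps.detach()             # no gradient through non-selected steps
        latents = latents - sigma * eps    # simplified update rule
    return latents                         # decode and feed to a differentiable reward
```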
2405.00676 Report Spectrally Pruned Gaussian Fields with Neural Compensation Runyi Yang, Zhenxin Zhu, Zhou Jiang, Baijun Ye, Xiaoxue Chen, Yifei Zhang, Yuantao Chen, Jian Zhao, Hao Zhao Recently, 3D Gaussian Splatting, as a novel 3D representation, has garnered attention for its fast rendering speed and high rendering quality. However, this comes with high memory consumption, e.g., a well-trained Gaussian field may utilize three million Gaussian primitives and over 700 MB of memory. We credit this high memory footprint to the lack of consideration for the relationship between primitives. In this paper, we propose a memory-efficient Gaussian field named SUNDAE with spectral pruning and neural compensation. On one hand, we construct a graph on the set of Gaussian primitives to model their relationship and design a spectral down-sampling module to prune out primitives while preserving desired signals. On the other hand, to compensate for the quality loss of pruning Gaussians, we exploit a lightweight neural network head to mix splatted features, which effectively compensates for quality losses while capturing the relationship between primitives in its weights. We demonstrate the performance of SUNDAE with extensive results. For example, SUNDAE can achieve 26.80 PSNR at 145 FPS using 104 MB memory while the vanilla Gaussian splatting algorithm achieves 25.60 PSNR at 160 FPS using 523 MB memory, on the Mip-NeRF360 dataset. Codes are publicly available at https://runyiyang.github.io/projects/SUNDAE/. This paper introduces SUNDAE, a memory-efficient 3D Gaussian Splatting method that leverages spectral pruning on a primitive graph and a neural compensation head to reduce storage requirements while maintaining rendering speed and quality. 3D Gaussian Splatting (3DGS) suffers from high memory consumption due to the independence of its primitives. SUNDAE addresses this by modeling the relationship between primitives, enabling significant storage reduction. The method constructs a graph based on Gaussian primitives and uses spectral graph pruning to remove redundant ones. A neural compensation head then mitigates the quality loss by integrating information from remaining primitives in the 2D feature domain. SUNDAE achieves competitive rendering quality with significantly lower memory footprint compared to 3DGS and other state-of-the-art methods. Spectral pruning effectively retains essential scene information by balancing high-frequency details and low-frequency background. The neural compensation module successfully mitigates the quality loss caused by pruning, demonstrating the benefits of modeling primitive relationships. Continuous pruning, explored as an alternative, shows potential for lower peak memory but less control over final memory footprint. Future work could explore more sophisticated graph construction methods and alternative neural compensation architectures. 3d gaussian splatting, graph signal processing, neural rendering, memory efficient, primitive pruning
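To make the "graph over primitives, prune while preserving signal" idea concrete, the toy sketch below builds a kNN graph over Gaussian centers and scores each primitive by how much its feature deviates from its neighborhood, a crude stand-in for a high-frequency response; it is only illustrative and is not the paper's spectral down-sampling module.

```python
import torch

def graph_prune(centers, feats, keep_ratio=0.3, k=8):
    """Toy graph-based pruning of Gaussian primitives (small N only; O(N^2) memory).

    centers: (N, 3) Gaussian means; feats: (N, F) per-primitive features.
    Keeps the primitives whose features deviate most from their kNN neighborhood.
    """
    d = torch.cdist(centers, centers)                    # pairwise distances
    knn = d.topk(k + 1, largest=False).indices[:, 1:]    # drop self-neighbor
    neigh_mean = feats[knn].mean(dim=1)                  # (N, F) neighborhood average
    score = (feats - neigh_mean).norm(dim=-1)            # deviation = "high frequency"
    n_keep = int(keep_ratio * centers.shape[0])
    return score.topk(n_keep).indices                    # indices of kept primitives
```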
2405.00672 Report TexSliders: Diffusion-Based Texture Editing in CLIP Space Julia Guerrero-Viu, Milos Hasan, Arthur Roullier, Midhun Harikumar, Yiwei Hu, Paul Guerrero, Diego Gutierrez, Belen Masia, Valentin Deschaintre Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary. Introduces TexSliders, a novel diffusion-based method for editing textures using natural language prompts, by manipulating CLIP image embeddings. Existing diffusion-based image editing methods, relying on attention maps, are not effective for textures due to the lack of distinct semantic regions in textures. Defines editing directions in CLIP space using pairs of text prompts, leverages a texture diffusion prior, and prunes irrelevant dimensions to improve identity preservation. Enables intuitive slider-based texture editing using natural language. Demonstrates superior performance compared to state-of-the-art image editing methods on textures. Generalizes to real photographs and allows combinations of multiple editing directions. Performance depends on the quality of CLIP embeddings and the diffusion model's sensitivity to specific concepts. Formal definition of texture identity in the context of diffusion models requires further investigation. texture editing, diffusion models, clip, image embedding, generative models
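A minimal way to see the "editing directions from prompt pairs" idea is to subtract CLIP embeddings of the two prompts and add a scaled direction to the conditioning embedding. The sketch below stays in CLIP text space with the Hugging Face `transformers` CLIP model; the paper's mapping into CLIP image-embedding space via a texture prior and its dimension pruning are omitted, and the slider application shown is an assumption.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # downloads weights
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_direction(src: str, dst: str) -> torch.Tensor:
    """Direction from src prompt to dst prompt in normalized CLIP text space."""
    inputs = proc(text=[src, dst], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb[1] - emb[0]

def apply_slider(cond_embedding: torch.Tensor, direction: torch.Tensor, strength: float):
    # Shift the image-conditioning embedding along the slider direction.
    return cond_embedding + strength * direction

# Example: direction = text_direction("aged wood", "new wood")
```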
2405.00630 Report Depth Priors in Removal Neural Radiance Fields Zhihao Guo, Peng Wang Neural Radiance Fields have achieved impressive results in 3D reconstruction and novel view generation. A significant challenge within NeRF involves editing reconstructed 3D scenes, such as object removal, which demands consistency across multiple views and the synthesis of high-quality perspectives. Previous studies have integrated depth priors, typically sourced from LiDAR or sparse depth estimates from COLMAP, to enhance NeRF's performance in object removal. However, these methods are either expensive or time-consuming. This paper proposes a new pipeline that leverages SpinNeRF and monocular depth estimation models like ZoeDepth to enhance NeRF's performance in complex object removal with improved efficiency. A thorough evaluation of COLMAP's dense depth reconstruction on the KITTI dataset is conducted to demonstrate that COLMAP can be viewed as a cost-effective and scalable alternative for acquiring depth ground truth compared to traditional methods like LiDAR. This serves as the basis for evaluating the performance of monocular depth estimation models to determine the best one for generating depth priors for SpinNeRF. The new pipeline is tested in various scenarios involving 3D reconstruction and object removal, and the results indicate that our pipeline significantly reduces the time required for depth prior acquisition for object removal and enhances the fidelity of the synthesized views, suggesting substantial potential for building high-fidelity digital twin systems with increased efficiency in the future. This paper presents a novel object removal pipeline for Neural Radiance Fields (NeRF) that integrates SpinNeRF with monocular depth estimation models like ZoeDepth. Enhancing NeRF's object removal capabilities is crucial for applications like robot navigation in human-robot collaborative environments, but existing methods using LiDAR or COLMAP depth priors are either costly or time-consuming. The authors evaluate COLMAP's dense depth reconstruction accuracy against KITTI datasets to establish its viability as a ground truth depth source. They then compare various monocular depth estimation models using COLMAP-generated depth on the SpinNeRF dataset, identifying ZoeDepth as the optimal choice. Finally, they integrate ZoeDepth with SpinNeRF to create the proposed pipeline. COLMAP's dense depth reconstruction exhibits high accuracy, making it a viable alternative to expensive ground truth depth acquisition methods. ZoeDepth outperforms other monocular depth estimation models on the SpinNeRF dataset, delivering high-quality depth priors while minimizing computational overhead. Integrating ZoeDepth with SpinNeRF significantly reduces depth prior acquisition time and improves the fidelity of synthesized views, particularly in object removal scenarios. The paper primarily focuses on the SpinNeRF model, potentially limiting the generalizability of findings to other NeRF architectures. Future work could explore the integration of alternative monocular depth estimation models or the development of specialized depth estimation techniques tailored for NeRF object removal. neural radiance fields, monocular depth estimation, 3d editing, 3d reconstruction, object removal
2405.00587 Report GraCo: Granularity-Controllable Interactive Segmentation Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, Jie Chen Interactive Segmentation (IS) segments specific objects or parts in the image according to user input. Current IS pipelines fall into two categories: single-granularity output and multi-granularity output. The latter aims to alleviate the spatial ambiguity present in the former. However, the multi-granularity output pipeline suffers from limited interaction flexibility and produces redundant results. In this work, we introduce Granularity-Controllable Interactive Segmentation (GraCo), a novel approach that allows precise control of prediction granularity by introducing additional parameters to input. This enhances the customization of the interactive system and eliminates redundancy while resolving ambiguity. Nevertheless, the exorbitant cost of annotating multi-granularity masks and the lack of available datasets with granularity annotations make it difficult for models to acquire the necessary guidance to control output granularity. To address this problem, we design an any-granularity mask generator that exploits the semantic property of the pre-trained IS model to automatically generate abundant mask-granularity pairs without requiring additional manual annotation. Based on these pairs, we propose a granularity-controllable learning strategy that efficiently imparts the granularity controllability to the IS model. Extensive experiments on intricate scenarios at object and part levels demonstrate that our GraCo has significant advantages over previous methods. This highlights the potential of GraCo to be a flexible annotation tool, capable of adapting to diverse segmentation scenarios. The project page: https://zhao-yian.github.io/GraCo. This paper introduces GraCo, a novel Granularity-Controllable Interactive Segmentation approach that allows users to precisely control the granularity of segmentation masks through an additional input parameter, resolving ambiguity without redundant outputs. Current interactive segmentation methods either provide single-granularity outputs, ignoring potential ambiguity in user intent, or offer multi-granularity outputs with limited scalability and redundancy. GraCo addresses these issues by enabling flexible and precise control over segmentation granularity. GraCo employs a two-stage approach: (1) an Any-Granularity mask Generator (AGG) automatically generates mask proposals of varying granularities and quantifies their granularity level, and (2) Granularity-Controllable Learning (GCL) leverages these mask-granularity pairs to fine-tune a pre-trained IS model, enabling it to understand and respond to user-specified granularity. GraCo significantly outperforms state-of-the-art single-granularity IS methods on both object and part-level benchmarks. GraCo surpasses the multi-granularity IS approach SAM on all benchmarks, except for achieving comparable performance on the SA-1B dataset. Analysis of IoU-granularity curves confirms GraCo's ability to control segmentation granularity consistently with human cognition. The randomness in interaction signals generated by AGG can lead to semantically inconsistent parts or noisy boundaries, impacting granularity controllability. The offline proposal generation in AGG creates a trade-off between storage space and granularity abundance. Exploring online fine-tuning for granularity controllability is a potential future direction. interactive segmentation, granularity control, ambiguity resolution, any-granularity mask generation, granularity-controllable learning
2405.00466 Report Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable Haozhe Liu, Wentian Zhang, Bing Li, Bernard Ghanem, Jürgen Schmidhuber Foundational generative models should be traceable to protect their owners and facilitate safety regulation. To achieve this, traditional approaches embed identifiers based on supervisory trigger-response signals, which are commonly known as backdoor watermarks. They are prone to failure when the model is fine-tuned with nontrigger data. Our experiments show that this vulnerability is due to energetic changes in only a few 'busy' layers during fine-tuning. This yields a novel arbitrary-in-arbitrary-out (AIAO) strategy that makes watermarks resilient to fine-tuning-based removal. The trigger-response pairs of AIAO samples across various neural network depths can be used to construct watermarked subpaths, employing Monte Carlo sampling to achieve stable verification results. In addition, unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths, where a mask-controlled trigger function is proposed to preserve the generation performance and ensure the invisibility of the embedded backdoor. Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO; while the verification rates of other trigger-based methods fall from ~90% to ~70% after fine-tuning, those of our method remain consistently above 90%. This paper introduces AIAO, a novel backdoor-based method for traceable ownership protection of diffusion models, designed to be robust against fine-tuning on downstream tasks. With the increasing use of fine-tuned pre-trained diffusion models, it's crucial to develop methods for tracking their usage and protecting the intellectual property of the source model. AIAO embeds backdoor identifiers in the feature space of lazy layers (layers that undergo minimal change during fine-tuning) using a mask-controlled trigger function and Monte Carlo sampling of subpaths to minimize the impact of busy layers. AIAO maintains high response and verification success rates (over 90%) even after fine-tuning, significantly outperforming existing backdoor watermarking methods. Embedding the backdoor in lazy layers significantly improves robustness against fine-tuning removal. The mask-controlled trigger function effectively generates invisible triggers in the feature space, preserving generation performance. The verification pipeline currently relies on access to feature maps, limiting its applicability to open-source or semi-open-source scenarios. Future work will focus on extending AIAO to black-box ownership protection where feature maps are inaccessible. trustworthy ai, intellectual property protection, backdoor watermark, diffusion model, fine-tuning
2405.00448 Report MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation Xujie Zhang, Ente Lin, Xiu Li, Yuxuan Luo, Michael Kampffmeyer, Xin Dong, Xiaodan Liang This paper introduces MMTryon, a multi-modal multi-reference VIrtual Try-ON (VITON) framework, which can generate high-quality compositional try-on results by taking as inputs a text instruction and multiple garment images. Our MMTryon mainly addresses two problems overlooked in prior literature: 1) Support of multiple try-on items and dressing style. Existing methods are commonly designed for single-item try-on tasks (e.g., upper/lower garments, dresses) and fall short on customizing dressing styles (e.g., zipped/unzipped, tuck-in/tuck-out, etc.) 2) Segmentation Dependency. They further heavily rely on category-specific segmentation models to identify the replacement regions, with segmentation errors directly leading to significant artifacts in the try-on results. For the first issue, our MMTryon introduces a novel multi-modality and multi-reference attention mechanism to combine the garment information from reference images and dressing-style information from text instructions. Besides, to remove the segmentation dependency, MMTryon uses a parsing-free garment encoder and leverages a novel scalable data generation pipeline to convert existing VITON datasets to a form that allows MMTryon to be trained without requiring any explicit segmentation. Extensive experiments on high-resolution benchmarks and in-the-wild test sets demonstrate MMTryon's superiority over existing SOTA methods both qualitatively and quantitatively. Besides, MMTryon's impressive performance on multi-items and style-controllable virtual try-on scenarios and its ability to try on any outfit in a large variety of scenarios from any source image, opens up a new avenue for future investigation in the fashion community. Introduces MMTryon, a multi-modal multi-reference virtual try-on framework generating high-quality compositional try-on results from text instructions and multiple garment images. Addresses limitations in existing VITON methods like single-item try-on, lack of dressing style customization, and dependence on segmentation models leading to artifacts. Leverages a multi-modality and multi-reference attention mechanism combining garment and dressing style information, employs a parsing-free garment encoder, and uses a scalable data generation pipeline to train without explicit segmentation. Outperforms SOTA methods qualitatively and quantitatively on high-resolution benchmarks and in-the-wild tests. Demonstrates superior performance in multi-item, style-controllable try-on scenarios. Offers flexibility in trying on outfits from diverse sources and scenarios. Data generation process limited by pretrained models, posing challenges for fine-grained details like cuffs and collars. Future work may focus on fine-tuning large models to construct more detailed datasets for enhanced generation. virtual try-on, viton, multi-modal learning, compositional try-on, diffusion models
2405.00313 Report Streamlining Image Editing with Layered Diffusion Brushes Peyman Gholami, Robert Xiao Denoising diffusion models have recently gained prominence as powerful tools for a variety of image generation and manipulation tasks. Building on this, we propose a novel tool for real-time editing of images that provides users with fine-grained region-targeted supervision in addition to existing prompt-based controls. Our novel editing technique, termed Layered Diffusion Brushes, leverages prompt-guided and region-targeted alteration of intermediate denoising steps, enabling precise modifications while maintaining the integrity and context of the input image. We provide an editor based on Layered Diffusion Brushes modifications, which incorporates well-known image editing concepts such as layer masks, visibility toggles, and independent manipulation of layers; regardless of their order. Our system renders a single edit on a 512x512 image within 140 ms using a high-end consumer GPU, enabling real-time feedback and rapid exploration of candidate edits. We validated our method and editing system through a user study involving both natural images (using inversion) and generated images, showcasing its usability and effectiveness compared to existing techniques such as InstructPix2Pix and Stable Diffusion Inpainting for refining images. Our approach demonstrates efficacy across a range of tasks, including object attribute adjustments, error correction, and sequential prompt-based object placement and manipulation, demonstrating its versatility and potential for enhancing creative workflows. This paper introduces Layered Diffusion Brushes, a novel real-time image editing tool for refining AI-generated images by making localized adjustments to specific regions defined by user-drawn masks. Existing AI image editing tools often lack the speed and precision for real-time, localized adjustments. This tool aims to fill this gap, providing artists and users with greater control over image manipulation. The method leverages Latent Diffusion Models (LDMs) by introducing targeted random noise patterns into the latent space during the reverse diffusion process. Users control the edits through masks, text prompts, and adjustable parameters like brush strength and the number of editing steps. A layering system allows for non-destructive, independent edits on different parts of the image. Layered Diffusion Brushes achieved significantly faster editing times compared to manual editing and other AI-based methods, enabling real-time feedback. A user study indicated that Layered Diffusion Brushes was perceived as more usable and intuitive compared to InstructPix2Pix and SD-Inpainting. The tool was found to be effective for tasks like object addition/removal, attribute modification, style mixing, and error correction, demonstrating its versatility in refining AI-generated images. Some aspects of the user interface, such as layer management and blend options, could be further improved based on user feedback. Future work could explore incorporating advanced features like semantic guidance and integration with 3D models for even greater control and realism. diffusion models, image editing, artistic control, real-time editing, user interface
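The core editing primitive described here, region-targeted perturbation of an intermediate latent followed by resumed denoising, reduces to a few lines. The sketch below uses stand-in callables and parameter names (`denoise_from_t`, `strength`), which are assumptions, not the paper's API.

```python
import torch

def brush_edit(latents_t, mask, strength, denoise_from_t):
    """Sketch of masked-noise editing at an intermediate diffusion step.

    latents_t:      latent tensor at some intermediate reverse-diffusion step
    mask:           binary brush mask broadcastable to latents_t (1 = edit region)
    strength:       how strongly to perturb inside the mask
    denoise_from_t: stand-in callable running the remaining denoising steps
    """
    noise = torch.randn_like(latents_t)
    edited = latents_t + strength * mask * noise   # perturb only inside the brush mask
    return denoise_from_t(edited)                  # resume sampling to get the edit
```

Because only the masked region is perturbed and only the remaining steps are re-run, edits stay local and fast, which is what enables the layered, real-time workflow the entry describes.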
2405.00293 Report MoPEFT: A Mixture-of-PEFTs for the Segment Anything Model Rajat Sahay, Andreas Savakis The emergence of foundation models, such as the Segment Anything Model (SAM), has sparked interest in Parameter-Efficient Fine-Tuning (PEFT) methods that tailor these large models to application domains outside their training data. However, different PEFT techniques modify the representation of a model differently, making it a non-trivial task to select the most appropriate method for the domain of interest. We propose a new framework, Mixture-of-PEFTs methods (MoPEFT), that is inspired by traditional Mixture-of-Experts (MoE) methodologies and is utilized for fine-tuning SAM. Our MoPEFT framework incorporates three different PEFT techniques as submodules and dynamically learns to activate the ones that are best suited for a given data-task setup. We test our method on the Segment Anything Model and show that MoPEFT consistently outperforms other fine-tuning methods on the MESS benchmark. This paper introduces MoPEFT, a new framework inspired by Mixture-of-Experts, which dynamically activates specific Parameter-Efficient Fine-Tuning (PEFT) techniques based on the data and task. Fine-tuning large foundation models like SAM is computationally expensive. PEFT methods offer efficiency but their effectiveness varies. MoPEFT addresses this by selectively leveraging the strengths of different PEFT techniques. MoPEFT integrates LoRA, Prefix Tuning, and Adapters as submodules. A gating mechanism learns to favor the most suitable PEFT method for a given task, dynamically switching between them. MoPEFT consistently outperforms individual PEFT methods (LoRA, Prefix Tuning, Adapters) on the MESS benchmark across multiple domains. The gating mechanism effectively learns to prefer different PEFT techniques for different datasets, demonstrating its adaptive capability. Combining multiple PEFT techniques in MoPEFT often leads to better performance than the best-performing individual technique, suggesting synergistic effects. For brevity, the paper primarily focuses on three major domains from the MESS benchmark. Further investigation is needed to fully understand the compounding effects observed when combining different PEFT methods. parameter-efficient fine-tuning, foundation models, segment anything model, mixture-of-experts, semantic segmentation
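A mixture-of-PEFTs block amounts to a learned gate that softly weights the outputs of several PEFT branches added around a frozen layer. The sketch below is a toy version: the three branches are reduced to simple placeholder modules (a real setup would plug in actual LoRA, prefix-tuning, and adapter implementations), and the gating granularity is an assumption.

```python
import torch
import torch.nn as nn

class MixturePEFTBlock(nn.Module):
    """Toy mixture-of-PEFTs block: gate-weighted sum of three PEFT-style branches."""
    def __init__(self, dim: int, r: int = 8):
        super().__init__()
        self.lora = nn.Sequential(nn.Linear(dim, r, bias=False), nn.Linear(r, dim, bias=False))
        self.adapter = nn.Sequential(nn.Linear(dim, r), nn.GELU(), nn.Linear(r, dim))
        self.prefix_like = nn.Linear(dim, dim)   # stand-in for a prefix-tuning branch
        self.gate = nn.Linear(dim, 3)            # one weight per PEFT branch

    def forward(self, x, frozen_out):
        # x: (B, N, D) tokens; frozen_out: output of the frozen base layer.
        w = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)                   # (B, 3)
        branches = torch.stack(
            [self.lora(x), self.adapter(x), self.prefix_like(x)], dim=-1)     # (B, N, D, 3)
        mix = (branches * w[:, None, None, :]).sum(dim=-1)
        return frozen_out + mix   # residual correction around the frozen layer
```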
2405.00256 Report ASAM: Boosting Segment Anything Model with Adversarial Tuning Bo Li, Haoke Xiao, Lv Tang In the evolving landscape of computer vision, foundation models have emerged as pivotal tools, exhibiting exceptional adaptability to a myriad of tasks. Among these, the Segment Anything Model (SAM) by Meta AI has distinguished itself in image segmentation. However, SAM, like its counterparts, encounters limitations in specific niche applications, prompting a quest for enhancement strategies that do not compromise its inherent capabilities. This paper introduces ASAM, a novel methodology that amplifies SAM's performance through adversarial tuning. We harness the potential of natural adversarial examples, inspired by their successful implementation in natural language processing. By utilizing a stable diffusion model, we augment a subset (1%) of the SA-1B dataset, generating adversarial instances that are more representative of natural variations rather than conventional imperceptible perturbations. Our approach maintains the photorealism of adversarial examples and ensures alignment with original mask annotations, thereby preserving the integrity of the segmentation task. The fine-tuned ASAM demonstrates significant improvements across a diverse range of segmentation tasks without necessitating additional data or architectural modifications. The results of our extensive evaluations confirm that ASAM establishes new benchmarks in segmentation tasks, thereby contributing to the advancement of foundational models in computer vision. Our project page is in https://asam2024.github.io/. Introduces ASAM, a method enhancing SAM's performance using adversarial tuning inspired by natural adversarial examples in NLP. To boost SAM's generalization ability without using extra data, changing its architecture, or hurting its zero-shot capabilities. Projects natural images onto a low-dimensional manifold via a generative model, optimizes the latent representation, and fine-tunes SAM with the generated adversarial examples. ASAM outperforms other SAM tuning methods on 14 diverse segmentation datasets. ASAM maintains high image quality comparable to original images. ASAM framework successfully enhances performance of another large vision foundation model, EfficientSAM. Lack of direct theoretical proof for the method's efficacy. Exploration of ASAM's application to other vision tasks beyond segmentation. image segmentation, foundation models, adversarial tuning, stable diffusion, segment anything model (sam)
2405.00196 Report Synthetic Image Verification in the Era of Generative AI: What Works and What Isn't There Yet Diangarti Tariang, Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, Luisa Verdoliva In this work we present an overview of approaches for the detection and attribution of synthetic images and highlight their strengths and weaknesses. We also point out and discuss hot topics in this field and outline promising directions for future research. This paper presents an overview of methods for detecting and attributing synthetic images, highlighting their strengths, weaknesses, and future research directions. The rise of generative AI, enabling easy creation of hyperrealistic synthetic images, poses significant threats to disinformation and propaganda. Automated tools for detecting and attributing such images are crucial for societal protection. The paper reviews various data-driven methods, including those leveraging CNNs, transformers, and vision-language models. It also discusses techniques exploiting forensic cues, like low-level artifacts in the frequency domain and high-level semantic inconsistencies. Diffusion model-generated images are harder to detect than those from GANs. Generalization remains a challenge, especially when there's a mismatch between training and test data. While attribution in closed-set scenarios is reliable, open-set attribution requires further research. Most research treats detection and attribution as separate problems, while a joint approach could be more effective. Calibration of detectors for real-world scenarios, where a fixed threshold may not be suitable, needs more attention. synthetic image detection, image attribution, generative ai, deep learning, digital forensics
2404.19760 Report Lightplane: Highly-Scalable Components for Neural 3D Fields Ang Cao, Justin Johnson, Andrea Vedaldi, David Novotny Contemporary 3D research, particularly in reconstruction and generation, heavily relies on 2D images for inputs or supervision. However, current designs for these 2D-3D mapping are memory-intensive, posing a significant bottleneck for existing methods and hindering new applications. In response, we propose a pair of highly scalable components for 3D neural fields: Lightplane Render and Splatter, which significantly reduce memory usage in 2D-3D mapping. These innovations enable the processing of vastly more and higher resolution images with small memory and computational costs. We demonstrate their utility in various applications, from benefiting single-scene optimization with image-level losses to realizing a versatile pipeline for dramatically scaling 3D reconstruction and generation. Code: \url{https://github.com/facebookresearch/lightplane}. This paper introduces Lightplane, a framework with two highly scalable components, Renderer and Splatter, for efficiently mapping information between 2D images and neural 3D fields using hashed 3D representations like voxel grids and triplanes. Existing methods for 2D-3D mapping in neural 3D fields are memory-intensive, limiting the use of image-level losses, the number of input views, and the scalability of 3D models. Lightplane addresses this bottleneck by significantly reducing memory usage. Lightplane leverages a hybrid 3D representation combining hashed structures (e.g., voxel grids, triplanes) and MLPs. It fuses operations along rays instead of processing individual 3D points, recomputes intermediate values during backpropagation, and leverages the GPU memory hierarchy for speed. Lightplane achieves up to four orders of magnitude reduction in memory consumption compared to autograd methods while maintaining comparable speed. It enables the use of image-level losses on high-resolution renders for single-scene optimization. Lightplane significantly boosts the scalability of 3D reconstruction and generation models, demonstrated by improvements in LRM and a novel viewset diffusion model for CO3Dv2. Current implementation shows a performance gap between different 3D hash representations (voxel grids and triplanes). Rendering and splatting a large number of images is still time-consuming, despite being comparable in speed to existing methods. neural 3d fields, 3d reconstruction, 3d generation, memory efficiency, scalability
2404.19759 Report MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, Yansong Tang This work introduces MotionLCM, extending controllable motion generation to a real-time level. Existing methods for spatial control in text-conditioned motion generation suffer from significant runtime inefficiency. To address this issue, we first propose the motion latent consistency model (MotionLCM) for motion generation, building upon the latent diffusion model (MLD). By employing one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation. To ensure effective controllability, we incorporate a motion ControlNet within the latent space of MotionLCM and enable explicit control signals (e.g., pelvis trajectory) in the vanilla motion space to control the generation process directly, similar to controlling other latent-free diffusion models for motion generation. By employing these techniques, our approach can generate human motions with text and control signals in real-time. Experimental results demonstrate the remarkable generation and controlling capabilities of MotionLCM while maintaining real-time runtime efficiency. This work introduces MotionLCM, a novel model that enables real-time controllable motion generation by combining latent consistency distillation and a motion ControlNet. Existing methods for controllable text-to-motion generation suffer from significant runtime inefficiency, hindering their applicability in real-time scenarios. MotionLCM addresses this issue by significantly accelerating motion generation without compromising quality. The methodology consists of two key components: 1) Motion Latent Consistency Distillation: A consistency model is distilled from a pre-trained motion latent diffusion model to achieve efficient one-step or few-step motion generation. 2) Controllable Motion Generation in Latent Space: A motion ControlNet is incorporated into MotionLCM to enable control over motion generation using spatial signals like pelvis trajectory. Explicit control supervision is applied in the motion space to enhance controllability. MotionLCM achieves real-time inference speed (~30ms per motion sequence), outperforming prior diffusion-based methods by a significant margin. Despite using only one-step inference, MotionLCM achieves comparable or even superior performance compared to existing state-of-the-art methods. The introduction of motion ControlNet and control supervision in the motion space allows MotionLCM to achieve high-quality controllable motion generation. While MotionLCM excels in terms of speed and quality trade-off, methods using guided diffusion still outperform it in motion control performance, suggesting room for improvement. The paper acknowledges the issue of potential physical implausibility in generated motions and limitations in handling noisy or anomalous data, leaving these as future research directions. motion generation, text-to-motion, motion control, latent consistency models, controlnet
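The distillation step behind the one-step/few-step inference described here follows the general latent consistency distillation recipe: a teacher ODE step produces an earlier-timestep latent, and the student is trained so that both points map to a consistent prediction, with an EMA copy of the student providing the target. The sketch below is that generic objective with stand-in callables; MotionLCM's exact solver, conditioning, and ControlNet branch are not reproduced.

```python
import torch
import torch.nn.functional as F

def latent_consistency_distillation_loss(student, ema_student, teacher_solver,
                                         z_t, t, t_prev, cond):
    """Generic latent consistency distillation loss (sketch).

    student, ema_student: consistency models f(z, t, cond) -> predicted clean latent
    teacher_solver:       one ODE step with the frozen teacher diffusion model
    z_t:                  noisy latent at timestep t; cond: text/control conditioning
    """
    pred = student(z_t, t, cond)
    with torch.no_grad():
        z_prev = teacher_solver(z_t, t, t_prev, cond)   # teacher moves z_t to t_prev
        target = ema_student(z_prev, t_prev, cond)      # self-consistency target
    return F.mse_loss(pred, target)
```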
2404.19758 Report Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting Paul Engstler, Andrea Vedaldi, Iro Laina, Christian Rupprecht 3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene. This paper presents a novel depth completion model for 3D scene generation and a new benchmark for evaluating the geometric quality of generated scenes. Current 3D scene generation methods often produce geometrically inconsistent scenes and rely on image-based metrics for evaluation, neglecting the underlying geometry. The authors propose a depth completion model trained via teacher distillation and self-training to learn the 3D fusion process. They also introduce a benchmark based on ground truth geometry to evaluate the depth accuracy of generated scenes. The proposed depth completion model significantly reduces geometric artifacts compared to existing methods. The new benchmark effectively uncovers geometric inconsistencies in existing scene generation approaches. The authors demonstrate their approach in a 360-degree scene generation pipeline, showcasing its ability to create immersive and geometrically consistent scenes. The training dataset for the depth completion model is limited to specific scene types. The evaluation benchmark relies on the availability of ground truth depth data. scene generation, novel view synthesis, 3d geometry, depth completion, benchmarking
2404.19753 Report DOCCI: Descriptions of Connected and Contrasting Images Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, Jason Baldridge Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details. Introduces DOCCI, a vision-language dataset with 15k images and detailed human-annotated descriptions (avg. 136 words), focusing on fine-grained details and challenging aspects for T2I models. Addresses limitations of existing datasets that lack descriptions with fine-grained detail needed for models to learn richer associations, hindering research on T2I models and their real-world applications. Images curated to include contrastive sets and test specific T2I challenges. Three-stage annotation process ensures detailed and high-quality descriptions, with rigorous quality control. DOCCI serves as an effective training resource for I2T generation, as demonstrated by improved performance of a PaLI 5B model finetuned on DOCCI. DOCCI highlights limitations of current T2I models, particularly in handling long descriptions, fine details, and challenges like spatial relationships, counting, and text rendering. Reveals discrepancies between automatic metrics (e.g., FID, CLIPScore) and human evaluation for long descriptions, emphasizing the need for better metrics. DOCCI images are sourced from a single photographer, potentially introducing bias. Lack of reliable automatic metrics for evaluating long, detailed image descriptions. vision-language, text-to-image generation, image-to-text generation, dataset, evaluation
2404.19752 Report Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption. 3) human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size. This paper introduces VisualFactChecker (VFC), a training-free pipeline for generating detailed and accurate captions for both 2D images and 3D objects. VFC addresses limitations of existing captioning methods, such as hallucination and lack of detail, by combining multiple models and a fact-checking step. Accurate and detailed image captioning is crucial for various applications, including image retrieval, accessibility, and understanding visual content. Existing methods often produce captions that are either too short or hallucinate details not present in the image. VFC aims to bridge this gap by ensuring both detail and accuracy in generated captions. VFC uses a three-step process: 1) Proposal: Multiple captioning models generate initial captions. 2) Verification: An LLM employs object detection and VQA models to verify the proposed captions, reducing hallucinations. 3) Captioning: The LLM synthesizes the verified information into a final detailed and accurate caption. For 3D objects, VFC generates captions for multiple views and combines them. VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset, as measured by CLIP-Score and human evaluation. The paper proposes a novel caption evaluation metric called CLIP-Image-Score, which compares the input image with a reconstructed image generated from the caption using a text-to-image model. This helps assess caption fidelity and detect hallucinations. The study demonstrates that combining open-source models in a pipeline with an LLM can achieve captioning performance comparable to proprietary models like GPT-4V. One limitation is the reliance on multiple models, which could increase computational cost and complexity. 
The current fact-checking process still has room for improvement, particularly in automatically determining which components to use for optimal results. image captioning, hallucination mitigation, large language models, multimodal learning, 3d object captioning
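The CLIP-Image-Score metric described in this entry is simple to reproduce in spirit: embed the original image and an image regenerated from the candidate caption with the same CLIP encoder and compare them. Below is a minimal sketch; the pipeline calls in the comments (captioner, LLM fact-checker, text-to-image model, clip_encode) are hypothetical placeholders, not the paper's actual interfaces.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def clip_image_score(original_emb: np.ndarray, reconstructed_emb: np.ndarray) -> float:
    """CLIP-Image-Score idea: image-image similarity between the original image and
    an image regenerated from the candidate caption by a text-to-image model.
    Both inputs are CLIP image embeddings (e.g., 512-d for ViT-B/32)."""
    return cosine(original_emb, reconstructed_emb)

# Hypothetical usage with stubbed components (names are placeholders):
#   candidates  = [captioner_a(img), captioner_b(img)]               # 1) proposal
#   facts       = llm_fact_check(candidates, img)                    # 2) verification via detector/VQA tools
#   caption     = llm_summarize(candidates, facts)                   # 3) captioning
#   recon       = text_to_image(caption)
#   score       = clip_image_score(clip_encode(img), clip_encode(recon))
```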
2404.19702 Report GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, Zexiang Xu We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. In contrast to previous LRMs that can only reconstruct objects, by predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large variations in scale and complexity. We show that our model can work on both object and scene captures by training it on Objaverse and RealEstate10K respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks. Our project webpage is available at: https://sai-bi.github.io/project/gs-lrm/ . This paper proposes GS-LRM, a scalable transformer-based Large Reconstruction Model (LRM) that predicts 3D Gaussian primitives from sparse posed images, enabling fast and high-quality 3D reconstruction for both objects and scenes. Existing LRMs rely on triplane NeRF representation, which suffers from limitations in resolution, rendering speed, and scalability to large scenes. GS-LRM overcomes these limitations by directly predicting per-pixel Gaussians, leading to improved quality, speed, and scalability. GS-LRM uses a simple transformer architecture: input posed images are patchified, processed by transformer blocks, and decoded into per-pixel Gaussian parameters. It is trained on Objaverse and RealEstate10K datasets for object and scene reconstruction, respectively. GS-LRM achieves state-of-the-art reconstruction quality, outperforming previous methods by a large margin (4dB PSNR improvement for objects, 2.2dB for scenes). The model is fast, reconstructing a scene in ~0.23 seconds on a single A100 GPU. GS-LRM demonstrates strong performance in downstream 3D generation tasks, such as text-to-3D and image-to-3D. The current model has a limited working resolution of 512x904 and requires known camera parameters. Future work will focus on increasing the resolution, handling unknown camera poses, and improving the reconstruction of unseen regions. large reconstruction models, 3d reconstruction, gaussian splatting, transformers, sparse-view reconstruction
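A minimal sketch of the architecture described in this entry: patchify posed multi-view images, run the concatenated tokens through transformer blocks, and decode per-pixel Gaussian parameters. The channel counts, patch size, and the 12-value Gaussian parameterization below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TinyGSLRM(nn.Module):
    # Patchify posed images, run a transformer over the concatenated multi-view
    # tokens, and decode per-pixel Gaussian parameters. Sizes are illustrative:
    # img_ch = rgb + a ray encoding, gauss_ch = rgb, scale, rotation, opacity, depth.
    def __init__(self, img_ch=9, patch=8, dim=256, depth=4, gauss_ch=12):
        super().__init__()
        self.patch, self.gauss_ch = patch, gauss_ch
        self.embed = nn.Conv2d(img_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # one patch token -> patch*patch pixels, each with gauss_ch parameters
        self.decode = nn.Linear(dim, patch * patch * gauss_ch)

    def forward(self, views):                     # views: (B, V, C, H, W)
        B, V, C, H, W = views.shape
        tok = self.embed(views.flatten(0, 1))     # (B*V, dim, H/p, W/p)
        hp, wp = tok.shape[-2:]
        tok = tok.flatten(2).transpose(1, 2)      # (B*V, N, dim)
        tok = tok.reshape(B, V * hp * wp, -1)     # concatenate multi-view tokens
        tok = self.blocks(tok)
        g = self.decode(tok)                      # (B, V*N, p*p*gauss_ch)
        return g.reshape(B, V, hp, wp, self.patch, self.patch, self.gauss_ch)

# g = TinyGSLRM()(torch.randn(1, 4, 9, 64, 64))  # 4 posed views -> per-pixel Gaussians
```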
2404.19696 Report Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners Chun Feng, Joy Hsu, Weiyu Liu, Jiajun Wu 3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting that learns from only 3D scene and QA pairs, where prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: the first is that language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; the second is that we can query large language models to distill such constraints from language properties. We show that LARC improves performance of prior works in naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities-from zero-shot composition, to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision. Proposes Language-Regularized Concept Learner (LARC), a neuro-symbolic model that uses language constraints as regularization for 3D visual grounding in naturally supervised settings. Addresses the limitations of current 3D visual grounding models that rely on dense supervision (e.g., object labels) which is expensive and difficult to obtain. LARC leverages LLMs to distill language constraints (symmetry, exclusivity, synonymity) and applies these constraints as regularization losses and data augmentation during training. LARC significantly outperforms prior neuro-symbolic methods and achieves comparable performance to end-to-end methods in naturally supervised 3D referring expression comprehension. LARC demonstrates strong zero-shot generalization to unseen concepts via language composition rules. LARC exhibits superior data efficiency and transferability to new datasets compared to previous approaches. Reliance on object detectors like VoteNet introduces noise in bounding box predictions. Exploiting a wider range of language priors beyond the three explored could further enhance performance. 3d visual grounding, neuro-symbolic learning, natural language supervision, language constraints, referring expression comprehension
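As an illustration of how an LLM-distilled language constraint could act as a regularizer in this setting, the sketch below penalizes asymmetric scores for a relation labeled symmetric (e.g., "next to") and co-activation of mutually exclusive relations. The exact loss forms and weights used by LARC are not specified in this entry, so these are assumptions.

```python
import torch

def symmetry_regularizer(rel_scores: torch.Tensor) -> torch.Tensor:
    """rel_scores[i, j] = predicted score that 'object i is <relation> object j'.
    For a relation an LLM labels as symmetric (e.g., 'next to'), penalize
    |score(i, j) - score(j, i)| so the learned concept respects the constraint."""
    return (rel_scores - rel_scores.transpose(-1, -2)).abs().mean()

def exclusivity_regularizer(scores_a: torch.Tensor, scores_b: torch.Tensor) -> torch.Tensor:
    """For mutually exclusive relations (e.g., 'left of' vs. 'right of'),
    discourage both being simultaneously high for the same object pair."""
    return (torch.sigmoid(scores_a) * torch.sigmoid(scores_b)).mean()

# Hypothetical total loss (weights are placeholders):
# loss = grounding_loss + 0.1 * symmetry_regularizer(next_to) + 0.1 * exclusivity_regularizer(left, right)
```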
2404.19567 Report Causal Perception Inspired Representation Learning for Trustworthy Image Quality Assessment Lei Wang, Desen Yuan Despite great success in modeling visual perception, deep neural network based image quality assessment (IQA) still remains unreliable in real-world applications due to its vulnerability to adversarial perturbations and the inexplicit black-box structure. In this paper, we propose to build a trustworthy IQA model via Causal Perception inspired Representation Learning (CPRL), and a score reflection attack method for IQA model. More specifically, we assume that each image is composed of Causal Perception Representation (CPR) and non-causal perception representation (N-CPR). CPR serves as the causation of the subjective quality label, which is invariant to the imperceptible adversarial perturbations. Inversely, N-CPR presents spurious associations with the subjective quality label, which may significantly change with the adversarial perturbations. To extract the CPR from each input image, we develop a soft ranking based channel-wise activation function to mediate the causally sufficient (beneficial for high prediction accuracy) and necessary (beneficial for high robustness) deep features, and based on intervention employ minimax game to optimize. Experiments on four benchmark databases show that the proposed CPRL method outperforms many state-of-the-art adversarial defense methods and provides explicit model interpretation. This paper proposes Causal Perception inspired Representation Learning (CPRL) to enhance the trustworthiness and adversarial robustness of image quality assessment (IQA) models. Existing deep learning-based IQA models are vulnerable to adversarial perturbations, highlighting their unreliability in real-world applications. This work addresses this limitation by focusing on the causal relationship between image features and perceived quality. The proposed CPRL method introduces a novel channel-wise activation function within a causal framework. This function, based on soft ranking and a minimax game training strategy, aims to extract causal perception representations (CPR) from images while mitigating the influence of non-causal features. CPRL significantly improves the robustness of IQA models against adversarial attacks like FGSM and PGD, as demonstrated by higher SRCC and PLCC values compared to existing methods. The learned representations exhibit greater stability in channel activations for adversarial examples, indicating the effectiveness of CPRL in capturing causal features. CPRL also achieves competitive performance on clean images, suggesting its capability to improve both robustness and accuracy in IQA. The training process of CPRL requires additional optimization steps, leading to higher computational overhead compared to conventional IQA models. The intervention method based on prediction might not be perfectly accurate and has room for further improvement in future work. image quality assessment, adversarial robustness, causal inference, representation learning, trustworthy ai
2404.19525 Report MicroDreamer: Zero-shot 3D Generation in $\sim$20 Seconds by Score-based Iterative Reconstruction Luxi Chen, Zhengyi Wang, Chongxuan Li, Tingting Gao, Hang Su, Jun Zhu Optimization-based approaches, such as score distillation sampling (SDS), show promise in zero-shot 3D generation but suffer from low efficiency, primarily due to the high number of function evaluations (NFEs) required for each sample. In this paper, we introduce score-based iterative reconstruction (SIR), an efficient and general algorithm for 3D generation with a multi-view score-based diffusion model. Given the images produced by the diffusion model, SIR reduces NFEs by repeatedly optimizing 3D parameters, unlike the single optimization in SDS, mimicking the 3D reconstruction process. With other improvements including optimization in the pixel space, we present an efficient approach called MicroDreamer that generally applies to various 3D representations and 3D generation tasks. In particular, retaining a comparable performance, MicroDreamer is 5-20 times faster than SDS in generating neural radiance field and takes about 20 seconds to generate meshes from 3D Gaussian splitting on a single A100 GPU, halving the time of the fastest zero-shot baseline, DreamGaussian. Our code is available at https://github.com/ML-GSAI/MicroDreamer. This paper proposes score-based iterative reconstruction (SIR), an efficient and general algorithm for zero-shot 3D generation using multi-view diffusion models. Existing optimization-based 3D generation methods, while promising, suffer from low efficiency due to high function evaluation counts and optimization within the latent space. SIR mimics the 3D reconstruction process by repeatedly optimizing 3D parameters given diffusion model outputs, reducing function evaluations. It also enables optimization directly in pixel space for further efficiency gains. SIR achieves a 5-20 times speedup for NeRF generation compared to score distillation sampling. The proposed MicroDreamer system generates high-quality meshes from 3D Gaussian splatting in about 20 seconds. MicroDreamer matches the speed of feed-forward methods while remaining zero-shot, achieving competitive generation quality. The quality of generated objects is limited by the quality of the multi-view diffusion model outputs. Further efficiency improvements may be possible with alternative sampling models or consistency models. 3d generation, diffusion model, zero-shot learning, score distillation sampling, multi-view diffusion
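A schematic sketch of the score-based iterative reconstruction loop described in this entry: refresh multi-view targets with a (stubbed) multi-view diffusion refiner every few rounds, then spend many cheap optimization steps fitting the 3D parameters to those targets in pixel space. The refiner, renderer, loss, and step counts are placeholders, not MicroDreamer's actual settings.

```python
import torch

def sir_optimize(params_3d, render, refine_views, cameras,
                 outer_rounds=10, inner_steps=15, lr=1e-2):
    """Score-based Iterative Reconstruction (schematic).
    params_3d: list of trainable tensors (requires_grad=True), e.g. Gaussian attributes.
    render(params_3d, cameras) -> images; refine_views(images) -> images after a few
    multi-view diffusion steps. Both are user-supplied stubs here."""
    opt = torch.optim.Adam(params_3d, lr=lr)
    for _ in range(outer_rounds):
        with torch.no_grad():
            targets = refine_views(render(params_3d, cameras))   # few NFEs per round
        for _ in range(inner_steps):                             # many cheap 3D updates
            loss = (render(params_3d, cameras) - targets).abs().mean()  # pixel-space loss
            opt.zero_grad(); loss.backward(); opt.step()
    return params_3d
```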
2404.19475 Report TwinDiffusion: Enhancing Coherence and Efficiency in Panoramic Image Generation with Diffusion Models Teng Zhou, Yongchuan Tang Diffusion models have emerged as effective tools for generating diverse and high-quality content. However, their capability in high-resolution image generation, particularly for panoramic images, still faces challenges such as visible seams and incoherent transitions. In this paper, we propose TwinDiffusion, an optimized framework designed to address these challenges through two key innovations: Crop Fusion for quality enhancement and Cross Sampling for efficiency optimization. We introduce a training-free optimizing stage to refine the similarity of the adjacent image areas, as well as an interleaving sampling strategy to yield dynamic patches during the cropping process. A comprehensive evaluation is conducted to compare TwinDiffusion with the existing methods, considering factors including coherence, fidelity, compatibility, and efficiency. The results demonstrate the superior performance of our approach in generating seamless and coherent panoramas, setting a new standard in quality and efficiency for panoramic image generation. The paper proposes TwinDiffusion, an optimized framework for generating high-resolution panoramic images with diffusion models, enhancing coherence and efficiency. Existing methods struggle to generate seamless and coherent panoramic images, often exhibiting visible seams and incoherent transitions, especially in high-resolution. TwinDiffusion introduces two key innovations: (1) Crop Fusion: a training-free optimization stage to refine the similarity of adjacent image areas, ensuring smoother transitions. (2) Cross Sampling: an interleaving sampling strategy using dynamic strides during cropping, maintaining quality while improving efficiency. TwinDiffusion generates significantly more coherent panoramic images with fewer visible seams compared to baselines. Quantitative evaluation shows superior performance across various metrics, including LPIPS, DISTS, FID, IS, CLIP, and CLIP-aesthetic, without compromising efficiency. The paper analyzes the impact of key factors like optimization timestep, adjacent control, view stride, and cross stride on the quality-efficiency trade-off. The method might struggle to maintain spatial logic in the overall panorama layout while focusing on local coherence. Future work includes extending the framework to video synthesis and virtual reality applications. panorama generation, diffusion models, image coherence, efficient sampling, high-resolution
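The Crop Fusion idea, refining adjacent crops so their overlapping regions agree, can be sketched as a small training-free optimization on the latent crops at selected denoising timesteps. The overlap width, loss, and step count below are assumptions rather than the paper's configuration.

```python
import torch

def crop_fusion(left: torch.Tensor, right: torch.Tensor, overlap: int,
                steps: int = 10, lr: float = 0.05):
    """left, right: (C, H, W) latent crops of adjacent panorama windows.
    Refine them so left[..., -overlap:] matches right[..., :overlap] before
    the crops are merged back into the panorama latent."""
    left = left.clone().requires_grad_(True)
    right = right.clone().requires_grad_(True)
    opt = torch.optim.Adam([left, right], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(left[..., -overlap:], right[..., :overlap])
        opt.zero_grad(); loss.backward(); opt.step()
    return left.detach(), right.detach()

# l, r = crop_fusion(torch.randn(4, 64, 64), torch.randn(4, 64, 64), overlap=16)
```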
2404.19417 Report Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World Wen Yin, Jian Lou, Pan Zhou, Yulai Xie, Dan Feng, Yuhua Sun, Tailai Zhang, Lichao Sun Backdoor attacks have been well-studied in visible light object detection (VLOD) in recent years. However, VLOD can not effectively work in dark and temperature-sensitive scenarios. Instead, thermal infrared object detection (TIOD) is the most accessible and practical in such environments. In this paper, our team is the first to investigate the security vulnerabilities associated with TIOD in the context of backdoor attacks, spanning both the digital and physical realms. We introduce two novel types of backdoor attacks on TIOD, each offering unique capabilities: Object-affecting Attack and Range-affecting Attack. We conduct a comprehensive analysis of key factors influencing trigger design, which include temperature, size, material, and concealment. These factors, especially temperature, significantly impact the efficacy of backdoor attacks on TIOD. A thorough understanding of these factors will serve as a foundation for designing physical triggers and temperature controlling experiments. Our study includes extensive experiments conducted in both digital and physical environments. In the digital realm, we evaluate our approach using benchmark datasets for TIOD, achieving an Attack Success Rate (ASR) of up to 98.21%. In the physical realm, we test our approach in two real-world settings: a traffic intersection and a parking lot, using a thermal infrared camera. Here, we attain an ASR of up to 98.38%. This paper presents the first study on backdoor attacks against Thermal Infrared Object Detection (TIOD), highlighting vulnerabilities in both digital and physical environments. TIOD is increasingly critical in various applications, including security monitoring and autonomous driving, making its security crucial. The authors propose two novel backdoor attacks: Object-affecting Attack (OAA) and Range-affecting Attack (RAA), both leveraging temperature manipulation in trigger design. Digital experiments demonstrate up to 98.21% attack success rate (ASR) across different parameters. Physical world tests in traffic intersection and parking lot scenarios achieve up to 98.38% ASR. Evaluations of potential countermeasures (pruning, fine-pruning, Neural Cleanse) show limited effectiveness. The study primarily focuses on attacking cars, future work could explore vulnerabilities in other object classes. Further investigation into more robust defense mechanisms specifically designed for TIOD backdoor attacks is needed. backdoor attacks, thermal infrared object detection, security vulnerability, temperature modulated triggering, physical world attacks
2404.19227 Report Espresso: Robust Concept Filtering in Text-to-Image Models Anudeep Das, Vasisht Duddu, Rui Zhang, N. Asokan Diffusion-based text-to-image (T2I) models generate high-fidelity images for given textual prompts. They are trained on large datasets scraped from the Internet, potentially containing unacceptable concepts (e.g., copyright infringing or unsafe). Retraining T2I models after filtering out unacceptable concepts in the training data is inefficient and degrades utility. Hence, there is a need for concept removal techniques (CRTs) which are effective in removing unacceptable concepts, utility-preserving on acceptable concepts, and robust against evasion with adversarial prompts. None of the prior filtering and fine-tuning CRTs satisfy all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by projecting the generated image's embedding onto the vector connecting unacceptable and acceptable concepts in the joint text-image embedding space. This ensures robustness by restricting the adversary to adding noise only along this vector, in the direction of the acceptable concept. Further fine-tuning Espresso to separate embeddings of acceptable and unacceptable concepts, while preserving their pairing with image embeddings, ensures both effectiveness and utility. We evaluate Espresso on eleven concepts to show that it is effective (~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93% normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on adversarial prompts for unacceptable concepts). Finally, we present theoretical bounds for the certified robustness of Espresso against adversarial prompts, and an empirical analysis. Espresso is a robust content filter for text-to-image (T2I) models, which leverages CLIP embeddings of both unacceptable and acceptable concepts to identify and remove undesirable content from generated images. T2I models, trained on vast unfiltered internet data, often memorize and generate images containing unacceptable concepts (e.g., copyright infringement, inappropriate content). Existing concept removal techniques are either ineffective, negatively impact utility, or lack robustness against adversarial prompts. Espresso utilizes a CLIP-based classifier that projects the image embedding onto the vector connecting text embeddings of acceptable and unacceptable concepts. This restricts adversaries to manipulating prompts only along this vector. Further fine-tuning enhances effectiveness and utility by maximizing separation between text embeddings while preserving their pairing with image embeddings. Espresso achieves high effectiveness with low CLIP accuracy (~5%) on unacceptable concepts. It generally preserves utility with high normalized CLIP score (~93%) on acceptable concepts. It demonstrates robustness against various attacks with low CLIP accuracy (~4%) on adversarial prompts. The certified robustness bound, while providing some guarantees, is loose and can be improved. Exploring the design of new attacks specifically targeting Espresso and utilizing adversarial training to further enhance its robustness is crucial. text-to-image, concept removal, robustness, clip, adversarial prompts
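The filtering rule described in this entry reduces to a projection test in CLIP space: project the generated image's embedding onto the vector connecting the acceptable and unacceptable concept embeddings and flag images that land toward the unacceptable end. A minimal sketch, with the normalization and threshold as assumptions:

```python
import numpy as np

def espresso_style_filter(img_emb: np.ndarray,
                          unacceptable_emb: np.ndarray,
                          acceptable_emb: np.ndarray,
                          threshold: float = 0.5) -> bool:
    """Return True if the image should be filtered (it lies closer to the
    unacceptable concept). Inputs are L2-normalized CLIP embeddings in the
    joint text-image space."""
    direction = unacceptable_emb - acceptable_emb          # vector connecting the two concepts
    direction /= np.linalg.norm(direction) + 1e-8
    lo = float(acceptable_emb @ direction)                 # projection of the acceptable endpoint
    hi = float(unacceptable_emb @ direction)               # projection of the unacceptable endpoint
    t = (float(img_emb @ direction) - lo) / (hi - lo + 1e-8)  # position in [0, 1] between endpoints
    return t > threshold
```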
2404.19204 Report NeRF-Insert: 3D Local Editing with Multimodal Control Signals Benet Oriol Sabat, Alessandro Achille, Matthew Trager, Stefano Soatto We propose NeRF-Insert, a NeRF editing framework that allows users to make high-quality local edits with a flexible level of control. Unlike previous work that relied on image-to-image models, we cast scene editing as an in-painting problem, which encourages the global structure of the scene to be preserved. Moreover, while most existing methods use only textual prompts to condition edits, our framework accepts a combination of inputs of different modalities as reference. More precisely, a user may provide a combination of textual and visual inputs including images, CAD models, and binary image masks for specifying a 3D region. We use generic image generation models to in-paint the scene from multiple viewpoints, and lift the local edits to a 3D-consistent NeRF edit. Compared to previous methods, our results show better visual quality and also maintain stronger consistency with the original NeRF. Presents NeRF-Insert, a framework for making local edits to NeRFs with flexible control using textual prompts, reference images, and 3D region specification (via masks or CAD models). Addresses limitations of existing NeRF editing methods that struggle with local edits, often impacting the global scene structure and offering limited control over the editing process. Utilizes a visual hull for 3D region definition, employs text-guided or image-guided inpainting (Stable Diffusion, PaintByExample), and introduces a novel loss term to constrain edits within the specified region. Enables high-quality local edits with various control levels, including object insertion and scene modification. Demonstrates superior performance compared to previous methods (e.g., Instruct-NeRF2NeRF) in terms of edit quality and local consistency. Shows that image-guided inpainting often surpasses text-guided inpainting for complex prompts. Suffers from artifacts similar to early SDS-based text-to-3D models (e.g., noise, inconsistency). Manual mask drawing can be challenging without a dedicated interface, and mesh/CAD models may not always be available. 3d editing, nerf, inpainting, diffusion models, visual hull
2404.19149 Report SAGS: Structure-Aware 3D Gaussian Splatting Evangelos Ververas, Rolandos Alexandros Potamias, Jifei Song, Jiankang Deng, Stefanos Zafeiriou Following the advent of NeRFs, 3D Gaussian Splatting (3D-GS) has paved the way to real-time neural rendering overcoming the computational burden of volumetric methods. Following the pioneering work of 3D-GS, several methods have attempted to achieve compressible and high-fidelity performance alternatives. However, by employing a geometry-agnostic optimization scheme, these methods neglect the inherent 3D structure of the scene, thereby restricting the expressivity and the quality of the representation, resulting in various floating points and artifacts. In this work, we propose a structure-aware Gaussian Splatting method (SAGS) that implicitly encodes the geometry of the scene, which reflects to state-of-the-art rendering performance and reduced storage requirements on benchmark novel-view synthesis datasets. SAGS is founded on a local-global graph representation that facilitates the learning of complex scenes and enforces meaningful point displacements that preserve the scene's geometry. Additionally, we introduce a lightweight version of SAGS, using a simple yet effective mid-point interpolation scheme, which showcases a compact representation of the scene with up to 24$\times$ size reduction without the reliance on any compression strategies. Extensive experiments across multiple benchmark datasets demonstrate the superiority of SAGS compared to state-of-the-art 3D-GS methods under both rendering quality and model size. Besides, we demonstrate that our structure-aware method can effectively mitigate floating artifacts and irregular distortions of previous methods while obtaining precise depth maps. Project page https://eververas.github.io/SAGS/. This paper introduces SAGS, a structure-aware 3D Gaussian Splatting method for novel view synthesis, that leverages local and global structural information of the scene to improve rendering quality and reduce storage requirements. Current 3D Gaussian Splatting methods optimize Gaussian attributes independently, neglecting inherent 3D structure, leading to reduced quality and increased storage requirements. SAGS addresses this by incorporating structural inductive biases. SAGS utilizes a curvature-aware densification step to augment the point cloud, followed by a structure-aware encoder based on graph neural networks to learn local-global features for each point. These features are then decoded into Gaussian attributes, including point displacements, ensuring structure preservation during optimization. SAGS outperforms state-of-the-art 3D-GS methods in terms of rendering quality on benchmark datasets. SAGS effectively mitigates floating artifacts and preserves scene geometry, resulting in more accurate depth maps. SAGS significantly reduces storage requirements (up to 24x with SAGS-Lite) without sacrificing rendering speed. SAGS-Lite, while compact, may lack some sharp details compared to the full SAGS model. Further exploration of alternative graph neural network architectures or point cloud processing techniques could further enhance performance. novel view synthesis, 3d gaussian splatting, graph neural networks, structure-aware, point cloud processing
2404.19110 Report EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, Maja Pantic Head avatars animated by visual signals have gained popularity, particularly in cross-driving synthesis where the driver differs from the animated character, a challenging but highly practical approach. The recently presented MegaPortraits model has demonstrated state-of-the-art results in this domain. We conduct a deep examination and evaluation of this model, with a particular focus on its latent space for facial expression descriptors, and uncover several limitations with its ability to express intense face motions. To address these limitations, we propose substantial changes in both training pipeline and model architecture, to introduce our EMOPortraits model, where we: Enhance the model's capability to faithfully support intense, asymmetric face expressions, setting a new state-of-the-art result in the emotion transfer task, surpassing previous methods in both metrics and quality. Incorporate speech-driven mode to our model, achieving top-tier performance in audio-driven facial animation, making it possible to drive source identity through diverse modalities, including visual signal, audio, or a blend of both. We propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions, filling the gap with absence of such data in existing datasets. This paper introduces EMOPortraits, an enhanced one-shot head avatar model capable of transferring intense facial expressions, and incorporating speech-driven animation. Accurately transferring intense and asymmetric facial expressions, especially in cross-driving synthesis, remains challenging. Additionally, few methods excel in high-quality talking heads with natural head movements and multimodal input options. This work builds on the MegaPortraits model, enhancing its expression transfer through analysis and improvement of latent expression space. This includes reducing its dimensionality, introducing novel self-supervised losses (canonical volume loss and source-driver mismatch loss), and using a new multi-view video dataset (FEED) featuring intense and asymmetric expressions. For speech-driven animation, the authors disentangle expression and head pose in the latent space and introduce a novel PCA mouth loss to enhance lip synchronization. EMOPortraits achieves state-of-the-art results in cross-driving emotion translation, outperforming existing models in user preference and FID scores. The proposed speech-driven mode demonstrates top-tier performance in audio-driven animation, comparable to leading methods in realism and facial dynamics. The authors introduce FEED, a novel multi-view video dataset capturing a wide range of intense and asymmetric facial expressions, addressing the limitations of existing datasets. The model currently does not generate the avatar's body or shoulders. There are occasional struggles with accurate expression translation, especially with extensive head rotations. one-shot head avatars, emotion transfer, speech-driven animation, facial expression dataset, cross-driving synthesis
2404.18929 Report DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing Minghao Chen, Iro Laina, Andrea Vedaldi We consider the problem of editing 3D objects and scenes based on open-ended language instructions. The established paradigm to solve this problem is to use a 2D image generator or editor to guide the 3D editing process. However, this is often slow as it requires do update a computationally expensive 3D representations such as a neural radiance field, and to do so by using contradictory guidance from a 2D model which is inherently not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two ways. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. We do so by utilizing a training-free approach which integrates cues from the underlying 3D geometry of the scene. Second, given a multi-view consistent edited sequence of images of the object, we directly and efficiently optimize the 3D object representation, which is based on 3D Gaussian Splatting. Because it does not require to apply edits incrementally and iteratively, DGE is significantly more efficient than existing approaches, and comes with other perks such as allowing selective editing of parts of the scene. Introduces Direct Gaussian Editor (DGE), a method for fast and efficient text-guided 3D object and scene editing using multi-view consistent image editing and direct optimization of 3D Gaussian Splatting representations. Existing methods relying on 2D image generators or editors for 3D editing are slow due to iterative updates and struggle with multi-view consistency. 1. Modifies a 2D image editor (InstructPix2Pix) to be multi-view consistent using spatio-temporal attention and epipolar constraints. 2. Directly optimizes a 3D Gaussian Splatting representation based on the multi-view consistent edited images. Significantly faster than previous iterative methods (approximately 4 minutes for a single edit). Achieves higher fidelity edits due to multi-view consistent editing in the image space. Allows for selective editing of specific regions within the 3D scene. Limited ability to handle substantial geometric transformations due to reliance on the underlying image editor's capabilities. Performance can be affected by the quality and consistency of the initial 3D Gaussian Splatting reconstruction. 3d object editing, text-guided editing, gaussian splatting, multi-view consistency, diffusion models
2404.18928 Report Stylus: Automatic Adapter Selection for Diffusion Models Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters-most of which are highly customized with insufficient descriptions. This paper explores the problem of matching the prompt to a set of relevant adapters, built on recent work that highlight the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then further assembles adapters based on prompts' keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and is twice as preferred, with humans and multimodal models as evaluators, over the base model. See stylus-diffusion.github.io for more. This paper introduces Stylus, a novel algorithm that automatically selects and composes adapters for diffusion models to enhance image generation quality, guided by user prompts. Fine-tuned adapters offer a cost-effective way to customize image generation, but manually selecting from the growing number of adapters is challenging. Stylus automates this process, enabling users to easily leverage the power of adapter composition for high-fidelity and diverse images. Stylus employs a three-stage approach: 1) Refiner: Generates textual descriptions and embeddings for adapters using a vision-language model and text encoder. 2) Retriever: Fetches relevant adapters by comparing embeddings with the user prompt. 3) Composer: Segments the prompt into tasks and assigns adapters to each, leveraging a long-context LLM and a binary masking scheme for diversity. Stylus improves image quality and textual alignment, achieving better CLIP and FID scores compared to base Stable Diffusion and other retrieval methods. Human evaluations demonstrate a strong preference (2:1) for images generated with Stylus over those from baseline checkpoints. By using a combination of masking and LLM temperature, Stylus generates highly diverse sets of images from a single prompt. The composer component, while efficient, can sometimes misinterpret prompts or select low-quality adapters, leading to errors in image generation. While Stylus improves diversity across prompts, it doesn't completely solve the issue of reduced diversity within a specific task when using an adapter. image generation, diffusion models, adapter selection, retrieval-augmented generation, vision-language models
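The retriever stage is essentially embedding-based nearest-neighbor search over the pre-computed adapter descriptions. A minimal sketch follows; the embedding function and top-k value are assumptions, and the composer step is only indicated in a comment.

```python
import numpy as np

def retrieve_adapters(prompt_emb: np.ndarray, adapter_embs: np.ndarray,
                      adapter_names: list, k: int = 5) -> list:
    """prompt_emb: (d,); adapter_embs: (N, d) precomputed description embeddings.
    Returns the names of the k adapters most similar to the prompt."""
    p = prompt_emb / (np.linalg.norm(prompt_emb) + 1e-8)
    a = adapter_embs / (np.linalg.norm(adapter_embs, axis=1, keepdims=True) + 1e-8)
    scores = a @ p                                  # cosine similarity to every adapter
    top = np.argsort(-scores)[:k]
    return [adapter_names[i] for i in top]

# names = retrieve_adapters(embed("wizard in a neon city"), db_embs, db_names, k=3)
# A composer LLM would then assign the retrieved adapters to individual prompt keywords.
```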
2404.18861 Report A Survey on Vision Mamba: Models, Applications and Challenges Rui Xu, Shu Yang, Yihui Wang, Bo Du, Hao Chen Mamba, a recent selective structured state space model, performs excellently on long sequence modeling tasks. Mamba mitigates the modeling constraints of convolutional neural networks and offers advanced modeling capabilities similar to those of Transformers, through global receptive fields and dynamic weighting. Crucially, it achieves this without incurring the quadratic computational complexity typically associated with Transformers. Due to its advantages over the former two mainstream foundation models, Mamba exhibits great potential to be a visual foundation model. Researchers are actively applying Mamba to various computer vision tasks, leading to numerous emerging works. To help keep pace with the rapid advancements in computer vision, this paper aims to provide a comprehensive review of visual Mamba approaches. This paper begins by delineating the formulation of the original Mamba model. Subsequently, our review of visual Mamba delves into several representative backbone networks to elucidate the core insights of the visual Mamba. We then categorize related works using different modalities, including image, video, point cloud, multi-modal, and others. Specifically, for image applications, we further organize them into distinct tasks to facilitate a more structured discussion. Finally, we discuss the challenges and future research directions for visual Mamba, providing insights for future research in this quickly evolving area. A comprehensive list of visual Mamba models reviewed in this work is available at https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models. This paper presents a comprehensive survey of Vision Mamba models, examining their applications and challenges in computer vision. Mamba, a novel selective structured state space model, shows significant promise as a foundation model for computer vision tasks due to its linear scalability and strong modeling capabilities, rivaling Transformers. The paper provides an in-depth explanation of the Mamba model and reviews various visual Mamba adaptations, categorizing them by their backbone architecture, scanning techniques, and applications across different visual data modalities, including images, videos, multi-modal data, and point clouds. Visual Mamba models achieve competitive results on benchmarks for image classification, object detection, instance segmentation, and semantic segmentation. They have been successfully applied to various image-level tasks, including generation, restoration, and medical image analysis. Mamba's efficiency and capacity for long-range modeling prove beneficial in video and multi-modal tasks, such as action recognition, video object segmentation, and visual question answering. Challenges remain in addressing the inherent causality assumptions of Mamba for non-causal visual data and in scaling its performance to large datasets and networks. Future research directions include developing more efficient scanning techniques, fusion strategies with other model architectures like CNNs, and improving computational efficiency for real-world applications. mamba, state space model, computer vision, vision transformer, sequence modeling
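For reference, the core selective state-space recurrence that Mamba-style visual backbones compute per channel is h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C_t * h_t, with input-dependent delta, B, and C. The sketch below is the naive sequential form with a simplified Euler discretization of B_bar; real implementations use a hardware-aware parallel scan, so this is illustrative only.

```python
import torch

def selective_scan(x, delta, A, B, C):
    """Naive per-step selective SSM recurrence.
    x, delta: (L, D); A: (D, N); B, C: (L, N). Returns y: (L, D)."""
    L, D = x.shape
    N = A.shape[1]
    h = torch.zeros(D, N)
    ys = []
    for t in range(L):
        A_bar = torch.exp(delta[t].unsqueeze(-1) * A)        # (D, N) discretized state matrix
        B_bar = delta[t].unsqueeze(-1) * B[t].unsqueeze(0)    # (D, N) simplified discretized input
        h = A_bar * h + B_bar * x[t].unsqueeze(-1)            # state update
        ys.append(h @ C[t])                                   # (D,) readout
    return torch.stack(ys)

# y = selective_scan(torch.randn(16, 8), torch.rand(16, 8), -torch.rand(8, 4),
#                    torch.randn(16, 4), torch.randn(16, 4))
```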
2404.18669 Report Bootstrap 3D Reconstructed Scenes from 3D Gaussian Splatting Yifei Gao, Jie Ou, Lei Wang, Jun Cheng Recent developments in neural rendering techniques have greatly enhanced the rendering of photo-realistic 3D scenes across both academic and commercial fields. The latest method, known as 3D Gaussian Splatting (3D-GS), has set new benchmarks for rendering quality and speed. Nevertheless, the limitations of 3D-GS become pronounced in synthesizing new viewpoints, especially for views that greatly deviate from those seen during training. Additionally, issues such as dilation and aliasing arise when zooming in or out. These challenges can all be traced back to a single underlying issue: insufficient sampling. In our paper, we present a bootstrapping method that significantly addresses this problem. This approach employs a diffusion model to enhance the rendering of novel views using trained 3D-GS, thereby streamlining the training process. Our results indicate that bootstrapping effectively reduces artifacts, as well as clear enhancements on the evaluation metrics. Furthermore, we show that our method is versatile and can be easily integrated, allowing various 3D reconstruction projects to benefit from our approach. This paper proposes a bootstrapping method using diffusion models to enhance the rendering of novel views in 3D Gaussian Splatting (3D-GS), improving the handling of unseen views and reducing artifacts. 3D-GS, while fast and high-quality, struggles with novel view synthesis, particularly those significantly different from training views. This limitation stems from insufficient sampling, leading to artifacts like distortion and aliasing. The method uses a trained 3D-GS model to render novel views, then refines these renderings using a diffusion model. These refined images are incorporated back into the training process, guiding the 3D-GS model to learn better representations. Bootstrapping effectively reduces artifacts in novel views, especially in challenging scenes with texture-less surfaces or limited observations. Significant improvement in quantitative metrics (PSNR, SSIM, LPIPS) compared to the original 3D-GS and other state-of-the-art methods. Demonstrated versatility and plug-and-play capability, showing promising results on multi-scale datasets. Increased time consumption due to the diffusion process. Challenges in rendering specific views and generating high-frequency details consistently. 3d gaussian splatting, diffusion models, novel view synthesis, artifact reduction, neural rendering
2404.18630 Report 4D-DRESS: A 4D Dataset of Real-world Human Clothing with Semantic Annotations Wenbo Wang, Hsuan-I Ho, Chen Guo, Boxiang Rong, Artur Grigorev, Jie Song, Juan Jose Zarate, Otmar Hilliges The studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect, synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. 4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k textured scans. Creating a real-world clothing dataset is challenging, particularly in annotating and segmenting the extensive and complex 4D human scans. To address this, we develop a semi-automatic 4D human parsing pipeline. We efficiently combine a human-in-the-loop process with automation to accurately label 4D scans in diverse garments and body movements. Leveraging precise annotations and high-quality garment meshes, we establish several benchmarks for clothing simulation and reconstruction. 4D-DRESS offers realistic and challenging data that complements synthetic sources, paving the way for advancements in research of lifelike human clothing. Website: https://ait.ethz.ch/4d-dress. The paper introduces 4D-DRESS, the first real-world 4D dataset of human clothing, containing high-quality 4D textured scans, vertex-level semantic labels, garment meshes, and registered SMPL/SMPL-X body models. Existing human clothing research relies heavily on synthetic datasets, which lack realism and fail to capture authentic clothing dynamics. Real-world datasets are needed to bridge this gap. The authors captured 520 motion sequences featuring 64 distinct outfits using a multi-view volumetric capture system. They developed a semi-automatic 4D human parsing pipeline to efficiently annotate the 78k frames with semantic labels. The semi-automatic pipeline achieved accurate vertex label assignment without manual intervention in 96.8% of frames. Evaluation benchmarks for clothing simulation showed that 4D-DRESS poses a realistic challenge for existing algorithms. Benchmarks for clothed human reconstruction highlighted the difficulty of current methods in accurately reconstructing real-world clothing, especially loose garments. The pipeline's computational cost and the manual effort required for rectification limit the dataset's scalability. Future work includes expanding the dataset with more diverse subjects and clothing, and developing real-time 4D annotation and rectification tools. 4d human clothing, dataset, semantic segmentation, clothing simulation, human reconstruction
2404.18620 Report FlexiFilm: Long Video Generation with Flexible Conditions Yichen Ouyang, jianhao Yuan, Hao Zhao, Gaoang Wang, Bo zhao Generating long and consistent videos has emerged as a significant yet challenging problem. While most existing diffusion-based video generation models, derived from image generation models, demonstrate promising performance in generating short videos, their simple conditioning mechanism and sampling strategy-originally designed for image generation-cause severe performance degradation when adapted to long video generation. This results in prominent temporal inconsistency and overexposure. Thus, in this work, we introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between generation and multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate FlexiFilm generates long and consistent videos, each over 30 seconds in length, outperforming competitors in qualitative and quantitative analyses. Project page: https://y-ichen.github.io/FlexiFilm-Page/ FlexiFilm, a novel latent video diffusion model specifically designed for generating long and consistent videos, addressing the limitations of existing methods in handling long-duration sequences. Existing diffusion-based video generation models struggle with long videos, exhibiting temporal inconsistency and overexposure due to insufficient conditioning mechanisms and sampling strategies. FlexiFilm introduces a temporal conditioner to enhance consistency by establishing relationships between generated frames and multi-modal conditions. It also employs a resampling strategy to mitigate overexposure and a co-training method for improved temporal coherence. FlexiFilm generates high-quality videos of over 30 seconds, outperforming baselines in terms of length and consistency. Quantitative evaluations demonstrate FlexiFilm's superiority in visual quality (FVD) and inter-frame consistency. Ablation studies highlight the importance of the temporal conditioner, co-training, and resampling strategy in achieving long and consistent video generation. The reliance on a large-scale driving dataset may limit generalization to other domains, necessitating further exploration with diverse datasets. The computational cost associated with long video generation remains a consideration for practical applications. long video generation, conditional video generation, diffusion model, temporal consistency, resampling strategy
2404.18598 Report Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting Tianyidan Xie, Rui Ma, Qian Wang, Xiaoqian Ye, Feixuan Liu, Ying Tai, Zhenyu Zhang, Zili Yi Recent advancements in image inpainting, particularly through diffusion modeling, have yielded promising outcomes. However, when tested in scenarios involving the completion of images based on the foreground objects, current methods that aim to inpaint an image in an end-to-end manner encounter challenges such as "over-imagination", inconsistency between foreground and background, and limited diversity. In response, we introduce Anywhere, a pioneering multi-agent framework designed to address these issues. Anywhere utilizes a sophisticated pipeline framework comprising various agents such as Visual Language Model (VLM), Large Language Model (LLM), and image generation models. This framework consists of three principal components: the prompt generation module, the image generation module, and the outcome analyzer. The prompt generation module conducts a semantic analysis of the input foreground image, leveraging VLM to predict relevant language descriptions and LLM to recommend optimal language prompts. In the image generation module, we employ a text-guided canny-to-image generation model to create a template image based on the edge map of the foreground image and language prompts, and an image refiner to produce the outcome by blending the input foreground and the template image. The outcome analyzer employs VLM to evaluate image content rationality, aesthetic score, and foreground-background relevance, triggering prompt and image regeneration as needed. Extensive experiments demonstrate that our Anywhere framework excels in foreground-conditioned image inpainting, mitigating "over-imagination", resolving foreground-background discrepancies, and enhancing diversity. It successfully elevates foreground-conditioned image inpainting to produce more reliable and diverse results. Introduces "Anywhere", a multi-agent framework for foreground-conditioned image inpainting that addresses issues like "over-imagination", inconsistencies, and limited diversity in existing end-to-end models. Current image inpainting methods struggle to generate reliable and diverse results for foreground-conditioned image completion, often producing illogical or repetitive outputs. Utilizes a pipeline with VLM, LLM, and image generation agents. A prompt generation module analyzes the foreground to create descriptive prompts. An image generation module generates a background template, refines it, and blends it with the foreground. An outcome analyzer evaluates the result and triggers prompt regeneration for improved quality. Significantly reduces "over-imagination" by intelligently inpainting irrelevant content around the foreground. Generates more diverse and contextually relevant backgrounds compared to existing methods. Achieves a lower bad case rate and higher aesthetic scores than both open-source and commercial inpainting tools. Faces challenges with transparent or semi-transparent foreground objects. The outcome analyzer struggles to accurately assess image rationality in terms of lighting and shadowing. image inpainting, multi-agent systems, vision-language models, large language models, diffusion models
2404.18454 Report 3D Gaussian Splatting with Deferred Reflection Keyang Ye, Qiming Hou, Kun Zhou The advent of neural and Gaussian-based radiance field methods have achieved great success in the field of novel view synthesis. However, specular reflection remains non-trivial, as the high frequency radiance field is notoriously difficult to fit stably and accurately. We present a deferred shading method to effectively render specular reflection with Gaussian splatting. The key challenge comes from the environment map reflection model, which requires accurate surface normal while simultaneously bottlenecks normal estimation with discontinuous gradients. We leverage the per-pixel reflection gradients generated by deferred shading to bridge the optimization process of neighboring Gaussians, allowing nearly correct normal estimations to gradually propagate and eventually spread over all reflective objects. Our method significantly outperforms state-of-the-art techniques and concurrent work in synthesizing high-quality specular reflection effects, demonstrating a consistent improvement of peak signal-to-noise ratio (PSNR) for both synthetic and real-world scenes, while running at a frame rate almost identical to vanilla Gaussian splatting. This paper presents a deferred shading method to render high-quality specular reflection with Gaussian splatting for novel view synthesis, addressing the challenge of modeling high-frequency specular reflection. Specular reflection is a challenging aspect of novel view synthesis, and existing Gaussian splatting methods struggle to model it accurately, often resulting in poor visual quality and compromised geometry. The method employs a two-pass rendering pipeline: a Gaussian splatting pass to generate screen-space maps of base color, normal, and reflection strength, followed by a deferred reflection pass using an environment map for specular reflection. A novel training algorithm featuring normal propagation is introduced to address the challenge of accurate normal estimation. Significantly outperforms state-of-the-art methods and concurrent work in synthesizing high-quality specular reflection effects, demonstrating a consistent PSNR improvement for both synthetic and real-world scenes. Achieves real-time frame rates almost identical to vanilla Gaussian splatting, thanks to its efficient deferred shading pipeline and reduced reliance on splitting Gaussians. Produces accurate normal and environment map estimations due to the pixel-level reflection computation and effective normal propagation during training. Limited to handling one layer of reflective materials per pixel due to the inherent limitation of traditional deferred shading. Normal propagation is less efficient on concave scenes, leading to slower convergence during training. novel view synthesis, deferred shading, gaussian splatting, specular reflection, real-time rendering
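The deferred reflection pass described in this entry can be sketched per pixel: reflect the view direction about the estimated normal, look the reflected direction up in an equirectangular environment map, and blend with the base color by the reflection strength. The lookup convention and nearest-neighbor sampling below are simplifying assumptions.

```python
import numpy as np

def reflect(view_dir, normal):
    """Reflect unit view direction about unit normal: r = d - 2 (d.n) n."""
    return view_dir - 2.0 * np.sum(view_dir * normal, axis=-1, keepdims=True) * normal

def deferred_reflection(base_color, normal, view_dir, strength, env_map):
    """base_color, normal, view_dir: (H, W, 3); strength: (H, W, 1);
    env_map: (He, We, 3) equirectangular environment map."""
    r = reflect(view_dir, normal)
    r /= np.linalg.norm(r, axis=-1, keepdims=True) + 1e-8
    theta = np.arccos(np.clip(r[..., 1], -1.0, 1.0))       # polar angle from +Y
    phi = np.arctan2(r[..., 2], r[..., 0]) + np.pi          # azimuth in [0, 2*pi)
    He, We, _ = env_map.shape
    v = np.clip((theta / np.pi * He).astype(int), 0, He - 1)
    u = np.clip((phi / (2 * np.pi) * We).astype(int), 0, We - 1)
    env = env_map[v, u]                                     # nearest-neighbor lookup
    return (1.0 - strength) * base_color + strength * env   # blend by reflection strength
```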
2404.18409 Report PKU-AIGIQA-4K: A Perceptual Quality Assessment Database for Both Text-to-Image and Image-to-Image AI-Generated Images Jiquan Yuan, Fanyi Yang, Jihe Li, Xinyan Cao, Jinming Che, Jinlong Lin, Xixin Cao In recent years, image generation technology has rapidly advanced, resulting in the creation of a vast array of AI-generated images (AIGIs). However, the quality of these AIGIs is highly inconsistent, with low-quality AIGIs severely impairing the visual experience of users. Due to the widespread application of AIGIs, the AI-generated image quality assessment (AIGIQA), aimed at evaluating the quality of AIGIs from the perspective of human perception, has garnered increasing interest among scholars. Nonetheless, current research has not yet fully explored this field. We have observed that existing databases are limited to images generated from single scenario settings. Databases such as AGIQA-1K, AGIQA-3K, and AIGCIQA2023, for example, only include images generated by text-to-image generative models. This oversight highlights a critical gap in the current research landscape, underscoring the need for dedicated databases catering to image-to-image scenarios, as well as more comprehensive databases that encompass a broader range of AI-generated image scenarios. Addressing these issues, we have established a large scale perceptual quality assessment database for both text-to-image and image-to-image AIGIs, named PKU-AIGIQA-4K. We then conduct a well-organized subjective experiment to collect quality labels for AIGIs and perform a comprehensive analysis of the PKU-AIGIQA-4K database. Regarding the use of image prompts during the training process, we propose three image quality assessment (IQA) methods based on pre-trained models that include a no-reference method NR-AIGCIQA, a full-reference method FR-AIGCIQA, and a partial-reference method PR-AIGCIQA. Finally, leveraging the PKU-AIGIQA-4K database, we conduct extensive benchmark experiments and compare the performance of the proposed methods and the current IQA methods. This paper introduces PKU-AIGIQA-4K, the first perceptual quality assessment database to include both text-to-image and image-to-image AI-generated images, addressing the lack of comprehensive datasets for evaluating AI-generated images. Evaluating the quality of AI-generated images is crucial as their applications expand, however, existing databases only focus on single scenario settings limiting the scope of assessment. The authors collect images generated by three popular models (Midjourney, Stable Diffusion, DALLE3) using both text and image prompts. They conduct subjective experiments to obtain quality labels and propose three IQA methods: NR-AIGCIQA (no-reference), FR-AIGCIQA (full-reference), and PR-AIGCIQA (partial-reference). The PKU-AIGIQA-4K database demonstrates diverse perceptual scores across different generation methods and image types. The proposed PR-AIGCIQA method, leveraging image prompts, often outperforms the NR-AIGCIQA method, indicating the importance of using reference images. The study reveals that current IQA methods, including the proposed ones, still need improvement to better align with human perception of AIGIs. The performance of different IQA methods varies significantly depending on the visual backbone network used. The current IQA methods and proposed methods require further refinement to improve their alignment with human preferences for AIGIs. 
ai-generated images, image quality assessment, perceptual quality, text-to-image generation, image-to-image generation
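A sketch of the no-reference vs. full-/partial-reference setups proposed in this entry: a pretrained backbone extracts features from the generated image and, when an image prompt is available, from the reference as well, with the concatenated features regressed to a quality score. The ResNet-18 backbone and head sizes here are stand-ins, not the benchmarked configurations.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AIGCIQAHead(nn.Module):
    """Quality regressor. With a reference (image prompt) it behaves like a
    full-/partial-reference model by concatenating both feature vectors."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)         # pretrained weights would be loaded in practice
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.nr_head = nn.Linear(feat_dim, 1)
        self.fr_head = nn.Linear(2 * feat_dim, 1)

    def forward(self, gen_img, ref_img=None):
        f = self.features(gen_img).flatten(1)
        if ref_img is None:                       # NR-style: generated image only
            return self.nr_head(f)
        fr = self.features(ref_img).flatten(1)    # FR-/PR-style: add image-prompt features
        return self.fr_head(torch.cat([f, fr], dim=1))

# score = AIGCIQAHead()(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```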
2404.18343 Report G-Refine: A General Quality Refiner for Text-to-Image Generation Chunyi Li, Haoning Wu, Hongkun Hao, Zicheng Zhang, Tengchaun Kou, Chaofeng Chen, Lei Bai, Xiaohong Liu, Weisi Lin, Guangtao Zhai With the evolution of Text-to-Image (T2I) models, the quality defects of AI-Generated Images (AIGIs) pose a significant barrier to their widespread adoption. In terms of both perception and alignment, existing models cannot always guarantee high-quality results. To mitigate this limitation, we introduce G-Refine, a general image quality refiner designed to enhance low-quality images without compromising the integrity of high-quality ones. The model is composed of three interconnected modules: a perception quality indicator, an alignment quality indicator, and a general quality enhancement module. Based on the mechanisms of the Human Visual System (HVS) and syntax trees, the first two indicators can respectively identify the perception and alignment deficiencies, and the last module can apply targeted quality enhancement accordingly. Extensive experimentation reveals that when compared to alternative optimization methods, AIGIs after G-Refine outperform in 10+ quality metrics across 4 databases. This improvement significantly contributes to the practical application of contemporary T2I models, paving the way for their broader adoption. The code will be released on https://github.com/Q-Future/Q-Refine. The paper proposes G-Refine, a general image quality refiner for text-to-image generation designed to enhance low-quality images without degrading high-quality ones. Existing text-to-image models often produce inconsistent results with varying quality, hindering their widespread adoption. Current optimization methods either lack text guidance or struggle to balance refinement between low and high-quality regions. G-Refine uses three modules: a perception quality indicator (PQ-Map), an alignment quality indicator (AQ-Map), and a general quality enhancement module. PQ-Map leverages modified CLIP encoders and quality-related factors to identify perceptual deficiencies. AQ-Map analyzes prompt semantics through syntax trees to locate alignment issues. The enhancement module then uses these maps to guide a multi-stage denoising process. G-Refine outperforms competing methods in over 90% of cases across 13 quality indicators and 4 datasets. G-Refine exhibits minimal negative optimization compared to other methods, indicating its ability to selectively enhance low-quality regions. Both PQ-Map and AQ-Map, when used independently, demonstrate strong performance in quality assessment tasks, even surpassing some state-of-the-art methods. The alignment optimization effectiveness is less prominent on models with inherently high generation quality. Future work involves exploring further optimization for advanced text-to-image models, particularly in improving alignment quality. text-to-image generation, image quality assessment, text-to-image alignment, image refinement, ai-generated content
2404.18284 Report S3-SLAM: Sparse Tri-plane Encoding for Neural Implicit SLAM Zhiyao Zhang, Yunzhou Zhang, Yanmin Wu, Bin Zhao, Xingshuo Wang, Rui Tian With the emergence of Neural Radiance Fields (NeRF), neural implicit representations have gained widespread applications across various domains, including simultaneous localization and mapping. However, current neural implicit SLAM faces a challenging trade-off problem between performance and the number of parameters. To address this problem, we propose sparse tri-plane encoding, which efficiently achieves scene reconstruction at resolutions up to 512 using only 2~4% of the commonly used tri-plane parameters (reduced from 100MB to 2~4MB). On this basis, we design S3-SLAM to achieve rapid and high-quality tracking and mapping through sparsifying plane parameters and integrating orthogonal features of tri-plane. Furthermore, we develop hierarchical bundle adjustment to achieve globally consistent geometric structures and reconstruct high-resolution appearance. Experimental results demonstrate that our approach achieves competitive tracking and scene reconstruction with minimal parameters on three datasets. Source code will soon be available. This paper proposes S3-SLAM, a neural implicit SLAM leveraging a novel sparse tri-plane encoding for rapid iteration and parameter sparsity in high-fidelity scene reconstruction. Existing neural implicit SLAM methods struggle to balance performance with a manageable number of parameters, especially at high resolutions. The sparse tri-plane encoding represents scenes compactly via 2D hash-grid plane features. S3-SLAM integrates this with multi-resolution encoding and a hierarchical bundle adjustment (HBA) for globally consistent geometry and high-resolution appearance reconstruction. S3-SLAM achieves competitive tracking accuracy and high-fidelity scene reconstruction with minimal parameters on Replica, ScanNet, and TUM RGB-D datasets. The sparse tri-plane encoding reduces memory consumption to 2-4% of regular tri-plane encoding at 512 resolution. HBA improves local appearance details while maintaining global consistency. The current approach lacks genuine local updates, potentially leading to forgetting issues. Future work will address this by implementing local update mechanisms. neural implicit slam, sparse tri-plane encoding, neural rendering, 3d reconstruction, hierarchical bundle adjustment
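For readers unfamiliar with tri-plane encodings, the sketch below shows the basic lookup behind the entry above: a 3D point is projected onto the xy/xz/yz planes, per-plane features are bilinearly sampled, and the orthogonal features are combined. It uses small dense plane grids purely for clarity; S3-SLAM's contribution is replacing dense planes with sparse (hash-grid) plane parameters at multiple resolutions, which is not reproduced here, and the summation used to combine planes is an illustrative choice.

```python
import torch
import torch.nn.functional as F

class TriPlaneEncoder(torch.nn.Module):
    """Dense tri-plane feature lookup (illustrative; the paper sparsifies the planes)."""
    def __init__(self, res: int = 64, channels: int = 16):
        super().__init__()
        # One learnable 2D feature grid per coordinate plane: xy, xz, yz.
        self.planes = torch.nn.Parameter(0.01 * torch.randn(3, channels, res, res))

    def forward(self, pts):                          # pts: (N, 3) in [-1, 1]^3
        xy, xz, yz = pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]
        feats = []
        for plane, coords in zip(self.planes, (xy, xz, yz)):
            grid = coords.view(1, -1, 1, 2)          # grid_sample expects (N, H, W, 2)
            sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                    mode='bilinear', align_corners=True)
            feats.append(sampled.reshape(plane.shape[0], -1).t())  # (N, C)
        # Combine orthogonal plane features; summation is one common choice.
        return feats[0] + feats[1] + feats[2]

enc = TriPlaneEncoder()
pts = torch.rand(1024, 3) * 2 - 1
print(enc(pts).shape)  # torch.Size([1024, 16]) -> fed to small MLP decoders for geometry/color
```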
2404.18212 Report Paint by Inpaint: Learning to Add Image Objects by Removing Them First Navve Wasserman, Noam Rotstein, Roy Ganz, Ron Kimmel Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions without requiring user-provided input masks remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), attributed to the utilization of segmentation mask datasets alongside inpainting models that inpaint within these masks. Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we utilize a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions. We show that the trained model surpasses existing ones both qualitatively and quantitatively, and release the large-scale dataset alongside the trained models for the community. Introduces Paint by Inpaint, a framework for image object addition that leverages the inverse relationship between object addition and removal. This framework is used to create PIPE, a large-scale object addition dataset, and train a diffusion model that achieves state-of-the-art performance on this task. Existing text-guided, mask-free object addition methods struggle with consistency and rely on synthetic datasets. This paper addresses these limitations by creating a large-scale dataset with real image targets and inherent consistency. PIPE is constructed by leveraging segmentation datasets and an inpainting model to remove objects from images, resulting in source-target pairs. Natural language instructions are generated using class names, VLM-LLM pipelines, and object reference datasets. A diffusion model is then trained on PIPE to perform object addition. The trained model outperforms existing methods on object addition benchmarks, demonstrating superior fidelity to instructions and consistency. Human evaluation confirms the model's ability to produce higher-quality edits aligned with user instructions. Combining PIPE with general editing datasets enhances performance on broader editing tasks, indicating its potential beyond object addition. The data curation pipeline, while robust, is not entirely error-free, potentially impacting dataset quality. The effectiveness of instruction generation relies on the capabilities of VLMs and LLMs, which can still exhibit limitations in producing human-like instructions. image editing, object addition, diffusion models, vision-language models, dataset creation
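The PIPE construction described above is essentially a data pipeline: take a real image, remove one segmented object with an inpainting model, and keep (object-removed image, original image, instruction) as a training triplet for the inverse "add" direction. The pseudocode below sketches that flow; `segment_objects`, `inpaint`, `vlm_describe`, `llm_to_instruction`, and `quality_filter` are hypothetical placeholders for the segmentation, inpainting, VLM, LLM, and filtering components, not real APIs.

```python
from dataclasses import dataclass

@dataclass
class EditSample:
    source: "Image"      # object-removed image (model input)
    target: "Image"      # original real image (model output)
    instruction: str     # natural-language "add X ..." instruction

def build_pipe_dataset(images, segment_objects, inpaint, vlm_describe,
                       llm_to_instruction, quality_filter):
    """Sketch of the removal-then-inversion dataset construction (helpers are hypothetical)."""
    samples = []
    for image in images:
        for obj_mask, class_name in segment_objects(image):
            removed = inpaint(image, obj_mask)            # object removal = the "easy" direction
            if not quality_filter(image, removed, obj_mask):
                continue                                  # drop removals with visible artifacts
            description = vlm_describe(image, obj_mask, class_name)
            instruction = llm_to_instruction(description) # e.g. "add a red umbrella ..."
            # Training pairs invert the removal: the model learns to ADD the object back.
            samples.append(EditSample(source=removed, target=image, instruction=instruction))
    return samples
```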
2404.18136 Report SafePaint: Anti-forensic Image Inpainting with Domain Adaptation Dunyun Chen, Xin Liao, Xiaoshuai Wu, Shiwei Chen Existing image inpainting methods have achieved remarkable accomplishments in generating visually appealing results, often accompanied by a trend toward creating more intricate structural textures. However, while these models excel at creating more realistic image content, they often leave noticeable traces of tampering, posing a significant threat to security. In this work, we take the anti-forensic capabilities into consideration, firstly proposing an end-to-end training framework for anti-forensic image inpainting named SafePaint. Specifically, we innovatively formulated image inpainting as two major tasks: semantically plausible content completion and region-wise optimization. The former is similar to current inpainting methods that aim to restore the missing regions of corrupted images. The latter, through domain adaptation, endeavors to reconcile the discrepancies between the inpainted region and the unaltered area to achieve anti-forensic goals. Through comprehensive theoretical analysis, we validate the effectiveness of domain adaptation for anti-forensic performance. Furthermore, we meticulously crafted a region-wise separated attention (RWSA) module, which not only aligns with our objective of anti-forensics but also enhances the performance of the model. Extensive qualitative and quantitative evaluations show our approach achieves comparable results to existing image inpainting methods while offering anti-forensic capabilities not available in other methods. This paper proposes SafePaint, an end-to-end training framework for anti-forensic image inpainting, enhancing image security and reliability by incorporating anti-forensic capabilities as an evaluation metric for inpainting quality. Existing image inpainting methods excel in visual realism but often leave detectable tampering traces, posing security risks. This work addresses the need for inpainted images that can resist forensic analysis, aligning with human perception and improving inpainting quality assessment. SafePaint decouples image inpainting into two stages: content completion and region-wise optimization. It leverages domain adaptation to minimize discrepancies between inpainted and unaltered regions, thereby enhancing anti-forensic performance. A novel region-wise separated attention (RWSA) module further improves anti-forensic capabilities and overall model performance. SafePaint significantly outperforms state-of-the-art methods in anti-forensic capabilities based on evaluations using multiple forgery and inpainting detectors. The proposed method achieves comparable visual quality results to existing inpainting techniques, maintaining a balance between realism and security. Ablation studies confirm the effectiveness of domain distance loss and the RWSA module in enhancing anti-forensic performance. The paper acknowledges a potential trade-off between visual quality and anti-forensic performance, suggesting further exploration to minimize this trade-off. Future work could explore extending SafePaint's capabilities to address more complex image manipulation scenarios beyond inpainting. image inpainting, anti-forensics, domain adaptation, attention mechanism, image security
2404.18065 Report Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may often entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the necessity to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D from the same text prompt. Presents Grounded-Dreamer, a two-stage approach for generating high-fidelity 3D assets from complex, compositional text prompts using a pre-trained multi-view diffusion model. Addresses the limitations of existing text-to-3D methods that struggle to accurately render compositional prompts and ensure diversity in generated objects. Employs an attention refocusing mechanism for generating compositionally accurate four-view images and a hybrid optimization strategy combining sparse-view NeRF with text-guided diffusion priors for detailed 3D reconstruction. Achieves superior text-image alignment compared to state-of-the-art baselines. Generates diverse 3D assets from the same text prompt by varying the input four-view images. Demonstrates high-fidelity 3D generation while preserving accurate compositional relationships. Reliance on diffusion-based Text-to-Image models can lead to limitations in color accuracy and foreground segmentation. Future work includes exploring seamless 2D-to-3D transitions and enhancing model versatility. text-to-3d synthesis, multi-view diffusion models, compositional generation, attention refocusing, sparse-view nerf
2404.18020 Report DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images Maria Mihaela Trusca, Tinne Tuytelaars, Marie-Francine Moens Text-based semantic image editing assumes the manipulation of an image using a natural language instruction. Although recent works are capable of generating creative and qualitative images, the problem is still mostly approached as a black box sensitive to generating unexpected outputs. Therefore, we propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve. It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, and the input image. The proposed Diffusion Masking with word Alignments (DM-Align) allows the editing of an image in a transparent and explainable way. It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream. When comparing to state-of-the-art baselines, quantitative and qualitative results show that DM-Align has superior performance in image editing conditioned on language instructions, well preserves the background of the image and can better cope with long text instructions. This paper introduces DM-Align, a novel model for text-based semantic image editing that uses word alignments between source and target text instructions to identify and manipulate specific image regions. Existing text-based image editing methods struggle to maintain background consistency and effectively handle long, complex instructions. DM-Align addresses these limitations by explicitly reasoning about which image parts to alter or preserve. DM-Align aligns words in the source and target instructions, segments the image based on aligned words, generates a global diffusion mask, refines it using segmented regions, and finally inpaints the masked areas using Stable Diffusion. DM-Align outperforms baselines in image-based metrics (FID, LPIPS, PWMSE), demonstrating superior editing quality, particularly with longer instructions. It excels at background preservation, as evidenced by significantly lower background FID, LPIPS, and PWMSE scores compared to baselines. Human evaluation confirms DM-Align's effectiveness, achieving higher scores for editing quality, background preservation, and overall image quality. DM-Align currently focuses on editing objects and their attributes, with future work exploring action editing. The model relies on accurate object detection and segmentation, which might be limited by the capabilities of the employed models (Grounded-SAM). image editing, semantic editing, text-guided image manipulation, diffusion models, word alignments
2404.17993 Report MinBackProp -- Backpropagating through Minimal Solvers Diana Sungatullina, Tomas Pajdla We present an approach to backpropagating through minimal problem solvers in end-to-end neural network training. Traditional methods relying on manually constructed formulas, finite differences, and autograd are laborious, approximate, and unstable for complex minimal problem solvers. We show that using the Implicit function theorem to calculate derivatives to backpropagate through the solution of a minimal problem solver is simple, fast, and stable. We compare our approach to (i) using the standard autograd on minimal problem solvers and relate it to existing backpropagation formulas through SVD-based and Eig-based solvers and (ii) implementing the backprop with an existing PyTorch Deep Declarative Networks (DDN) framework. We demonstrate our technique on a toy example of training outlier-rejection weights for 3D point registration and on a real application of training an outlier-rejection and RANSAC sampling network in image matching. Our method provides 100% stability and is 10 times faster compared to autograd, which is unstable and slow, and compared to DDN, which is stable but also slow. The paper proposes MinBackProp, a new approach to backpropagating through minimal problem solvers in end-to-end neural network training using the Implicit Function Theorem (IFT) and Deep Declarative Networks (DDN). Current methods for backpropagating through minimal problem solvers, such as manual differentiation, finite differences, and autograd, are often laborious, approximate, unstable, or inefficient for complex solvers. The paper leverages the IFT to directly compute derivatives of the minimal problem solver's output, leading to stable and efficient backpropagation. It also presents an alternative implementation using the DDN framework for simpler implementation and potential use in quick prototyping. MinBackProp with IFT demonstrates 100% stability in training an outlier rejection network for essential matrix estimation, compared to a 20-30% success rate for autograd-based methods. MinBackProp with IFT achieves a 10 times speedup in backward pass computation compared to autograd and DDN-based approaches. Both IFT and DDN implementations achieve comparable performance to the baseline method (∇-RANSAC) in terms of outlier rejection accuracy. The current implementation focuses on minimal problems with closed-form solutions, potentially limiting its applicability to a broader range of solvers. The paper explores the use of DDN for minimal problem backpropagation, but further investigation into its limitations and potential advantages is needed. minimal problem solvers, backpropagation, implicit function theorem, deep declarative networks, end-to-end learning
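The core idea above, differentiating through a solver's output with the Implicit Function Theorem rather than unrolling the solver, can be shown on a one-dimensional toy problem. In the sketch below, the forward pass runs a Newton solver for g(x, y) = y^3 + x*y - 1 = 0 without tracking gradients, and the backward pass uses dy/dx = -(∂g/∂y)^(-1) ∂g/∂x evaluated at the solution. This is only a scalar illustration of the IFT mechanism, not the paper's essential-matrix or point-registration solvers.

```python
import torch

class ImplicitRoot(torch.autograd.Function):
    """Solve g(x, y) = y^3 + x*y - 1 = 0 for y; backprop via the Implicit Function Theorem."""

    @staticmethod
    def forward(ctx, x):
        with torch.no_grad():                      # the solver itself is never unrolled
            y = torch.ones_like(x)
            for _ in range(50):                    # Newton iterations
                g = y**3 + x * y - 1.0
                y = y - g / (3.0 * y**2 + x)
        ctx.save_for_backward(x, y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        x, y = ctx.saved_tensors
        dg_dy = 3.0 * y**2 + x                     # ∂g/∂y at the solution
        dg_dx = y                                  # ∂g/∂x at the solution
        dy_dx = -dg_dx / dg_dy                     # IFT: dy/dx = -(∂g/∂y)^(-1) ∂g/∂x
        return grad_out * dy_dx

x = torch.tensor([2.0], requires_grad=True)
y = ImplicitRoot.apply(x)
y.backward()
# Sanity check against a finite-difference approximation of dy/dx.
eps = 1e-4
y_plus = ImplicitRoot.apply(x.detach() + eps)
print(x.grad.item(), ((y_plus - y.detach()) / eps).item())  # should closely agree
```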
2404.17876 Report DF-SLAM: Neural Feature Rendering Based on Dictionary Factors Representation for High-Fidelity Dense Visual SLAM System Weifeng Wei, Jie Wang We introduce a high-fidelity neural implicit dense visual Simultaneous Localization and Mapping (SLAM) system, termed DF-SLAM. In our work, we employ dictionary factors for scene representation, encoding the geometry and appearance information of the scene as a combination of basis and coefficient factors. Compared to neural implicit SLAM methods that directly encode scene information as features, our method exhibits superior scene detail reconstruction capabilities and more efficient memory usage, while our model size is insensitive to the size of the scene map, making our method more suitable for large-scale scenes. Additionally, we employ feature integration rendering to accelerate color rendering speed while ensuring color rendering quality, further enhancing the real-time performance of our neural SLAM method. Extensive experiments on synthetic and real-world datasets demonstrate that our method is competitive with existing state-of-the-art neural implicit SLAM methods in terms of real-time performance, localization accuracy, and scene reconstruction quality. Our source code is available at https://github.com/funcdecl/DF-SLAM. DF-SLAM, a high-fidelity neural implicit dense visual SLAM system, uses dictionary factors for scene representation and feature integration rendering for efficient and high-quality reconstruction. Existing neural implicit SLAM methods struggle to balance accuracy, memory efficiency, and real-time performance, particularly in large-scale scenes. The scene is represented using separate basis and coefficient factor grids for geometry and appearance. Feature integration rendering accelerates color rendering by approximating the appearance feature of the entire ray. DF-SLAM achieves superior scene detail reconstruction compared to baseline methods on Replica and ScanNet datasets. It demonstrates robust tracking performance with reduced drift on Replica, ScanNet, and TUM-RGBD datasets. The method exhibits efficient memory usage, remaining unaffected by map size, unlike memory-intensive alternatives like NICE-SLAM and ESLAM. Feature integration rendering may lead to artifacts in color rendering with extreme motion blur. Future work will address this by incorporating a deblurring module. dense visual slam, dictionary factors, feature integration rendering, neural implicit representations, real-time performance
2404.17774 Report High-quality Surface Reconstruction using Gaussian Surfels Pinxuan Dai, Jiamin Xu, Wenxiang Xie, Xinguo Liu, Huamin Wang, Weiwei Xu We propose a novel point-based representation, Gaussian surfels, to combine the advantages of the flexible optimization procedure in 3D Gaussian points and the surface alignment property of surfels. This is achieved by directly setting the z-scale of 3D Gaussian points to 0, effectively flattening the original 3D ellipsoid into a 2D ellipse. Such a design provides clear guidance to the optimizer. By treating the local z-axis as the normal direction, it greatly improves optimization stability and surface alignment. While the derivatives to the local z-axis computed from the covariance matrix are zero in this setting, we design a self-supervised normal-depth consistency loss to remedy this issue. Monocular normal priors and foreground masks are incorporated to enhance the quality of the reconstruction, mitigating issues related to highlights and background. We propose a volumetric cutting method to aggregate the information of Gaussian surfels so as to remove erroneous points in depth maps generated by alpha blending. Finally, we apply screened Poisson reconstruction method to the fused depth maps to extract the surface mesh. Experimental results show that our method demonstrates superior performance in surface reconstruction compared to state-of-the-art neural volume rendering and point-based rendering methods. This paper proposes Gaussian surfels, a novel point-based representation for high-quality surface reconstruction, combining the advantages of 3D Gaussian points' flexible optimization and surfels' surface alignment. 3D Gaussian Splatting (3DGS), while efficient for 3D scene reconstruction and rendering, struggles to generate high-quality geometry due to limitations like non-zero thickness and normal direction ambiguity. Gaussian surfels address these issues, improving surface alignment and reconstruction quality. The method flattens 3D Gaussian points into 2D ellipses by setting the z-scale to 0, directly representing surface normals. It introduces a self-supervised normal-depth consistency loss to address gradient vanishing issues and utilizes volumetric cutting to refine depth maps before meshing. Significantly outperforms 3DGS and SuGaR in surface reconstruction quality on DTU and BlendedMVS datasets. Achieves a good balance between reconstruction quality and speed, comparable to INSR and NeuS2 while reconstructing finer details compared to NeuS. Demonstrates superior generality over 3DGS in sparse view rendering, producing higher quality results. Struggles with accurate reconstruction in areas with strong specular reflections despite using monocular normal priors. Reconstructed surfaces may exhibit global shifts compared to ground truth for objects with weak textures. 3d surface reconstruction, gaussian surfels, depth-normal consistency, point-based rendering, neural rendering
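The key construction above, flattening a 3D Gaussian into a surfel by zeroing its z-scale, reduces to a specific form of the covariance matrix. The short sketch below builds Sigma = R diag(s_x^2, s_y^2, 0) R^T and reads the surfel normal off the local z-axis; it is a minimal numerical illustration of that definition, not the paper's rendering or optimization code, and the random-rotation setup is only for the example.

```python
import torch

def surfel_covariance(R: torch.Tensor, sx: float, sy: float):
    """Covariance of a Gaussian surfel: the z-scale is fixed to 0, so the 3D ellipsoid
    degenerates to a 2D ellipse lying in the plane spanned by R[:, 0] and R[:, 1]."""
    S = torch.diag(torch.tensor([sx**2, sy**2, 0.0]))
    sigma = R @ S @ R.T            # rank-2, positive semi-definite
    normal = R[:, 2]               # the local z-axis doubles as the surface normal
    return sigma, normal

# Example with an arbitrary rotation (orthonormalized via QR for illustration).
Q, _ = torch.linalg.qr(torch.randn(3, 3))
R = Q * torch.sign(torch.det(Q))   # make it a proper rotation (det = +1)
sigma, normal = surfel_covariance(R, sx=0.3, sy=0.1)
print(torch.linalg.matrix_rank(sigma).item())                      # 2: the Gaussian is flat
print(torch.allclose(sigma @ normal, torch.zeros(3), atol=1e-6))   # True: no extent along the normal
```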
2404.17762 Report Large Multi-modality Model Assisted AI-Generated Image Quality Assessment Puyi Wang, Wei Sun, Zicheng Zhang, Jun Jia, Yanwei Jiang, Zhichao Zhang, Xiongkuo Min, Guangtao Zhai Traditional deep neural network (DNN)-based image quality assessment (IQA) models leverage convolutional neural networks (CNN) or Transformer to learn the quality-aware feature representation, achieving commendable performance on natural scene images. However, when applied to AI-Generated images (AGIs), these DNN-based IQA models exhibit subpar performance. This situation is largely due to the semantic inaccuracies inherent in certain AGIs caused by uncontrollable nature of the generation process. Thus, the capability to discern semantic content becomes crucial for assessing the quality of AGIs. Traditional DNN-based IQA models, constrained by limited parameter complexity and training data, struggle to capture complex fine-grained semantic features, making it challenging to grasp the existence and coherence of semantic content of the entire image. To address the shortfall in semantic content perception of current IQA models, we introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model, which utilizes semantically informed guidance to sense semantic information and extract semantic vectors through carefully designed text prompts. Moreover, it employs a mixture of experts (MoE) structure to dynamically integrate the semantic information with the quality-aware features extracted by traditional DNN-based IQA models. Comprehensive experiments conducted on two AI-generated content datasets, AIGCQA-20k and AGIQA-3k show that MA-AGIQA achieves state-of-the-art performance, and demonstrate its superior generalization capabilities on assessing the quality of AGIs. Code is available at https://github.com/wangpuyi/MA-AGIQA. This paper introduces MA-AGIQA, a novel framework for assessing the quality of AI-generated images by integrating Large Multi-modality Models (LMMs) with traditional deep neural networks (DNNs) to address the limitation of DNNs in capturing semantic content. Existing DNN-based image quality assessment models, trained primarily on natural scene images, often fail to accurately evaluate the quality of AI-generated images, particularly in terms of semantic coherence and meaningfulness. MA-AGIQA leverages MANIQA as a quality-aware feature extractor and mPLUG-Owl2, an LMM, as a fine-grained semantic feature extractor guided by meticulously crafted text prompts. An adaptive fusion module, employing a mixture of experts structure, dynamically integrates these features to generate a comprehensive quality score. MA-AGIQA achieves state-of-the-art performance, surpassing existing methods on two AI-generated image datasets: AIGCQA-20k and AGIQA-3k. The integration of fine-grained semantic features extracted by the LMM significantly improves assessment accuracy, demonstrating a closer alignment with human perception. MA-AGIQA exhibits superior cross-dataset performance, highlighting its robust generalization capabilities. The current implementation of MA-AGIQA primarily focuses on semantic aspects and may benefit from incorporating additional features for a more holistic assessment. The computational cost associated with employing LMMs, even with fixed parameters, remains a consideration for future optimization. image quality assessment, ai-generated images, large multi-modality models, semantic content, mixture of experts
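The fusion step described above, dynamically weighting an LMM-derived semantic vector against a conventional quality-aware feature, can be pictured as a small gated mixture-of-experts head. The sketch below is a generic version of that idea with made-up dimensions and a simple softmax gate; MA-AGIQA's actual module may differ in its experts, gating, and feature sources.

```python
import torch
import torch.nn as nn

class GatedFusionHead(nn.Module):
    """Toy mixture-of-experts fusion of a quality-aware feature and a semantic feature."""
    def __init__(self, quality_dim=256, semantic_dim=512, hidden=128):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(quality_dim, hidden), nn.GELU()),   # expert on quality cues
            nn.Sequential(nn.Linear(semantic_dim, hidden), nn.GELU()),  # expert on semantic cues
        ])
        self.gate = nn.Linear(quality_dim + semantic_dim, 2)
        self.score = nn.Linear(hidden, 1)

    def forward(self, quality_feat, semantic_feat):
        w = torch.softmax(self.gate(torch.cat([quality_feat, semantic_feat], dim=-1)), dim=-1)
        mixed = w[:, :1] * self.experts[0](quality_feat) + w[:, 1:] * self.experts[1](semantic_feat)
        return self.score(mixed).squeeze(-1)       # predicted quality score

head = GatedFusionHead()
q = torch.randn(8, 256)    # e.g. features from a DNN-based IQA backbone such as MANIQA
s = torch.randn(8, 512)    # e.g. a semantic vector distilled from an LMM's prompted answers
print(head(q, s).shape)    # torch.Size([8])
```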
2404.17753 Report Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification Chao Yi, Lu Ren, De-Chuan Zhan, Han-Jia Ye CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation(CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives, thereby fully leveraging CLIP's robust cross-modal capabilities. The key to construct a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images. We introduce the Auto Text Generator(ATG) to automatically generate the required texts in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experiment results across various datasets and models confirm CODER's effectiveness. Code is available at:https://github.com/YCaigogogo/CVPR24-CODER. This paper introduces Cross-modal Neighbor Representation (CODER), a novel image representation method for CLIP that leverages cross-modal distances between images and neighboring texts in CLIP's feature space. This approach improves CLIP's performance in single-modality image feature extraction tasks. Directly using CLIP's image encoder for tasks like few-shot classification can be suboptimal due to a misalignment between its pre-training objectives (cross-modal matching) and feature extraction methods (unimodal). CODER addresses this by aligning feature extraction with CLIP's pre-training, thus improving its effectiveness in downstream tasks. CODER represents images based on their distances to neighboring texts in CLIP's feature space. To ensure diverse and dense text sampling, the authors introduce Auto Text Generator (ATG) which leverages LLMs like ChatGPT to automatically generate various high-quality, class-specific texts. CODER is applied to zero-shot and few-shot image classification using a two-stage approach for zero-shot and a similarity-based method for few-shot. CODER consistently improves CLIP's zero-shot image classification accuracy across diverse datasets and model architectures. The two-stage zero-shot classification method further enhances performance by using general CODER for preliminary classification and one-to-one specific CODER for reranking. CODER-Adapter, applying CODER to few-shot classification, outperforms existing training-free CLIP-based methods on most datasets. Generating texts with ATG using LLMs can be computationally expensive, especially with many classes. CODER's dimensionality, directly proportional to the number of classes, can be problematic for datasets with very few or many classes. cross-modal learning, clip, image representation, few-shot learning, zero-shot learning
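The representation above is easy to state: with L2-normalized CLIP features, an image's CODER vector is its cosine similarity to a bank of neighbor texts, and classification aggregates those similarities per class. The sketch below works on precomputed embeddings so no specific CLIP implementation is assumed; the per-class mean aggregation is an illustrative choice, not necessarily the paper's exact rule.

```python
import torch

def coder_features(image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Cross-modal neighbor representation: each image is described by its cosine
    similarity to every neighbor text (rows: images, columns: texts)."""
    image_feats = torch.nn.functional.normalize(image_feats, dim=-1)
    text_feats = torch.nn.functional.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.T

def zero_shot_predict(coder: torch.Tensor, text_class_ids: torch.Tensor, num_classes: int):
    """Aggregate neighbor-text similarities per class (mean here) and pick the argmax."""
    scores = torch.zeros(coder.shape[0], num_classes)
    for c in range(num_classes):
        scores[:, c] = coder[:, text_class_ids == c].mean(dim=-1)
    return scores.argmax(dim=-1)

# Toy usage with random embeddings standing in for CLIP outputs.
imgs = torch.randn(10, 512)                 # CLIP image embeddings
texts = torch.randn(60, 512)                # many generated class texts (e.g. from an LLM)
text_class_ids = torch.arange(60) % 3       # which class each neighbor text describes
coder = coder_features(imgs, texts)         # (10, 60) cross-modal neighbor representation
print(zero_shot_predict(coder, text_class_ids, num_classes=3))
```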
2404.17672 Report BlenderAlchemy: Editing 3D Graphics with Vision-Language Models Ian Huang, Guandao Yang, Leonidas Guibas Graphics design is important for various applications, including movie production and game design. To create a high-quality scene, designers usually need to spend hours in software like Blender, in which they might need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences, making automation difficult. In this paper, we propose a system that leverages Vision-Language Models (VLMs), like GPT-4V, to intelligently search the design action space to arrive at an answer that can satisfy a user's intent. Specifically, we design a vision-based edit generator and state evaluator to work together to find the correct sequence of actions to achieve the goal. Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of VLMs with "imagined" reference images from image-generation models, providing visual grounding of abstract language descriptions. In this paper, we provide empirical evidence suggesting our system can produce simple but tedious Blender editing sequences for tasks such as editing procedural materials from text and/or reference images, as well as adjusting lighting configurations for product renderings in complex scenes. Presents BlenderAlchemy, a system leveraging Vision-Language Models (VLMs) like GPT-4V to automate 3D graphics editing in Blender based on text and image inputs. Automating tedious 3D design tasks, like material and lighting design, in software like Blender can boost artist productivity and impact various industries. Uses a visual program search approach with a vision-aware edit generator and a visual state evaluator to iteratively refine Blender programs based on user intent. Employs 'visual imagination' using image-generation models to enhance VLM understanding when only text input is provided. Successfully edits procedural materials from text descriptions and reference images, outperforming prior work like BlenderGPT. Demonstrates applicability to lighting design by adjusting lighting configurations based on user intent. Shows the importance of key components: visual state evaluator, visual edit generator, edit reversion mechanism, and visual imagination module. Currently limited to material and lighting editing, with future work exploring animation, modeling, and other design workflows. Relies on expensive and high-latency VLMs, requiring future optimization or advancements in VLM efficiency. vision-language models, 3d graphics editing, blender, procedural material editing, lighting design
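At its core the system above is a propose-render-evaluate search over edited Blender programs. The loop below is a bare-bones sketch of that control flow; `propose_edits`, `render`, and `pick_better` stand in for the VLM-based edit generator, the Blender renderer, and the VLM-based state evaluator, and are hypothetical placeholders rather than real APIs.

```python
def visual_program_search(program, intent, propose_edits, render, pick_better,
                          depth=3, width=4):
    """Iteratively refine a Blender editing program toward the user's intent.
    Keeping the best-so-far state gives edit-reversion behaviour for free:
    a bad round of candidates simply leaves `best` unchanged."""
    best = program
    best_image = render(best)
    for _ in range(depth):
        candidates = propose_edits(best, best_image, intent, n=width)  # VLM edit generator
        for cand in candidates:
            cand_image = render(cand)
            # The VLM state evaluator compares renders against the (text/image) intent.
            if pick_better(cand_image, best_image, intent) == "candidate":
                best, best_image = cand, cand_image
    return best
```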
2404.17571 Report Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, Changxin Gao Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos. This paper proposes Tunnel Try-on, the first diffusion-based video virtual try-on model demonstrating state-of-the-art performance in complex, real-world scenarios. Video virtual try-on offers a more comprehensive and realistic clothing try-on experience than image-based try-on but faces challenges in preserving clothing details and generating coherent motions. Existing methods struggle to handle complex scenarios with diverse clothing, backgrounds, and movements. Tunnel Try-on introduces a "focus tunnel" to zoom in on the clothing region, enhancing detail preservation. It leverages a Kalman filter to smooth tunnel movements, injects tunnel embeddings into attention layers for motion consistency, and employs an environment encoder for capturing background context. Tunnel Try-on significantly outperforms existing video try-on methods on standard benchmarks and a newly collected dataset. It effectively handles various camera movements, human motions, and clothing types, generating high-fidelity try-on results. Ablation studies demonstrate the contribution of each proposed component to the model's performance. The current implementation relies on a pre-trained pose estimator, which might limit its generalization ability to unseen poses. Further research can explore incorporating user preferences and interactive controls to enhance the personalization and controllability of virtual try-on. virtual try-on, video generation, diffusion models, computer vision, fashion technology
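One concrete, reusable piece of the pipeline above is smoothing per-frame crop boxes with a Kalman filter so the "focus tunnel" does not jitter. Below is a generic constant-velocity Kalman filter applied independently to each box coordinate; the state model and noise values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def kalman_smooth_1d(z, q=1e-3, r=1e-1):
    """Constant-velocity Kalman filter over a 1D measurement sequence z (e.g. box center x)."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition: position + velocity
    H = np.array([[1.0, 0.0]])               # we only observe the position
    Q, R = q * np.eye(2), np.array([[r]])
    x, P = np.array([z[0], 0.0]), np.eye(2)
    out = []
    for zk in z:
        x, P = F @ x, F @ P @ F.T + Q                         # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)          # Kalman gain
        x = x + K @ (np.array([zk]) - H @ x)                  # update state
        P = (np.eye(2) - K @ H) @ P                           # update covariance
        out.append(x[0])
    return np.array(out)

# Smooth a noisy sequence of crop boxes given as (cx, cy, w, h) per frame.
boxes = np.cumsum(np.random.randn(100, 4) * 2, axis=0) + 200.0
smoothed = np.stack([kalman_smooth_1d(boxes[:, i]) for i in range(4)], axis=1)
print(boxes.shape, smoothed.shape)  # (100, 4) (100, 4)
```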
2404.17528 Report Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields Tianqi Liu, Xinyi Ye, Min Shi, Zihao Huang, Zhiyu Pan, Zhan Peng, Zhiguo Cao Generalizable NeRF aims to synthesize novel views for unseen scenes. Common practices involve constructing variance-based cost volumes for geometry reconstruction and encoding 3D descriptors for decoding novel views. However, existing methods show limited generalization ability in challenging conditions due to inaccurate geometry, sub-optimal descriptors, and decoding strategies. We address these issues point by point. First, we find the variance-based cost volume exhibits failure patterns as the features of pixels corresponding to the same point can be inconsistent across different views due to occlusions or reflections. We introduce an Adaptive Cost Aggregation (ACA) approach to amplify the contribution of consistent pixel pairs and suppress inconsistent ones. Unlike previous methods that solely fuse 2D features into descriptors, our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D context into descriptors through spatial and inter-view interaction. When decoding the descriptors, we observe the two existing decoding strategies excel in different areas, which are complementary. A Consistency-Aware Fusion (CAF) strategy is proposed to leverage the advantages of both. We incorporate the above ACA, SVA, and CAF into a coarse-to-fine framework, termed Geometry-aware Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains state-of-the-art performance across multiple datasets. Code is available at https://github.com/TQTQliu/GeFu . This paper introduces GeFu, a novel generalizable NeRF framework for novel view synthesis in unseen scenes, improving geometry reconstruction, descriptor encoding, and rendering strategies. Existing generalizable NeRF methods struggle to achieve satisfactory results in challenging conditions, particularly within occluded areas, due to limitations in geometry accuracy, descriptor quality, and decoding strategies. GeFu incorporates Adaptive Cost Aggregation (ACA) for robust geometry estimation, Spatial-View Aggregator (SVA) for 3D context-aware descriptors, and Consistency-Aware Fusion (CAF) for integrating different rendering strategies. GeFu achieves state-of-the-art performance on DTU, Real Forward-facing, and NeRF Synthetic datasets without per-scene fine-tuning. After fine-tuning, GeFu surpasses previous generalizable NeRFs and achieves comparable or superior results to NeRF. GeFu exhibits high accuracy in depth map generation, outperforming other generalizable NeRF methods and even surpassing some MVS methods. GeFu is designed for static scenes and may not be directly applicable to dynamic scenes. The fine-tuning and rendering processes remain computationally demanding for NeRF-based methods, including GeFu. novel view synthesis, generalizable nerf, neural radiance fields, multi-view stereo, 3d reconstruction
2404.17486 Report TextGaze: Gaze-Controllable Face Generation with Natural Language Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang Generating face image with specific gaze information has attracted considerable attention. Existing approaches typically input gaze values directly for face generation, which is unnatural and requires annotated gaze datasets for training, thereby limiting its application. In this paper, we present a novel gaze-controllable face generation task. Our approach inputs textual descriptions that describe human gaze and head behavior and generates corresponding face images. Our work first introduces a text-of-gaze dataset containing over 90k text descriptions spanning a dense distribution of gaze and head poses. We further propose a gaze-controllable text-to-face method. Our method contains a sketch-conditioned face diffusion module and a model-based sketch diffusion module. We define a face sketch based on facial landmarks and eye segmentation map. The face diffusion module generates face images from the face sketch, and the sketch diffusion module employs a 3D face model to generate face sketch from text description. Experiments on the FFHQ dataset show the effectiveness of our method. We will release our dataset and code for future research. Introduces TextGaze, a novel gaze-controllable face generation method that uses textual descriptions of human gaze and head behavior instead of numerical gaze values. Existing gaze-controllable face generation methods rely on numerical gaze values, which is unnatural and requires annotated datasets, limiting their application. Text descriptions are more intuitive and user-friendly. A two-stage method: 1) Text-to-Gaze Generation: Extracts gaze and head pose from text descriptions using CLIP embeddings and a text attention module, then generates a face sketch using a 3D face model. 2) Gaze-Controllable Face Generation: Generates face images from the face sketches using a conditional diffusion model. Introduces ToG, the first text-to-gaze dataset with over 90k descriptions, leveraging LLMs for accurate and diverse annotations. Generates more accurate gaze-controllable face images than baseline methods based on user study. Achieves comparable or better image quality (IS, FID, KID) compared to baseline text-to-image generation methods. Limited variability in low-precision descriptions within the ToG dataset. Reliance on pre-trained pose estimators for evaluation and comparison with baseline models. text-to-image generation, diffusion model, gaze-controllable, face generation, large language models
2404.17419 Report Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang Using images as prompts for 3D generation demonstrates particularly strong performance compared to using text prompts alone, for images provide a more intuitive guidance for the 3D generation process. In this work, we delve into the potential of using multiple image prompts, instead of a single image prompt, for 3D generation. Specifically, we build on ImageDream, a novel image-prompt multi-view diffusion model, to support multi-view images as the input prompt. Our method, dubbed MultiImageDream, reveals that transitioning from a single-image prompt to multiple-image prompts enhances the performance of multi-view and 3D object generation according to various quantitative evaluation metrics and qualitative assessments. This advancement is achieved without the necessity of fine-tuning the pre-trained ImageDream multi-view diffusion model. This paper introduces MultiImageDream, a novel approach for 3D object generation that leverages multiple image prompts to enhance the quality and consistency of generated 3D models. Existing image-to-3D generation methods, while promising, often struggle to maintain consistency in detail, texture, and lighting across different viewpoints. Using multiple image prompts can address these limitations by providing richer guidance during the generation process. The authors extend ImageDream, a state-of-the-art image-to-3D method, to support multiple image inputs. They achieve this by modifying the local and pixel controllers of ImageDream to handle multiple images, enabling the model to incorporate information from various viewpoints without requiring fine-tuning. Quantitative evaluation metrics, including IS and CLIP scores, demonstrate that MultiImageDream outperforms the baseline ImageDream in multi-view image generation. MultiImageDream also exhibits competitive performance in 3D generation, though the improvements are less pronounced. Qualitative assessments reveal that using multiple image prompts effectively reduces artifacts like excessive whitening and lack of detail in the generated 3D models, particularly at viewpoints not covered by the primary image prompt. The quantitative evaluation is limited by the small number (39) of prompts used, potentially impacting the generalizability of the findings. Future work could explore fine-tuning the model specifically for multi-image prompts and investigate methods to explicitly leverage the cross-view relationships between the input images. 3d generation, image-to-3d, multi-view diffusion, imagedream, multi-image prompts
2404.17364 Report MV-VTON: Multi-View Virtual Try-On with Diffusion Models Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, most existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results of a person from multiple views using the given clothes. On the one hand, given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. On the other hand, the diffusion models that have demonstrated superior abilities are adopted to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest a joint attention block to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset, i.e., Multi-View Garment (MVG), in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets. Codes and datasets will be publicly released at https://github.com/hywang2002/MV-VTON . Introduces MV-VTON, a novel task aiming to generate realistic multi-view dressed person images using frontal and back clothing views, and proposes a diffusion-based method with a view-adaptive selection mechanism and joint attention block to address it. Addresses limitations of existing virtual try-on methods that primarily focus on frontal views and struggle with inconsistent clothing-person poses, particularly in multi-view scenarios. Utilizes a diffusion model with a view-adaptive selection mechanism (hard-selection for global features and soft-selection for local features) based on person-clothing pose similarity, and a joint attention block to align and fuse global and local clothing features with person features for detail preservation. Achieves state-of-the-art performance on both multi-view (MVG dataset) and frontal-view (VITON-HD and DressCode datasets) virtual try-on tasks, quantitatively and qualitatively. Effectively handles inconsistencies between clothing and person poses in multi-view scenarios, resulting in more natural and realistic try-on results. Exhibits superior performance in preserving high-frequency clothing details, such as texts, patterns, and shapes, compared to existing methods. Struggles to fully preserve smaller or more complex clothing details, potentially due to information loss during inpainting in latent space. Limited to using two views (frontal and back) of clothing, which may not fully capture the complexity of certain garments. virtual try-on, multi-view, diffusion models, view-adaptive selection, joint attention
2404.17255 Report SDFD: Building a Versatile Synthetic Face Image Dataset with Diverse Attributes Georgia Baltsou, Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos AI systems rely on extensive training on large datasets to address various tasks. However, image-based systems, particularly those used for demographic attribute prediction, face significant challenges. Many current face image datasets primarily focus on demographic factors such as age, gender, and skin tone, overlooking other crucial facial attributes like hairstyle and accessories. This narrow focus limits the diversity of the data and consequently the robustness of AI systems trained on them. This work aims to address this limitation by proposing a methodology for generating synthetic face image datasets that capture a broader spectrum of facial diversity. Specifically, our approach integrates a systematic prompt formulation strategy, encompassing not only demographics and biometrics but also non-permanent traits like make-up, hairstyle, and accessories. These prompts guide a state-of-the-art text-to-image model in generating a comprehensive dataset of high-quality realistic images and can be used as an evaluation set in face analysis systems. Compared to existing datasets, our proposed dataset proves equally or more challenging in image classification tasks while being much smaller in size. This paper proposes a methodology for generating synthetic face image datasets that are more diverse and inclusive than existing ones, capturing a wider range of facial attributes. Existing face image datasets often lack diversity in facial attributes beyond basic demographics, limiting the robustness and fairness of AI systems trained on them. The methodology employs a systematic prompt formulation strategy for text-to-image generation, incorporating attributes like hairstyle, accessories, and facial expressions, and utilizes a denoising diffusion probabilistic model (Stable Diffusion 2.1) for generating high-quality realistic images. The generated dataset (SDFD) captures a wide variety of facial attributes despite its small size (1000 images). SDFD proves equally or more challenging in image classification tasks compared to larger datasets like FairFace and LFW. Visualization of the datasets reveals that SDFD exhibits good spatial dispersion, suggesting a higher degree of facial attribute variety. Certain attributes and their combinations were challenging to apply effectively during the image generation process, highlighting limitations in the training data of the generative model. Stereotypical representations emerged in some generated images, indicating the need for further investigation and mitigation of biases. synthetic data generation, face image datasets, diversity and inclusion, text-to-image synthesis, diffusion models
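The prompt formulation strategy above is essentially structured sampling over attribute pools. The snippet below shows one plausible way to assemble such prompts from demographic, biometric, and non-permanent attribute lists; the attribute values and the prompt template are illustrative placeholders, not the paper's exact vocabulary.

```python
import random

ATTRIBUTES = {                                   # illustrative pools, not the paper's lists
    "age": ["young adult", "middle-aged", "elderly"],
    "gender": ["man", "woman", "non-binary person"],
    "skin tone": ["light-skinned", "medium-skinned", "dark-skinned"],
    "hairstyle": ["with a buzz cut", "with long curly hair", "with braids", "bald"],
    "accessory": ["wearing glasses", "wearing a headscarf", "wearing earrings", ""],
    "expression": ["smiling", "with a neutral expression", "laughing"],
}

def make_prompt(rng: random.Random) -> str:
    a = {k: rng.choice(v) for k, v in ATTRIBUTES.items()}
    parts = [a["skin tone"], a["age"], a["gender"], a["hairstyle"], a["accessory"], a["expression"]]
    desc = " ".join(p for p in parts if p)       # skip empty attributes (e.g. no accessory)
    return f"A high-quality studio photograph of a {desc}, looking at the camera."

rng = random.Random(0)
prompts = [make_prompt(rng) for _ in range(5)]   # these prompts are fed to a text-to-image model
for p in prompts:
    print(p)
```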
2404.17254 Report Trinity Detector: text-assisted and attention mechanisms based spectral fusion for diffusion generation image detection Jiawei Song, Dengpan Ye, Yunming Zhang Artificial Intelligence Generated Content (AIGC) techniques, represented by text-to-image generation, have led to a malicious use of deep forgeries, raising concerns about the trustworthiness of multimedia content. Adapting traditional forgery detection methods to diffusion models proves challenging. Thus, this paper proposes a forgery detection method explicitly designed for diffusion models called Trinity Detector. Trinity Detector incorporates coarse-grained text features through a CLIP encoder, coherently integrating them with fine-grained artifacts in the pixel domain for comprehensive multimodal detection. To heighten sensitivity to diffusion-generated image features, a Multi-spectral Channel Attention Fusion Unit (MCAF) is designed, extracting spectral inconsistencies through adaptive fusion of diverse frequency bands and further integrating spatial co-occurrence of the two modalities. Extensive experimentation validates that our Trinity Detector method outperforms several state-of-the-art methods; our performance is competitive across all datasets, with up to a 17.6% improvement in transferability on the diffusion datasets. This paper proposes Trinity Detector, a novel method for detecting images generated by diffusion models by leveraging multi-spectral channel attention and integrating text-based and image-based features. The rise of diffusion models in AI-generated content (AIGC) necessitates new forgery detection methods specifically designed for this technology due to its unique characteristics compared to traditional generation techniques. The Trinity Detector uses a Multi-spectral Channel Attention Fusion Unit (MCAF) to analyze spectral inconsistencies in the frequency domain and combines it with text features extracted using a CLIP encoder, providing a comprehensive multimodal detection approach. Trinity Detector outperforms state-of-the-art detectors, especially on diffusion-generated images. The method shows strong generalization ability, effectively detecting forgeries from untrained diffusion models. Trinity Detector exhibits robust performance even with image perturbations like Gaussian blur and JPEG compression. The paper acknowledges the need for evaluating the method on a wider range of diffusion models beyond Stable Diffusion and GLIDE. Future work could explore alternative frequency domain analysis techniques or incorporate additional modalities for further performance improvement. diffusion models, forgery detection, deepfakes, multimodal learning, frequency domain analysis
2404.17230 Report ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion Ziyue Zhang, Mingbao Lin, Rongrong Ji We introduce ObjectAdd, a training-free diffusion modification method to add user-expected objects into a user-specified area. The motive of ObjectAdd stems from: first, describing everything in one prompt can be difficult, and second, users often need to add objects into the generated image. To accommodate real-world use, our ObjectAdd maintains accurate image consistency after adding objects with technical innovations in: (1) embedding-level concatenation to ensure that text embeddings coalesce correctly; (2) object-driven layout control with latent and attention injection to ensure objects access the user-specified area; (3) prompted image inpainting in an attention refocusing & object expansion fashion to ensure the rest of the image stays the same. With a text-prompted image, our ObjectAdd allows users to specify a box and an object, and achieves: (1) adding the object inside the box area; (2) keeping the exact content outside the box area; and (3) flawless fusion between the two areas. ObjectAdd is a training-free method that adds user-specified objects into pre-existing images generated by diffusion models while preserving the rest of the image content. Addresses limitations of text-to-image models that struggle to convey spatial relationships and necessitate tedious multi-step modifications to achieve desired results. Combines embedding-level concatenation for accurate text prompts, object-driven layout control with latent and attention injection for precise object placement, and prompted image inpainting with attention refocusing and object expansion for seamless integration and background consistency. Successfully adds objects into user-defined areas while maintaining image consistency. Outperforms existing methods like DALL-E 3, P2P, and SD-v1-4 qualitatively and quantitatively. Demonstrates versatility by accurately adding diverse objects and handling complex object-background interactions. Usability limitations for non-experts due to the method's complexity and hyperparameter tuning. Performance dependency on the pre-trained SD-v1-4 model, potentially limiting effectiveness in complex scenarios. diffusion model, training-free, text to image, image editing, object insertion
2404.16994 Report PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straight-forward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Our further investigation reveals that it is largely attributed to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. Notably, on the recent popular VideoChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at https://pllava.github.io/ This paper proposes Pooling LLaVA (PLLaVA), a simple yet effective method for adapting pre-trained image-language models to video understanding by introducing a pooling strategy to smooth feature distribution along the temporal dimension and reduce the impact of extreme features. Adapting existing image-language models for video understanding is crucial for efficient and resource-light model development, but directly fine-tuning these models on videos can lead to performance saturation and vulnerability to prompt changes. PLLaVA encodes video frames using an image encoder, applies an average pooling operation to the features along the spatial dimension, and feeds the pooled features to a pre-trained LLM with LoRA for adaptation. Post-training optimization is also used to merge the weights of the original image LLM and the video-trained LLM. PLLaVA achieves state-of-the-art performance on various video understanding benchmarks, including VideoQA and MVBench. PLLaVA exhibits strong performance in generating detailed video captions, outperforming previous methods in aspects like correctness of information, detail orientation, and context understanding. The paper provides analysis on the impact of pooling strategies and the influence of LoRA weight fusion, offering insights into adapting image-language models for video tasks. PLLaVA's performance on tasks requiring strong reasoning ability and imagination, such as counterfactual inference, can be further improved. Exploring the use of specialized video encoders and more advanced temporal modeling techniques could further enhance PLLaVA’s capabilities. video understanding, multimodal learning, large language models, pooling methods, vision-language models
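The adaptation above is little more than a pooling operator placed between the image encoder and the LLM. The sketch below applies adaptive average pooling over the spatial grid of per-frame visual tokens (keeping all frames) before flattening them back into a token sequence; the shapes and the pooled grid size are illustrative, not PLLaVA's exact configuration.

```python
import torch
import torch.nn.functional as F

def pool_video_tokens(frame_tokens: torch.Tensor, grid: int, pooled: int) -> torch.Tensor:
    """frame_tokens: (T, grid*grid, C) visual tokens from an image encoder, one set per frame.
    Returns (T * pooled * pooled, C) tokens after spatial adaptive average pooling."""
    T, N, C = frame_tokens.shape
    assert N == grid * grid
    x = frame_tokens.reshape(T, grid, grid, C).permute(3, 0, 1, 2).unsqueeze(0)  # (1, C, T, H, W)
    x = F.adaptive_avg_pool3d(x, output_size=(T, pooled, pooled))                # keep all frames
    x = x.squeeze(0).permute(1, 2, 3, 0).reshape(T * pooled * pooled, C)
    return x  # token sequence fed to the LLM (after the usual projection layer)

tokens = torch.randn(16, 24 * 24, 1024)        # e.g. 16 frames of 24x24 ViT patch features
pooled = pool_video_tokens(tokens, grid=24, pooled=12)
print(pooled.shape)                            # torch.Size([2304, 1024]) vs. 9216 tokens unpooled
```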
2404.16829 Report Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, Dahua Lin Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library. 2) Utilizing a combination of visual cues and hierarchical text prompts, GPT-4V precisely identifies and aligns materials with the corresponding components of 3D objects. 3) The correctly matched materials are then meticulously applied as reference for the new SVBRDF material generation according to the original albedo map, significantly enhancing their visual authenticity. Make-it-Real offers a streamlined integration into the 3D content creation workflow, showcasing its utility as an essential tool for developers of 3D assets. Make-it-Real is a novel framework that leverages Multimodal Large Language Models (MLLMs), specifically GPT-4V, to automatically assign and generate physically realistic materials for 3D objects with only albedo maps. Many existing 3D assets and generative models lack realistic material properties. Manual material assignment is tedious and time-consuming. Make-it-Real automates this process, enhancing realism and streamlining 3D content creation. The pipeline involves rendering and segmenting the 3D mesh, retrieving matching materials from a meticulously annotated material library using GPT-4V and hierarchical text prompts, and generating spatially varying BRDF maps (including roughness, metallic, specular, normal, displacement, height) by referencing the original albedo map. Make-it-Real enhances the realism of 3D assets, generating high-fidelity, photorealistic textures with diverse reflective effects under different lighting conditions. The framework ensures part-specific material matching, accurately identifying and applying different materials to various components of a 3D object. It generates comprehensive material maps compatible with downstream rendering engines, streamlining the integration of refined assets into existing workflows. The method currently lacks support for reverse transformation from shaded texture maps to albedo maps. The quality of the base 3D object significantly impacts the accuracy of material assignment, particularly when ground truth text descriptions are unavailable. 3d material generation, multimodal large language models, gpt-4v, texture synthesis, physically based rendering
2404.16821 Report How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL. InternVL 1.5, an open-source multimodal large language model (MLLM), is introduced to bridge the capability gap between open-source and proprietary models in multimodal understanding. There is a noticeable divide between the capabilities of open-source and proprietary commercial MLLMs, particularly in parameter scale, image resolution handling, and multilingual capabilities. The paper introduces three primary improvements: (1) Continuous learning for a large-scale vision foundation model (InternViT-6B) to boost visual understanding. (2) Dynamic high-resolution strategy using image tiling (up to 4K) for detailed scene and document understanding. (3) Creation of a high-quality bilingual dataset covering diverse scenes, documents, and conversations in English and Chinese. InternVL 1.5 achieves state-of-the-art results in 8 out of 18 multimodal benchmarks, surpassing some leading proprietary models. The model shows competitive performance in OCR-related tasks, exceeding proprietary models on benchmarks like ChartQA and OCRBench. InternVL 1.5 demonstrates strong bilingual proficiency, particularly excelling in Chinese-related tasks compared to other open-source and proprietary models. Despite improvements, InternVL 1.5 still lags behind top proprietary models in multi-turn conversations, suggesting a direction for future research. The model's performance on certain tasks slightly declined compared to its predecessor due to the smaller language model used. multimodal large language model, vision-language understanding, open-source, dynamic high-resolution, bilingual
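The dynamic high-resolution idea (1 to 40 tiles of 448x448 chosen from the input's shape) can be approximated with a short preprocessing sketch. This is a simplification under stated assumptions: it picks the tile grid by aspect ratio alone, whereas the released pipeline also factors in the input resolution and adds a thumbnail view.
```python
from PIL import Image

def dynamic_tiles(img, tile=448, max_tiles=40):
    """Split an image into 448x448 tiles on the grid closest to its aspect ratio."""
    w, h = img.size
    aspect = w / h
    grids = [(c, r) for c in range(1, max_tiles + 1)
                    for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))
    resized = img.resize((cols * tile, rows * tile))
    return [resized.crop((x * tile, y * tile, (x + 1) * tile, (y + 1) * tile))
            for y in range(rows) for x in range(cols)]

# usage: tiles = dynamic_tiles(Image.open("page.png"))
# a 4:3 input lands on a 4x3 grid -> 12 tiles of 448x448 under this simplification
```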
2404.16771 Report ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive strategy for ID preservation by fully considering intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets, such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that our ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods in the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation. This paper introduces ConsistentID, a novel method for generating high-fidelity, diverse, and identity-preserving portraits using a single reference image and multimodal fine-grained prompts. Existing methods for personalized portrait generation struggle to maintain accurate identity consistency and high-fidelity details, particularly in fine-grained facial features. ConsistentID leverages a multimodal facial prompt generator to combine facial features, descriptions, and overall context. It also utilizes an ID-preservation network with facial attention localization to ensure consistent identity across facial regions. Additionally, a new fine-grained dataset (FGID) is introduced for training and evaluation. ConsistentID outperforms state-of-the-art methods in identity consistency, diversity, and fidelity, as demonstrated by both quantitative metrics and qualitative comparisons. The proposed facial attention localization strategy effectively prevents the blending of identities between facial regions, leading to improved ID preservation in generated images. The introduction of the FGID dataset and a new fine-grained identity consistency metric provide a valuable resource for advancing research in facial generation. The use of MLLM in ConsistentID may introduce limitations in handling pose and expression variations. Further research is needed to address potential ethical concerns related to privacy and misinformation. portrait generation, identity preservation, multimodal learning, fine-grained control, diffusion models

2404.16752 Report TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Yao Feng, Michael J. Black We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de. Introduces TokenHMR, a novel 3D human pose and shape regression method that leverages a token-based pose representation and a new loss function, TALS, to improve 3D accuracy. Addresses the trade-off between 2D and 3D accuracy in current HPS regression methods caused by approximate camera models, leading to biased pose estimations. Combines a tokenized pose representation using VQ-VAE to learn a prior over valid poses and introduces TALS, a loss function that reduces reliance on noisy 2D and pseudo-ground truth data. Achieves state-of-the-art 3D accuracy on EMDB and 3DPW datasets. Demonstrates robustness to image truncation and ambiguous poses. Shows that tokenization leads to more accurate and robust pose estimations compared to continuous regression. 2D alignment can be inaccurate under severe perspective distortion due to the use of a weak-perspective camera model. Global orientation estimation can be ambiguous in cases where body cues are limited. human pose and shape estimation, 3d human pose, tokenization, vq-vae, camera bias
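The Threshold-Adaptive Loss Scaling idea, penalizing gross 2D/p-GT errors while largely ignoring residuals already inside the "invalid distance" band, admits a compact sketch. The functional form, threshold value, and down-weighting factor below are assumptions for illustration; the paper's exact formulation may differ.
```python
import torch

def tals_loss(residuals, threshold=25.0, small_weight=0.05):
    """Down-weight residuals inside the 'invalid distance' band, penalize
    gross errors at full weight (illustrative functional form)."""
    err = residuals.abs()
    weight = torch.where(err > threshold,
                         torch.ones_like(err),
                         torch.full_like(err, small_weight))
    return (weight * err).mean()

# toy usage: per-keypoint 2D reprojection residuals in pixels
print(tals_loss(torch.tensor([1.5, 3.0, 40.0, 120.0])))
```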
2404.16748 Report TELA: Text to Layer-wise 3D Clothed Human Generation Junting Dong, Qi Fang, Zehuan Huang, Xudong Xu, Jingbo Wang, Sida Peng, Bo Dai This paper addresses the task of 3D clothed human generation from textural descriptions. Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes them struggle for clothing editing and meanwhile lose fine-grained control over the whole generation process. To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothing-disentangled 3D human models while providing control capacity for the generation process. The basic idea is progressively generating a minimal-clothed human body and layer-wise clothes. During clothing generation, a novel stratified compositional rendering method is proposed to fuse multi-layer human models, and a new loss function is utilized to help decouple the clothing model from the human body. The proposed method achieves high-quality disentanglement, which thereby provides an effective way for 3D garment generation. Extensive experiments demonstrate that our approach achieves state-of-the-art 3D clothed human generation while also supporting cloth editing applications such as virtual try-on. Project page: http://jtdong.com/tela_layer/ This paper proposes a novel layer-wise representation and a progressive optimization strategy for generating 3D clothed humans from textual descriptions. Previous methods struggle with clothing editing and lack fine-grained control due to their holistic approach to modeling the body and clothing. The proposed method progressively generates a minimally-clothed body followed by layer-wise clothes using a stratified compositional rendering method for fusion and a new loss function to decouple clothing from the body. Achieves state-of-the-art 3D clothed human generation from text. Provides high-quality disentanglement between clothing and body. Enables cloth editing applications such as virtual try-on. Specific details about the novel loss function and its effectiveness are absent. Quantitative evaluation of the disentanglement quality compared to previous works is missing. text-to-3d generation, clothed human generation, layer-wise representation, progressive optimization, virtual try-on
2404.16687 Report NTIRE 2024 Quality Assessment of AI-Generated Content Challenge Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte, Fei Peng, Huiyuan Fu, Anlong Ming, Chuanming Wang, Huadong Ma, Shuai He, Zifei Dou, Shu Chen, Huacong Zhang, Haiyi Xie, Chengwei Wang, Baoying Chen, Jishen Zeng, Jianquan Yang, Weigang Wang, Xi Fang, Xiaoxin Lv, Jun Yan, Tianwu Zhi, Yabin Zhang, Yaohui Li, Yang Li, Jingwen Xu, Jianzhao Liu, Yiting Liao, Junlin Li, Zihao Yu, Yiting Lu, Xin Li, Hossein Motamednia, S. Farhad Hosseini-Benvidi, Fengbin Guan, Ahmad Mahmoudi-Aznaveh, Azadeh Mansouri, Ganzorig Gankhuyag, Kihwan Yoon, Yifang Xu, Haotian Fan, Fangyuan Kong, Shiling Zhao, Weifeng Dong, Haibing Yin, Li Zhu, Zhiling Wang, Bingchen Huang, Avinab Saha, Sandeep Mishra, Shashank Gupta, Rajesh Sureddi, Oindrila Saha, Luigi Celona, Simone Bianco, Paolo Napoletano, Raimondo Schettini, Junfeng Yang, Jing Fu, Wei Zhang, Wenzhi Cao, Limei Liu, Han Peng, Weijun Yuan, Zhan Li, Yihang Cheng, Yifan Deng, Haohui Li, Bowen Qu, Yao Li, Shuqing Luo, Shunzhou Wang, Wei Gao, Zihao Lu, Marcos V. Conde, Xinrui Wang, Zhibo Chen, Ruling Liao, Yan Ye, Qiulin Wang, Bing Li, Zhaokun Zhou, Miao Geng, Rui Chen, Xin Tao, Xiaoyu Liang, Shangkun Sun, Xingyuan Ma, Jiaze Li, Mengduo Yang, Haoran Xu, Jie Zhou, Shiding Zhu, Bohan Yu, Pengfei Chen, Xinrui Xu, Jiabin Shen, Zhichao Duan, Erfan Asadi, Jiahe Liu, Qi Yan, Youran Qu, Xiaohui Zeng, Lele Wang, Renjie Liao This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into the image track and the video track. The image track uses the AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track has a total of 318 registered participants. A total of 1,646 submissions are received in the development phase, and 221 submissions are received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants have registered in the video track. A total of 991 submissions are received in the development phase, and 185 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on AIGC. This paper summarizes the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, focusing on developing objective IQA and VQA methods for AI-generated images and videos. The challenge addresses the critical need for accurate quality assessment of AI-generated content, a growing presence in daily life, to guide the improvement of generative models and enhance user experience. 
The challenge, divided into image and video tracks, utilized AIGIQA-20K and T2VQA-DB datasets, respectively, with participants tasked to predict perceptual quality scores of AI-generated images/videos based on training data and corresponding MOSs. The evaluation used SRCC and PLCC to measure prediction accuracy. The challenge attracted 514 participants and yielded 28 valid submissions. Most submitted methods outperformed baseline I/VQA models, indicating progress in aligning objective metrics with human perception. Top-performing methods demonstrated strong correlation between predicted scores and MOSs, emphasizing their potential in guiding the generation of higher-quality content. Limited number of AIGV datasets compared to AIGI datasets. Future research could explore more sophisticated methods for multi-dimensional quality assessment of AIGC. ai-generated content, image quality assessment, video quality assessment, generative models, perceptual quality
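For readers unfamiliar with the evaluation protocol, the sketch below shows how SRCC and PLCC are typically computed against Mean Opinion Scores with SciPy. Averaging the two into a single ranking score is a common challenge convention and is an assumption here, as is the omission of the logistic remapping often applied before PLCC.
```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_quality_model(predicted, mos):
    """Rank and linear correlation of predicted scores against human MOS."""
    predicted, mos = np.asarray(predicted, float), np.asarray(mos, float)
    srcc = spearmanr(predicted, mos)[0]   # monotonic (rank) agreement
    plcc = pearsonr(predicted, mos)[0]    # linear agreement
    return {"SRCC": srcc, "PLCC": plcc, "main": (srcc + plcc) / 2}

# toy example with five AI-generated images
print(evaluate_quality_model([3.1, 4.2, 2.0, 4.8, 3.6],
                             [3.0, 4.5, 2.2, 4.9, 3.3]))
```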
2404.16612 Report MuseumMaker: Continual Style Customization without Catastrophic Forgetting Chenxi Liu, Gan Sun, Wenqi Liang, Jiahua Dong, Can Qin, Yang Cong Pre-trained large text-to-image (T2I) models with an appropriate text prompt have attracted growing interest in the customized image generation field. However, the catastrophic forgetting issue makes it hard to continually synthesize new user-provided styles while retaining satisfying results for previously learned styles. In this paper, we propose MuseumMaker, a method that enables the synthesis of images following a set of customized styles in a never-ending manner, gradually accumulating these creative artistic works as a Museum. When facing a new customization style, we develop a style distillation loss module to extract and learn the styles of the training data for new image generation. It minimizes the learning biases caused by the content of new training images and addresses the catastrophic overfitting issue induced by few-shot images. To deal with catastrophic forgetting among previously learned styles, we devise a dual regularization for the shared-LoRA module to optimize the direction of the model update, regularizing the diffusion model from both the weight and feature aspects. Meanwhile, to further preserve historical knowledge from past styles and address the limited representability of LoRA, we consider a task-wise token learning module where a unique token embedding is learned to denote a new style. As new user-provided styles come, our MuseumMaker captures the nuances of the new styles while maintaining the details of learned styles. Experimental results on diverse style datasets validate the effectiveness of our proposed MuseumMaker method, showcasing its robustness and versatility across various scenarios. This paper presents MuseumMaker, a novel approach for continual style customization in text-to-image diffusion models that addresses catastrophic forgetting and overfitting. Enabling diffusion models to continuously learn new styles from user-provided images without forgetting previously learned ones is crucial for personalized and evolving image generation. MuseumMaker employs a style distillation loss to extract pure style representations, a dual regularization for shared-LoRA to preserve style knowledge during optimization, and task-wise token learning to capture distinct features of each style. MuseumMaker demonstrates superior performance compared to existing methods, showing significant improvements in style loss, FID, and CLIP score. The ablation studies highlight the effectiveness of each proposed module in mitigating catastrophic forgetting and overfitting. MuseumMaker proves to be efficient with minimal training parameters and competitive training time while achieving near-upper-bound performance. The current implementation focuses on a limited number of styles. Exploring more sophisticated techniques for knowledge distillation and preservation in continual learning settings could further enhance the model's capabilities. text-to-image generation, style customization, continual learning, diffusion models, catastrophic forgetting
2404.16510 Report Interactive3D: Create What You Want by Interactive 3D Generation Shaocong Dong, Lihe Ding, Zhanpeng Huang, Zibin Wang, Tianfan Xue, Dan Xu 3D object generation has undergone significant advancements, yielding high-quality results. However, existing methods fall short of achieving precise user control, often yielding results that do not align with user expectations, thus limiting their applicability. User-envisioned 3D object generation faces significant challenges in realizing its concepts using current generative models due to limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability, or (ii) reconstructing 3D objects from 2D images. Both of them limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process, restricting the scope for direct and versatile 3D modifications. In this work, we introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages, utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction, allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components, (ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) Semantic Editing. Subsequently, the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at https://interactive-3d.github.io/. Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Current 3D object generation methods lack precise user control, relying on text prompts or 2D images that limit controllability and quality. A two-stage process leveraging Gaussian Splatting for direct user interaction (adding/removing parts, dragging, transformations, semantic editing) followed by conversion to InstantNGP for refinement with an Interactive Hash Refinement module. Achieves high-quality and controllable 3D generation with demonstrated examples like modifying a human pose and creating a dragon. Outperforms state-of-the-art methods in CLIP R-Precision, indicating improved controllability. Enables efficient 3D generation due to fast Gaussian Splatting initialization and integration of interactions within the optimization process. Susceptible to failure under excessive or unreasonable user manipulation. Inherits common challenges of current generative techniques, including color saturation issues. 3d generation, interactive design, gaussian splatting, instantngp, user controllability
2404.16375 Report List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance from GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric orders of tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even in a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens the object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at \url{https://github.com/zzxslp/SoM-LLaVA}. This paper introduces "list items one by one," a novel learning paradigm and dataset for enhancing multimodal large language models (MLLMs) with Set-of-Mark (SoM) prompting capabilities. SoM prompting enables MLLMs to ground visual objects to tags on images, facilitating tasks like GUI navigation and robot interaction. However, this ability is predominantly observed in GPT-4V, limiting its wider adoption. The authors curate a dataset using Semantic-SAM to tag objects in MS-COCO images. GPT-4V then generates descriptions for these tags, training MLLMs to enumerate tagged items in alphanumeric order. With limited data (10k samples), MLLMs significantly improve in tag listing accuracy, even surpassing zero-shot GPT-4V. Fine-tuned SoM MLLMs demonstrate enhanced performance on five MLLM benchmarks (POPE, MME, SEED-Bench, LLaVA-Bench, MM-Vet), indicating improved visual reasoning. Surprisingly, models trained with SoM data exhibit superior performance even without tags during inference, highlighting the paradigm's potential for general MLLM training. The study primarily focuses on MS-COCO images, potentially limiting generalization to other datasets. Future work can explore alternative tagging methods and data sources to further enhance SoM prompting. multimodal large language models, set-of-mark prompting, visual grounding, visual reasoning, instruction tuning
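The "list items one by one" training target is easy to illustrate. The helper below is a hypothetical reconstruction of how such a sample could be assembled from tag-to-description pairs; the prompt wording and output format are assumptions, not the released dataset schema.
```python
def build_som_listing_sample(tagged_objects):
    """Build one 'list items one by one' training pair from tag -> description."""
    prompt = ("Please list each tagged item in the image one by one, "
              "following the order of the alphanumeric tags.")
    answer = " ".join(f"{tag}: {desc}." for tag, desc in sorted(tagged_objects.items()))
    return {"prompt": prompt, "answer": answer}

# toy example
sample = build_som_listing_sample({2: "a black dog", 1: "a red frisbee", 3: "a park bench"})
print(sample["answer"])   # 1: a red frisbee. 2: a black dog. 3: a park bench.
```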
2404.16323 Report DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction Jiamin Wu, Kenkun Liu, Han Gao, Xiaoke Jiang, Lei Zhang In this paper, we study the problem of 3D reconstruction from a single-view RGB image and propose a novel approach called DIG3D for 3D object reconstruction and novel view synthesis. Our method utilizes an encoder-decoder framework which generates 3D Gaussians in decoder with the guidance of depth-aware image features from encoder. In particular, we introduce the use of deformable transformer, allowing efficient and effective decoding through 3D reference point and multi-layer refinement adaptations. By harnessing the benefits of 3D Gaussians, our approach offers an efficient and accurate solution for 3D reconstruction from single-view images. We evaluate our method on the ShapeNet SRN dataset, getting PSNR of 24.21 and 24.98 in car and chair dataset, respectively. The result outperforming the recent method by around 2.25%, demonstrating the effectiveness of our method in achieving superior results. DIG3D, a novel encoder-decoder framework leveraging deformable transformers and 3D Gaussian splatting for efficient single-view 3D object reconstruction and novel view synthesis. Addresses limitations of previous methods, such as incorrect geometry and reliance on shortcuts, while maintaining fast rendering speed. Combines pixel-aligned features from a UNet and depth-aware features from a pretrained DINOv2 model. Uses a deformable transformer decoder with 3D reference points and multi-layer refinement to predict 3D Gaussian parameters. Outperforms Splatter Image on the ShapeNet SRN dataset, particularly for views far from the input view. Achieves high rendering quality, accurately capturing occlusions and producing realistic renderings. Reconstructs meaningful 3D structures, evidenced by the visualization of Gaussian centers as point clouds. Training time is longer compared to Splatter Image. Further improvements in geometry reconstruction are possible. 3d reconstruction, novel view synthesis, single-view reconstruction, deformable transformer, 3d gaussian splatting
2404.16306 Report TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation. This paper introduces TI2V-Zero, a zero-shot, tuning-free method that enables pretrained text-to-video (T2V) diffusion models to be conditioned on a provided image, facilitating text-conditioned image-to-video (TI2V) generation without any optimization or fine-tuning. Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning, limiting their flexibility and generalizability. TI2V-Zero overcomes these limitations by leveraging pretrained T2V models, making it efficient and widely applicable. The approach leverages a pretrained T2V diffusion model and introduces a "repeat-and-slide" strategy to guide video generation. It modulates the reverse denoising process, allowing the model to synthesize video frame-by-frame from the provided image. A DDPM inversion strategy initializes Gaussian noise for temporal consistency, and a resampling technique preserves visual details. TI2V-Zero consistently outperforms a recent open-domain TI2V model in experiments on domain-specific (MUG, UCF-101) and open-domain datasets. The method effectively preserves identity and background details, resulting in more visually pleasing and temporally coherent videos. TI2V-Zero demonstrates its versatility by extending to other video-related tasks, including video infilling, prediction, and long video generation. The generation quality is limited by the capabilities of the pretrained T2V model. The generation speed is slow due to the need to run the entire diffusion process for each frame, and the generated video might have blurriness or flickering artifacts. text-to-video generation, image-to-video generation, diffusion models, zero-shot learning, video generation
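The repeat-and-slide loop can be sketched structurally. In the sketch, `denoise_window` is a hypothetical stand-in for the frozen T2V model's reverse diffusion over a fixed-length latent clip with the leading slots clamped to known frames; the DDPM-inversion noise initialization and resampling steps are omitted here.
```python
import torch

def repeat_and_slide(first_latent, denoise_window, num_frames, window=16):
    """Frame-by-frame generation with a sliding window of known latents (sketch)."""
    frames = [first_latent]                       # latent of the provided image, (C, H, W)
    while len(frames) < num_frames:
        # fill the window with the most recent frames, repeating the first
        # known frame when fewer than `window - 1` frames exist yet
        context = frames[-(window - 1):]
        context = [context[0]] * (window - 1 - len(context)) + context
        clip = torch.stack(context + [torch.randn_like(first_latent)])
        clip = denoise_window(clip, known=window - 1)   # synthesize only the last slot
        frames.append(clip[-1])                         # slide: keep just the new frame
    return torch.stack(frames)

# shape check with a no-op stand-in "denoiser"
video = repeat_and_slide(torch.randn(4, 32, 32), lambda clip, known: clip, num_frames=5)
print(video.shape)                                # torch.Size([5, 4, 32, 32])
```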
2404.16221 Report NeRF-XL: Scaling NeRFs with Multiple GPUs Ruilong Li, Sanja Fidler, Angjoo Kanazawa, Francis Williams We present NeRF-XL, a principled method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs, thus enabling the training and rendering of NeRFs with an arbitrarily large capacity. We begin by revisiting existing multi-GPU approaches, which decompose large scenes into multiple independently trained NeRFs, and identify several fundamental issues with these methods that hinder improvements in reconstruction quality as additional computational resources (GPUs) are used in training. NeRF-XL remedies these issues and enables the training and rendering of NeRFs with an arbitrary number of parameters by simply using more hardware. At the core of our method lies a novel distributed training and rendering formulation, which is mathematically equivalent to the classic single-GPU case and minimizes communication between GPUs. By unlocking NeRFs with arbitrarily large parameter counts, our approach is the first to reveal multi-GPU scaling laws for NeRFs, showing improvements in reconstruction quality with larger parameter counts and speed improvements with more GPUs. We demonstrate the effectiveness of NeRF-XL on a wide variety of datasets, including the largest open-source dataset to date, MatrixCity, containing 258K images covering a 25km^2 city area. Presents NerfXL, a method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs to enable training and rendering of NeRFs with arbitrarily large capacity. Existing multi-GPU approaches for NeRFs suffer from redundancy and reduced visual quality as the number of GPUs increases, limiting their ability to handle large-scale, high-detail scenes. NerfXL partitions 3D space into non-overlapping tiles, assigns a NeRF to each tile, and jointly trains them across GPUs, minimizing communication overhead through a novel distributed training and rendering formulation. NerfXL achieves significant improvements in visual quality (PSNR) and rendering speed with more GPUs compared to existing independent training approaches. The method effectively handles large-scale captures (up to 25km²), demonstrating robust scalability. NerfXL enables exploring larger model capacities for NeRFs, which proves more beneficial than simply increasing the training batch size (as in PyTorch DDP). Multi-GPU synchronization, while minimized, remains a bottleneck for training and rendering speed. While theoretically agnostic to NeRF representation, the method has only been tested with Instant-NGP and could be explored with other representations. neural radiance fields, nerf, multi-gpu, distributed training, novel view synthesis
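The claim that tile-wise rendering is mathematically equivalent to the single-GPU case rests on a standard compositing identity: if each GPU returns its tile's pre-multiplied colour and accumulated opacity along a ray, front-to-back compositing of the partial results reproduces the full volume render. The sketch below illustrates only that identity; names and tensor shapes are assumptions, and the actual distributed implementation differs.
```python
import torch

def composite_segments(seg_colors, seg_alphas):
    """Front-to-back compositing of per-tile partial renders along each ray.

    seg_colors: (K, R, 3) pre-multiplied colour of K ordered tile segments for R rays.
    seg_alphas: (K, R)    accumulated opacity of each tile segment."""
    color = torch.zeros_like(seg_colors[0])
    transmittance = torch.ones_like(seg_alphas[0])
    for c, a in zip(seg_colors, seg_alphas):
        color = color + transmittance[:, None] * c
        transmittance = transmittance * (1 - a)
    return color

# toy check: two tiles along four rays
print(composite_segments(torch.rand(2, 4, 3), torch.rand(2, 4)).shape)   # torch.Size([4, 3])
```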
2404.16030 Report MoDE: CLIP Data Experts via Clustering Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less ($<$35\%) training cost. Meanwhile, MoDE can train all data expert asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode. Introduces Mixture of Data Experts (MoDE), a system of CLIP data experts learned via clustering to improve contrastive language-image pretraining by mitigating noise in web-crawled image-caption pairs. Noise in web-crawled data negatively impacts CLIP's performance, and scaling CLIP on large datasets presents training efficiency and computational challenges. The method clusters data to train specialized data experts, each focusing on a subset of data with coherent semantics. Inference involves ensembling expert outputs based on task metadata and cluster relevance. Significantly outperforms OpenCLIP and OpenAI CLIP on benchmarks. Reduces training cost to less than 35% compared to baselines. Enables asynchronous training of data experts and flexible inclusion of new experts. The number of clusters must balance semantic coherence with computational feasibility. Future work includes adapting MoDE for generative models. contrastive learning, image-language pretraining, data clustering, noise reduction, ensemble learning
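A rough sketch of the two MoDE ingredients, clustering the corpus so each expert trains on one slice and weighting experts at inference by the similarity between task metadata and cluster centres, is shown below. The embedding dimensions, KMeans settings, and softmax temperature are placeholder assumptions, and the expert training itself is omitted.
```python
import numpy as np
from sklearn.cluster import KMeans

# stand-in caption embeddings for a noisy web corpus (dimensions are placeholders)
caption_embs = np.random.randn(2000, 64).astype(np.float32)

# coarse clustering; each cluster would train its own CLIP data expert
centers = KMeans(n_clusters=4, n_init=10, random_state=0).fit(caption_embs).cluster_centers_

def expert_weights(task_meta_emb, centers, temperature=0.1):
    """Softmax over cosine similarity between task metadata and cluster centres."""
    sims = centers @ task_meta_emb / (
        np.linalg.norm(centers, axis=1) * np.linalg.norm(task_meta_emb))
    w = np.exp(sims / temperature)
    return w / w.sum()

# toy check: ensembling weights over the 4 experts for one downstream task
print(expert_weights(np.random.randn(64).astype(np.float32), centers))
```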
2404.16029 Report Editable Image Elements for Controllable Synthesis Jiteng Mu, Michaël Gharbi, Richard Zhang, Eli Shechtman, Nuno Vasconcelos, Xiaolong Wang, Taesung Park Diffusion models have made significant advances in text-guided synthesis tasks. However, editing user-provided images remains challenging, as the high dimensional noise input space of diffusion models is not naturally suited for image inversion or spatial editing. In this work, we propose an image representation that promotes spatial editing of input images using a diffusion model. Concretely, we learn to encode an input into "image elements" that can faithfully reconstruct an input image. These elements can be intuitively edited by a user, and are decoded by a diffusion model into realistic images. We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition. Project page: https://jitengmu.github.io/Editable_Image_Elements/ This paper proposes "editable image elements," a novel image representation for controllable synthesis with diffusion models, allowing intuitive spatial editing of user-provided images. Existing diffusion models struggle with image editing as their noise-based input space isn't designed for spatial manipulations. This work addresses this limitation by providing an intuitive and effective way to edit images within the diffusion framework. The method encodes an input image into semantically meaningful "image elements" (superpixels) with learnable embeddings and editable spatial properties (position, size). A diffusion model, trained with element dropout for robustness, decodes edited elements into realistic images. The approach enables a range of edits: object resizing, rearrangement, dragging, de-occlusion, removal, variation, and composition. The method outperforms baselines like Self-Guidance, Paint-by-Example, and InstructPix2Pix in user studies, demonstrating superior quality and edit fidelity. Ablation studies confirm the importance of staged training, content encoder freezing, and random partition dropout during training for optimal performance. Editing high-resolution images remains challenging due to reconstruction quality limitations. Exploring methods to edit the appearance of image elements beyond spatial manipulations is left for future work. image editing, disentangled representation, diffusion models, controllable synthesis, image elements
2404.16022 Report PuLID: Pure and Lightning ID Customization via Contrastive Alignment Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Qian He We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation. By incorporating a Lightning T2I branch with a standard diffusion one, PuLID introduces both contrastive alignment loss and accurate ID loss, minimizing disruption to the original model and ensuring high ID fidelity. Experiments show that PuLID achieves superior performance in both ID fidelity and editability. Another attractive property of PuLID is that the image elements (e.g., background, lighting, composition, and style) before and after the ID insertion are kept as consistent as possible. Codes and models will be available at https://github.com/ToTheBeginning/PuLID Proposes PuLID, a tuning-free identity (ID) customization method for text-to-image generation that maintains high ID fidelity while minimizing interference with the original model's behavior. Existing tuning-free ID customization methods struggle to achieve high ID fidelity without disrupting the original model's ability to follow prompts and maintain stylistic consistency. Introduces a Lightning T2I branch alongside the standard diffusion training branch. Employs contrastive alignment loss between images generated with and without ID insertion to minimize disruption. Leverages fast sampling to generate high-quality images for accurate ID loss calculation. Achieves superior ID fidelity compared to state-of-the-art methods like IPAdapter and InstantID. Demonstrates better preservation of original image elements (background, lighting, composition, style) compared to existing methods. Maintains respectable prompt editing capabilities for modifying ID attributes, orientations, and accessories. The prompt list used for contrastive alignment, while effective, could be further optimized. Exploring more advanced alignment techniques beyond semantic and layout alignment could lead to further improvements. text-to-image generation, identity customization, diffusion models, contrastive learning, fast sampling
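A loose sketch of the contrastive-alignment idea: compare intermediate features produced with and without ID injection for the same prompt and noise, and penalize the difference so the insertion does not disturb layout or style. The specific semantic/layout terms below are illustrative assumptions, not the paper's exact losses.
```python
import torch
import torch.nn.functional as F

def alignment_loss(feats_no_id, feats_with_id):
    """Penalize how much ID injection perturbs the T2I branch's features.

    Both inputs: (B, N, C) token features gathered from the denoising network
    for the same prompt and noise, without and with the ID embedding."""
    # semantic term: keep the global (token-averaged) response direction
    sem = 1 - F.cosine_similarity(feats_no_id.mean(1), feats_with_id.mean(1), dim=-1).mean()
    # layout term: keep per-token responses close to the ID-free branch
    lay = F.mse_loss(feats_with_id, feats_no_id.detach())
    return sem + lay

print(alignment_loss(torch.randn(2, 77, 768), torch.randn(2, 77, 768)))
```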
2404.15956 Report A Survey on Visual Mamba Hanwei Zhang, Ying Zhu, Dan Wang, Lijun Zhang, Tianxiang Chen, Zi Ye State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling. Since the self-attention mechanism in transformers has quadratic complexity with image size and increasing computational demands, the researchers are now exploring how to adapt Mamba for computer vision tasks. This paper is the first comprehensive survey aiming to provide an in-depth analysis of Mamba models in the field of computer vision. It begins by exploring the foundational concepts contributing to Mamba's success, including the state space model framework, selection mechanisms, and hardware-aware design. Next, we review these vision mamba models by categorizing them into foundational ones and enhancing them with techniques such as convolution, recurrence, and attention to improve their sophistication. We further delve into the widespread applications of Mamba in vision tasks, which include their use as a backbone in various levels of vision processing. This encompasses general visual tasks, Medical visual tasks (e.g., 2D / 3D segmentation, classification, and image registration, etc.), and Remote Sensing visual tasks. We specially introduce general visual tasks from two levels: High/Mid-level vision (e.g., Object detection, Segmentation, Video classification, etc.) and Low-level vision (e.g., Image super-resolution, Image restoration, Visual generation, etc.). We hope this endeavor will spark additional interest within the community to address current challenges and further apply Mamba models in computer vision. This paper presents the first comprehensive survey of Mamba models in computer vision, examining their foundational concepts, architectural enhancements, and diverse applications across various vision tasks. Mamba models offer a promising alternative to Transformers in computer vision due to their ability to capture long-range dependencies with linear complexity, leading to improved efficiency and performance, especially for high-resolution image processing. The paper reviews different Mamba block designs like ViM and VSS and analyzes their integration with techniques like convolution, recurrence, and attention. It categorizes existing works based on their applications in general vision (high/mid and low-level), medical imaging, and remote sensing. Mamba-based models demonstrate competitive performance compared to Transformers in various vision tasks such as image classification, object detection, segmentation, restoration, and generation. Different scanning mechanisms are crucial for extending Mamba to multi-dimensional visual data, and the choice of mechanism depends on the specific task and data characteristics. Combining Mamba with other architectures like convolution, recurrence, and attention further enhances its capabilities and performance in specific applications like medical image segmentation and video understanding. Most Mamba models are still in their early stages, lacking extensive pre-training on large-scale datasets like ImageNet, which limits their generalization ability. Future work should focus on exploring more efficient scanning mechanisms, developing pre-trained Mamba models, and enhancing their interpretability and robustness for real-world deployment. mamba, computer vision, state space model, visual mamba, deep learning
2404.15955 Report Beyond Deepfake Images: Detecting AI-Generated Videos Danial Samadi Vahdati, Tai D. Nguyen, Aref Azizpour, Matthew C. Stamm Recent advances in generative AI have led to the development of techniques to generate visually realistic synthetic video. While a number of techniques have been developed to detect AI-generated synthetic images, in this paper we show that synthetic image detectors are unable to detect synthetic videos. We demonstrate that this is because synthetic video generators introduce substantially different traces than those left by image generators. Despite this, we show that synthetic video traces can be learned, and used to perform reliable synthetic video detection or generator source attribution even after H.264 re-compression. Furthermore, we demonstrate that while detecting videos from new generators through zero-shot transferability is challenging, accurate detection of videos from a new generator can be achieved through few-shot learning. This paper investigates the effectiveness of synthetic image detectors in detecting synthetic videos, revealing that image detectors perform poorly on videos due to the distinct traces left by video generators. The emergence of realistic synthetic videos generated by AI poses a significant threat of misinformation and disinformation. The authors evaluate various synthetic image detectors on a dataset of real and synthetic videos. They analyze the low-level forensic traces left by both image and video generators and investigate the impact of H.264 compression and robust training. Synthetic image detectors fail to reliably detect AI-generated videos, even with robust training against H.264 compression. Synthetic video generators leave unique traces that differ significantly from those found in synthetic images. Training detectors specifically on synthetic video traces enables reliable detection and source attribution, even after H.264 re-compression. The study primarily focuses on a limited set of publicly available video generators. Future work should explore the generalization of detectors to entirely new and unseen generation techniques. synthetic video detection, generative ai, misinformation detection, forensic traces, few-shot learning
2404.15909 Report Learning Long-form Video Prior via Generative Pre-Training Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou Concepts involved in long-form videos such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and continue to pose challenges to be comprehensively learned. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling any kind of text content even visual locations. Can this manner work for learning long-form video prior? Instead of operating on pixel space, it is efficient to employ visual locations like bounding boxes and keypoints to represent key information in videos, which can be simply discretized and then tokenized for consumption by GPT. Due to the scarcity of suitable data, we create a new dataset called \textbf{Storyboard20K} from movies to serve as a representative. It includes synopses, shot-by-shot keyframes, and fine-grained annotations of film sets and characters with consistent IDs, bounding boxes, and whole body keypoints. In this way, long-form videos can be represented by a set of tokens and be learned via generative pre-training. Experimental results validate that our approach has great potential for learning long-form video prior. Code and data will be released at \url{https://github.com/showlab/Long-form-Video-Prior}. This paper proposes learning the long-form video prior via generative pre-training by representing videos as sequences of tokens from bounding boxes, keypoints, and textual descriptions. Current video generation methods struggle with long-form videos due to their complexity and long-range dependencies. Learning the implicit prior of long-form videos can improve video generation in this domain. The authors create a new dataset, Storyboard20K, consisting of movie storyboards with annotations of character bounding boxes, keypoints, film set bounding boxes, and textual descriptions. They represent each storyboard as a sequence of tokens and train a GPT-2 model to predict the next token in the sequence. The proposed method outperforms GPT-3.5 in generating coherent and contextually relevant movie storyboards based on textual metrics. The method also achieves superior performance in visual evaluation using FID compared to GPT-3.5, demonstrating its ability to model and generate visually plausible storyboards. The model exhibits a high decoding success rate (92.5%) for converting generated token sequences back into movie storyboard format. The current work focuses on learning the prior of movie storyboards instead of pixel-level videos. The work is limited by the computational resources, restricting the maximum number of tokens representing a storyboard. generative pre-training, long-form video prior, storyboard generation, video understanding, movie datasets
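Representing storyboards as token sequences hinges on discretizing visual locations. The helper below shows one Pix2Seq-style way to turn a bounding box into location tokens and prepend a consistent character-ID token; the vocabulary layout, bin count, and token names are assumptions for illustration.
```python
def box_to_tokens(box, img_w, img_h, num_bins=512):
    """Quantize a pixel-space box into four location tokens on a num_bins grid."""
    x0, y0, x1, y1 = box
    norm = [x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]
    return [f"<loc_{b}>" for b in bins]

def character_to_sequence(char_id, box, img_w, img_h):
    """Prefix the location tokens with a character-ID token for identity consistency."""
    return [f"<char_{char_id}>"] + box_to_tokens(box, img_w, img_h)

# toy example: character 3 occupying a box in a 1920x1080 keyframe
print(character_to_sequence(3, (480, 120, 960, 1020), 1920, 1080))
```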
2404.15891 Report OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation Lizhi Wang, Feng Zhou, Jianqin Yin Recent advancements in 3D reconstruction technologies have paved the way for high-quality and real-time rendering of complex 3D scenes. Despite these achievements, a notable challenge persists: it is difficult to precisely reconstruct specific objects from large scenes. Current scene reconstruction techniques frequently result in the loss of object detail textures and are unable to reconstruct object portions that are occluded or unseen in views. To address this challenge, we delve into the meticulous 3D reconstruction of specific objects within large scenes and propose a framework termed OMEGAS: Object Mesh Extraction from Large Scenes Guided by GAussian Segmentation. OMEGAS employs a multi-step approach, grounded in several excellent off-the-shelf methodologies. Specifically, initially, we utilize the Segment Anything Model (SAM) to guide the segmentation of 3D Gaussian Splatting (3DGS), thereby creating a basic 3DGS model of the target object. Then, we leverage large-scale diffusion priors to further refine the details of the 3DGS model, especially aimed at addressing invisible or occluded object portions from the original scene views. Subsequently, by re-rendering the 3DGS model onto the scene views, we achieve accurate object segmentation and effectively remove the background. Finally, these target-only images are used to improve the 3DGS model further and extract the definitive 3D object mesh by the SuGaR model. In various scenarios, our experiments demonstrate that OMEGAS significantly surpasses existing scene reconstruction methods. Our project page is at: https://github.com/CrystalWlz/OMEGAS Presents OMEGAS, a framework for extracting high-precision meshes of specified objects from multi-view scene images, even reconstructing occluded or unseen object parts. Existing methods struggle to reconstruct accurate 3D object meshes from large scenes due to compromised object quality and difficulties in reconstructing occluded or unseen object portions. OMEGAS leverages SAM for segmentation-guided 3DGS model creation, utilizes large-scale diffusion priors (Stable Diffusion) to refine details and address unseen parts, and employs SuGaR for final 3DGS optimization and mesh extraction. Achieves superior segmentation accuracy and efficiency compared to Gaussian Grouping. Generates higher quality object meshes with finer details compared to SuGaR and DreamGaussian. Successfully reconstructs occluded or unseen object portions, as demonstrated in ablation studies. The optimization process in SuGaR can be time-consuming, ranging from a few minutes to an hour. Future work could explore optimizing the framework's efficiency for even faster mesh extraction. mesh reconstruction, 3d gaussian splatting, diffusion models, object segmentation, 3d reconstruction
2404.15889 Report Sketch2Human: Deep Human Generation with Disentangled Geometry and Appearance Control Linzi Qu, Jiaxiang Shang, Hui Ye, Xiaoguang Han, Hongbo Fu Geometry- and appearance-controlled full-body human image generation is an interesting but challenging task. Existing solutions are either unconditional or dependent on coarse conditions (e.g., pose, text), thus lacking explicit geometry and appearance control of body and garment. Sketching offers such editing ability and has been adopted in various sketch-based face generation and editing solutions. However, directly adapting sketch-based face generation to full-body generation often fails to produce high-fidelity and diverse results due to the high complexity and diversity in the pose, body shape, and garment shape and texture. Recent geometrically controllable diffusion-based methods mainly rely on prompts to generate appearance and it is hard to balance the realism and the faithfulness of their results to the sketch when the input is coarse. This work presents Sketch2Human, the first system for controllable full-body human image generation guided by a semantic sketch (for geometry control) and a reference image (for appearance control). Our solution is based on the latent space of StyleGAN-Human with inverted geometry and appearance latent codes as input. Specifically, we present a sketch encoder trained with a large synthetic dataset sampled from StyleGAN-Human's latent space and directly supervised by sketches rather than real images. Considering the entangled information of partial geometry and texture in StyleGAN-Human and the absence of disentangled datasets, we design a novel training scheme that creates geometry-preserved and appearance-transferred training data to tune a generator to achieve disentangled geometry and appearance control. Although our method is trained with synthetic data, it can handle hand-drawn sketches as well. Qualitative and quantitative evaluations demonstrate the superior performance of our method to state-of-the-art methods. Sketch2Human is the first deep generative framework for synthesizing full-body human images from a semantic sketch for geometry control and a reference image for appearance control. Existing solutions for full-body human image generation lack explicit and flexible control over detailed geometry and appearance, limiting the ability to generate specific images of interest. Sketch2Human employs a two-stage generation framework: (1) Sketch Image Inversion: inverts the input sketch into a geometry latent code using a sketch encoder trained on a synthetic dataset sampled from StyleGAN-Human. (2) Body Generator Tuning: fine-tunes a pretrained StyleGAN-Human with appearance-transferred and geometry-preserved data synthesized via style mixing to achieve disentangled geometry and appearance control. Sketch2Human enables flexible and disentangled control of geometry and appearance for full-body human image generation. Qualitative and quantitative evaluations demonstrate superior performance over state-of-the-art methods in terms of geometry preservation, appearance transfer, and visual quality. The system exhibits robustness in handling sketches with varying levels of abstraction, from professional to amateur styles. The method's reliance on embedding sketches into StyleGAN-Human's latent space may prioritize reasonable results over perfectly replicating user intent in some cases.
The system's ability to transfer complex textures from real appearance images is limited by the accuracy of the image inversion method and the generative power of the underlying StyleGAN-Human model. full-body image generation, style-based generator, style mixing, sketch-based generation, disentangled geometry and appearance control
2404.15789 Report MotionMaster: Training-free Camera Motion Transfer For Video Generation Teng Hu, Jiangning Zhang, Ran Yi, Yating Wang, Hongrui Huang, Jieyu Weng, Yabiao Wang, Lizhuang Ma The emergence of diffusion models has greatly propelled the progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module, and necessitate substantial computation resources due to the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving-object regions based on the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in temporal attention maps of multiple videos. Finally, we propose a motion combination method to combine different types of camera motions together, enabling our model to achieve more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera-object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control. This paper introduces MotionMaster, a training-free model for transferring camera motion in videos, disentangling camera motion from object motion. Existing camera motion control methods in video generation require extensive training, limiting their flexibility and computational efficiency. MotionMaster leverages temporal attention maps in diffusion models to represent video motion. It disentangles camera and object motion using two methods: 1) One-shot: separating moving objects from the background and estimating camera motion in the foreground by solving a Poisson equation. 2) Few-shot: extracting common camera motion from multiple videos with similar camera movements through a window-based clustering technique. It further enables combining different camera motions for complex controls. MotionMaster effectively disentangles camera motion from object motion in single or multiple videos. It enables flexible camera control by combining different camera motions and applying them to specific regions. Extensive experiments demonstrate superior performance in camera motion transfer, generation quality, and diversity compared to existing methods. The accuracy of camera motion extraction might be affected by complex or rapid object movements. Future work could explore transferring more intricate camera motions, like those found in professional filmmaking.
video generation, camera motion transfer, motion disentanglement, training-free, temporal attention
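A minimal sketch of the one-shot disentanglement idea described above: motion observed in the background is attributed to the camera, and the camera motion hidden behind moving objects is recovered by solving a Poisson/Laplace problem with the background motion as the boundary condition. The iterative harmonic fill below is a simplified stand-in for the paper's exact formulation; the function name and inputs are placeholders.

```python
import numpy as np

def fill_camera_motion(motion, object_mask, iters=2000):
    """Estimate camera motion inside the moving-object region (sketch).

    motion:      (H, W, 2) motion field (e.g. derived from temporal attention maps)
    object_mask: (H, W) bool, True where moving objects occlude the background
    Assumes the mask does not touch the image border; the background values
    act as Dirichlet boundary conditions for the fill.
    """
    cam = motion.copy()
    ys, xs = np.where(object_mask)
    for _ in range(iters):
        # repeatedly set each masked pixel to the average of its 4-neighbourhood:
        # the fixed point is the discrete harmonic (Laplace) solution
        cam[ys, xs] = 0.25 * (cam[ys - 1, xs] + cam[ys + 1, xs] +
                              cam[ys, xs - 1] + cam[ys, xs + 1])
    return cam  # smooth camera-only motion; background values are untouched
```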
2404.15677 Report CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models Qinghe Wang, Baolu Li, Xiaomin Li, Bing Cao, Liqian Ma, Huchuan Lu, Xu Jia Recent advances in text-to-image models have opened new frontiers in human-centric generation. However, these models cannot be directly employed to generate images with consistent newly coined identities. In this work, we propose CharacterFactory, a framework that allows sampling new characters with consistent identities in the latent space of GANs for diffusion models. More specifically, we consider the word embeddings of celeb names as ground truths for the identity-consistent generation task and train a GAN model to learn the mapping from a latent space to the celeb embedding space. In addition, we design a context-consistent loss to ensure that the generated identity embeddings can produce identity-consistent images in various contexts. Remarkably, the whole model only takes 10 minutes for training, and can sample infinite characters end-to-end during inference. Extensive experiments demonstrate excellent performance of the proposed CharacterFactory on character creation in terms of identity consistency and editability. Furthermore, the generated characters can be seamlessly combined with the off-the-shelf image/video/3D diffusion models. We believe that the proposed CharacterFactory is an important step for identity-consistent character generation. Project page is available at: https://qinghew.github.io/CharacterFactory/. CharacterFactory: an end-to-end framework that allows sampling of new, consistent character identities in the latent space of GANs for use in diffusion models, enabling the generation of images featuring the same character in different contexts. Existing text-to-image models struggle to generate images with consistent characters across different contexts. Current subject-driven methods are computationally expensive, prone to overfitting, or require complex pipelines. CharacterFactory leverages an Identity-Embedding GAN (IDE-GAN) trained on celebrity name embeddings to learn a mapping from a latent space to the embedding space of character identities. A context-consistent loss ensures that generated identity embeddings produce consistent images across various contexts. CharacterFactory generates consistent, high-quality character images comparable to or exceeding existing methods. The method is highly efficient, requiring only 10 minutes for training and 3 seconds for inference. CharacterFactory exhibits strong generalization ability and integrates seamlessly with various image, video, and 3D diffusion models. Potential generation of unnatural images or artifacts due to the use of GANs. Inherits limitations of the base diffusion model, such as hand anomalies in Stable Diffusion. gans, diffusion models, identity-consistent generation, character creation, text-to-image synthesis
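To make the CharacterFactory recipe concrete, here is a minimal sketch of training a small GAN that maps Gaussian latents into the celeb-name word-embedding space, plus a context-consistency term. All dimensions, architectures, and the `encode_prompt` helper (standing in for the frozen text encoder applied to a prompt with the generated identity embedding inserted) are assumptions, not the paper's exact losses.

```python
import torch
import torch.nn as nn

latent_dim, embed_dim = 64, 768   # hypothetical: 768-d text-token embeddings for celeb names

G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
D = nn.Sequential(nn.Linear(embed_dim, 512), nn.ReLU(), nn.Linear(512, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(celeb_embeds, encode_prompt):
    """One GAN step mapping z -> identity embedding (illustrative only)."""
    z = torch.randn(celeb_embeds.size(0), latent_dim)
    fake = G(z)

    # discriminator: real celeb-name embeddings vs. generated identity embeddings
    d_loss = bce(D(celeb_embeds), torch.ones(len(celeb_embeds), 1)) + \
             bce(D(fake.detach()), torch.zeros(len(fake), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator: fool D + context-consistent term (the generated identity should
    # yield similar features when placed into different prompt contexts)
    ctx_a = encode_prompt("a photo of {} at the beach", fake)
    ctx_b = encode_prompt("a portrait of {} in the snow", fake)
    g_loss = bce(D(fake), torch.ones(len(fake), 1)) + (ctx_a - ctx_b).pow(2).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```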
2404.15653 Report CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at \url{https://github.com/apple/corenet}. This paper introduces CatLIP, a novel weakly supervised pre-training method for vision models on web-scale image-text data that reframes pre-training as a classification task, leading to a 2.7x speedup compared to contrastive learning methods like CLIP while maintaining comparable downstream task performance. Contrastive learning on image-text pairs has shown great success in learning visual representations but suffers from computational challenges due to pairwise similarity computations. CatLIP treats image-text pre-training as a classification problem. It extracts nouns from text captions, maps them to WordNet synsets to generate multi-label classification targets, and trains the image encoder using a binary cross-entropy loss. CatLIP is 2.7x faster to pre-train than CLIP while maintaining comparable accuracy on downstream tasks. CatLIP's performance scales effectively with both data and model size. Transfer learning with CatLIP is more data-efficient, especially when leveraging the learned classification layer for initialization. The performance of CatLIP starts to saturate with very large models (ViT-H) on ImageNet. The gains from CatLIP's classifier initialization are less pronounced for datasets where target labels are not a subset of the pre-training vocabulary. image-text pre-training, weakly supervised learning, contrastive learning, classification, transfer learning
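A minimal sketch of the reframing described above: extract nouns from each caption, map them to a fixed label vocabulary (the paper derives this from WordNet synsets), build multi-hot targets, and train the image encoder with binary cross-entropy, so no pairwise image-text similarity is needed. The toy vocabulary, the naive noun matching, and the stand-in encoder below are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy label space; in the paper it comes from WordNet synsets of frequent caption nouns.
vocab = ["dog", "cat", "beach", "car", "tree"]
word2idx = {w: i for i, w in enumerate(vocab)}

def captions_to_targets(captions):
    """Multi-hot targets from captions (naive matching stands in for a POS tagger + synset lookup)."""
    targets = torch.zeros(len(captions), len(vocab))
    for i, cap in enumerate(captions):
        for tok in cap.lower().split():
            if tok in word2idx:
                targets[i, word2idx[tok]] = 1.0
    return targets

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, len(vocab)))  # stand-in for a ViT
criterion = nn.BCEWithLogitsLoss()  # multi-label classification instead of contrastive loss

def train_step(images, captions, optimizer):
    logits = image_encoder(images)                       # (B, |vocab|)
    loss = criterion(logits, captions_to_targets(captions))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```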
2404.15506 Report Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, Shaojie Shen We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. SoTA monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with different types of annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our project page is at https://JUGGHM.github.io/Metric3Dv2. Introduces Metric3D v2, a foundation model for zero-shot metric depth and surface normal estimation from single images, achieving state-of-the-art performance on over 16 benchmarks. Metric depth and surface normals are crucial 3D representations for applications like 3D reconstruction, rendering, and robotics, but existing methods suffer from metric ambiguity and limited zero-shot generalization due to data limitations. 1. A canonical camera transformation module addresses metric ambiguity by transforming training data to a canonical camera space. 2. A random proposal normalization loss enhances depth accuracy by focusing on local geometry. 3. A joint depth-normal optimization module distills knowledge from large-scale depth datasets to improve normal estimation, particularly in outdoor scenes. Achieves state-of-the-art zero-shot performance on various metric depth, affine-invariant depth, and surface normal benchmarks. Outperforms previous methods in challenging cases, including fine-grained structures, foreground/background distinction, and unseen camera models. Enables accurate metric 3D reconstruction from single images, benefiting downstream tasks like SLAM and metrology. The accuracy of normal prediction relies on depth estimation quality. Current normal prediction struggles with challenging cases such as reflections and thin structures. Exploring new normal representations and refinement strategies for challenging cases. monocular depth estimation, surface normal estimation, zero-shot learning, 3d reconstruction, foundation models
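The canonical camera space transformation can be illustrated with a short sketch: under a pinhole model, rescaling every training sample as if it were captured with one canonical focal length removes the per-camera metric ambiguity, and the prediction is mapped back with the inverse factor at inference. The label-scaling variant below is a simplification; the paper also discusses transforming the image instead, and the canonical focal length value here is an assumption.

```python
def to_canonical_depth(depth, focal, canonical_focal=1000.0):
    """Scale ground-truth metric depth into the canonical camera space (sketch).

    A scene at depth d seen with focal length f produces the same image as a
    scene at depth d * f_c / f seen with the canonical focal length f_c, so
    training on the rescaled depth removes the camera-dependent ambiguity.
    """
    return depth * (canonical_focal / focal)

def from_canonical_depth(pred_depth, focal, canonical_focal=1000.0):
    """Map a canonical-space prediction back to the real camera's metric depth."""
    return pred_depth * (focal / canonical_focal)
```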
2404.15449 Report ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning Weifeng Chen, Jiacheng Zhang, Jie Wu, Hefeng Wu, Xuefeng Xiao, Liang Lin The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) particularly has received significant attention due to its wide range of application scenarios like AI portrait and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) It is hard to maintain the identity characteristics of reference portraits accurately, (2) The generated images lack aesthetic appeal especially while enforcing identity retention, and (3) Existing approaches are not simultaneously compatible with both LoRA-based and Adapter-based methods. To address these issues, we present \textbf{ID-Aligner}, a general feedback learning framework to enhance ID-T2I performance. To address the loss of identity features, we introduce identity consistency reward fine-tuning to utilize the feedback from face detection and recognition models to improve generated identity preservation. Furthermore, we propose identity aesthetic reward fine-tuning leveraging rewards from human-annotated preference data and automatically constructed feedback on character structure generation to provide aesthetic tuning signals. Thanks to its universal feedback fine-tuning framework, our method can be readily applied to both LoRA and Adapter models, achieving consistent performance gains. Extensive experiments on SD1.5 and SDXL diffusion models validate the effectiveness of our approach. \textbf{Project Page: \url{https://idaligner.github.io/}} ID-Aligner, a novel reward feedback learning framework, enhances identity-preserving text-to-image generation by improving identity consistency and visual appeal. Existing ID-T2I methods struggle with accurate identity preservation, lack aesthetic appeal, and often lack compatibility with both LoRA and Adapter methods. ID-Aligner leverages face detection and recognition models for identity consistency reward fine-tuning. It also uses human-annotated preference data and character structure feedback for identity aesthetic reward fine-tuning. ID-Aligner significantly improves identity preservation compared to baseline models like IP-Adapter and FastComposer. The method enhances visual appeal, particularly in character structure, leading to more aesthetically pleasing generations. ID-Aligner demonstrates strong generalization across different base T2I models like Dreamshaper and RealVisXL. Improvements might be marginal when applied to already robust existing models. Enhancing face similarity might sometimes compromise prompt consistency. text-to-image generation, diffusion model, feedback learning, identity preservation, reward learning
2404.15406 Report Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach. Proposes Wiki-LLaVa, the first Multimodal Large Language Model (MLLM) augmented with a retrieval module to leverage external knowledge from a multimodal document database for answering complex questions. Standard MLLMs struggle to answer questions requiring specific or compositional reasoning due to limitations in their encoded knowledge and the scarcity of long-tail information in training data. Wiki-LLaVa addresses this by integrating external knowledge sources. Employs a hierarchical retrieval pipeline using CLIP and Contriever to identify relevant documents and passages from an external knowledge base, then feeds this information as additional context to an LLaVA-based MLLM. The model is fine-tuned with a mix of knowledge-requiring and standard question-answer pairs. Retrieving relevant passages from an external knowledge base significantly improves accuracy on knowledge-based visual question answering tasks, especially on the InfoSeek dataset. Using multiple retrieved passages as context generally enhances accuracy, highlighting the importance of rich external information. Employing oracle entities for retrieval considerably boosts accuracy, emphasizing the need for a robust entity retrieval model to minimize irrelevant content. Defining better embedding spaces for improved document retrieval from questions and images is crucial. Developing efficient methods for selecting appropriate content from retrieved documents and enhancing the MLLM's ability to discern relevance are key areas for future work. multimodal large language models, knowledge integration, retrieval augmentation, visual question answering, external knowledge bases
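A minimal sketch of the hierarchical retrieval pipeline described above: an image embedding first selects the most relevant documents from the external knowledge base, a question embedding then ranks passages within those documents, and the top passages are prepended to the MLLM prompt as extra context. The `embed_image` and `embed_text` callables stand in for the CLIP and Contriever encoders, and the knowledge-base layout is an assumption.

```python
import numpy as np

def cosine_top_k(query, mat, k):
    sims = mat @ query / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query) + 1e-8)
    return np.argsort(-sims)[:k]

def retrieve_context(image, question, kb, embed_image, embed_text, k_docs=3, k_pass=2):
    """Hierarchical retrieval sketch.

    kb: list of documents, each {"doc_emb": np.ndarray, "passages": [str], "passage_embs": np.ndarray}
    embed_image / embed_text: placeholder encoders standing in for CLIP and Contriever.
    """
    # Stage 1: image -> documents
    doc_embs = np.stack([d["doc_emb"] for d in kb])
    doc_ids = cosine_top_k(embed_image(image), doc_embs, k_docs)

    # Stage 2: question -> passages within the retrieved documents
    q = embed_text(question)
    passages = []
    for i in doc_ids:
        top = cosine_top_k(q, kb[i]["passage_embs"], k_pass)
        passages += [kb[i]["passages"][j] for j in top]

    # Stage 3: feed the retrieved passages to the multimodal LLM as additional context
    return "Context:\n" + "\n".join(passages) + f"\nQuestion: {question}\nAnswer:"
```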
2404.15349 Report A Survey on Multimodal Wearable Sensor-based Human Action Recognition Jianyuan Ni, Hao Tang, Syed Tousiful Haque, Yan Yan, Anne H. H. Ngu The combination of increased life expectancy and falling birth rates is resulting in an aging population. Wearable Sensor-based Human Activity Recognition (WSHAR) emerges as a promising assistive technology to support the daily lives of older individuals, unlocking vast potential for human-centric applications. However, recent surveys in WSHAR have been limited, focusing either solely on deep learning approaches or on a single sensor modality. In real life, humans interact with the world in a multi-sensory way, where diverse information sources are intricately processed and interpreted to form a complex and unified sensing system. To give machines similar intelligence, multimodal machine learning, which merges data from various sources, has become a popular research area with recent advancements. In this study, we present a comprehensive survey from a novel perspective on how to leverage multimodal learning in the WSHAR domain for newcomers and researchers. We begin by presenting the recent sensor modalities as well as deep learning approaches in HAR. Subsequently, we explore the techniques used in present multimodal systems for WSHAR. This includes inter-multimodal systems which utilize sensor modalities from both visual and non-visual systems and intra-multimodal systems that simply take modalities from non-visual systems. After that, we focus on current multimodal learning approaches that have been applied to solve some of the challenges existing in WSHAR. Specifically, we make extra efforts by connecting the existing multimodal literature from other domains, such as computer vision and natural language processing, with the current WSHAR area. Finally, we identify the corresponding challenges and potential research directions in the current WSHAR area for further improvement. This paper presents a comprehensive survey on multimodal learning for wearable sensor-based human action recognition (WSHAR). WSHAR has vast potential for applications like assistive technology, but existing surveys are limited in scope, focusing either on deep learning only or on single sensor modalities. The survey covers recent sensor modalities, deep learning in HAR, inter- and intra-multimodal approaches, and multimodal solutions to WSHAR challenges. WSHAR datasets with IMU data are limited compared to other modalities. Multimodal learning shows promise for addressing challenges like data scarcity and feature alignment in WSHAR. Future research directions include future activity prediction, identifying unknown activities, and developing unified multimodal systems. The survey mainly focuses on the combination of IMU with other modalities, excluding some other potential modalities such as pressure sensors. Discussion on security and ethical considerations for multimodal WSHAR systems is limited. multimodal learning, wearable sensors, human action recognition, deep learning, time series analysis
2404.15276 Report SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation Xiangyu Xu, Lijuan Liu, Shuicheng Yan Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at https://github.com/xuxy09/SMPLer. This paper proposes SMPLer, a novel Transformer framework for monocular 3D human shape and pose estimation, enabling the efficient use of high-resolution image features for improved accuracy. Existing Transformers for this task struggle to utilize fine-grained information in high-resolution features due to quadratic computation and memory complexity, limiting their performance. SMPLer introduces two key innovations: 1) decoupled attention to reduce complexity to linear w.r.t. feature length and 2) an SMPL-based target representation for a more compact and efficient embedding. Further, it incorporates multi-scale attention and joint-aware attention modules to leverage both global and local image information. SMPLer significantly outperforms state-of-the-art methods on Human3.6M and 3DPW datasets, achieving a 10% lower MPJPE error with fewer parameters. The compact SMPL-based representation ensures smoother and more consistent 3D human meshes compared to vertex-based methods. The explicit modeling of body part rotations in SMPLer allows for efficient and accurate control of virtual avatars. The current implementation still relies on a CNN backbone. Exploring attention-based backbones within the SMPLer framework could be a future research direction. 3d human shape and pose estimation, transformer, attention mechanism, multi-scale, smpl
2404.15275 Report ID-Animator: Zero-Shot Identity-Preserving Human Video Generation Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Jie Zhang Generating high fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case finetuning or usually missing the identity details in video generation process. In this study, we present ID-Animator, a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training. ID-Animator builds on existing diffusion-based video generation backbones, adding a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline, which incorporates a decoupled human attribute and action captioning technique applied to a constructed facial image pool. Based on this pipeline, a random face reference training method is further devised to precisely capture the ID-relevant embeddings from reference images, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator to generate personalized human videos over previous models. Moreover, our method is highly compatible with popular pre-trained T2V models like animatediff and various community backbone models, showing high extendability in real-world applications for video generation where identity preservation is highly desired. Our codes and checkpoints will be released at https://github.com/ID-Animator/ID-Animator. This paper proposes ID-Animator, a novel zero-shot framework for generating identity-specific human videos from a single facial image without further training. Generating high-fidelity, identity-specific human videos is crucial in various fields like the film industry, but existing methods struggle to balance training efficiency, identity preservation, and instruction following. ID-Animator combines a pretrained text-to-video diffusion model with a lightweight, trainable face adapter. It leverages an ID-oriented dataset with decoupled human attribute and action captions, and utilizes a random reference training strategy to enhance identity fidelity and instruction following. ID-Animator outperforms previous methods in generating personalized human videos with higher identity fidelity and motion quality. The proposed framework allows for recontextualization of reference images by modifying attributes, backgrounds, and actions through text prompts. ID-Animator exhibits strong generalization capabilities, effectively integrating with ControlNet and community-trained models. The current dataset primarily focuses on human subjects, limiting the generation to human-centric videos. Exploring alternative architectures for the face adapter could potentially further enhance its performance. video generation, identity preservation, diffusion models, text-to-video synthesis, personalized content generation
2404.15267 Report From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation Zehuan Huang, Hongxing Fan, Lipeng Wang, Lu Sheng Recent advancements in controllable human image generation have led to zero-shot generation using structural signals (e.g., pose, depth) or facial appearance. Yet, generating human images conditioned on multiple parts of human appearance remains challenging. Addressing this, we introduce Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, including pose images and various aspects of human appearance. To achieve this, we first develop a semantic-aware appearance encoder to retain details of different human parts, which processes each image based on its textual label to a series of multi-scale feature maps rather than one image token, preserving the image dimension. Second, our framework supports multi-image conditioned generation through a shared self-attention mechanism that operates across reference and target features during the diffusion process. We enhance the vanilla attention mechanism by incorporating mask information from the reference human images, allowing for the precise selection of any part. Extensive experiments demonstrate the superiority of our approach over existing alternatives, offering advanced capabilities for multi-part controllable human image customization. See our project page at https://huanngzh.github.io/Parts2Whole/. This paper introduces Parts2Whole, a novel framework that leverages multiple reference images (e.g., hair, face, clothes) and pose maps to generate customizable human portraits. Existing methods for controllable human image generation struggle to accurately synthesize images conditioned on multiple aspects of human appearance, limiting customization options for users. Parts2Whole utilizes a dual U-Net design, incorporating a semantic-aware appearance encoder to extract detailed features from each labeled reference image. It then employs a shared self-attention mechanism to inject these features into the generation process, guided by subject masks for precise control. Parts2Whole demonstrates superior quality and controllability compared to existing methods, accurately synthesizing human images with fine-grained details from multiple reference images. The framework allows for flexible combinations of body parts, enabling generation from single or multiple reference images with varying aspects. Evaluations using CLIP score, DINO score, DreamSim, and user studies confirm the effectiveness of Parts2Whole in generating high-quality and well-aligned human images. Current training resolution of 512x512 might introduce artifacts, suggesting higher resolution and larger diffusion models for future improvement. Expanding the framework to achieve layer-wise clothing try-on is a promising avenue for future research. controllable image generation, human image synthesis, multi-reference image generation, diffusion models, appearance control
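A minimal sketch of the shared self-attention with mask guidance described above: target queries attend jointly to target tokens and to reference tokens, while a subject mask suppresses reference tokens that do not belong to the selected part. Tensor shapes, the single-head formulation, and the exact placement of this operation inside the dual U-Net are assumptions.

```python
import torch

def masked_shared_attention(q_tgt, kv_tgt, kv_ref, ref_mask):
    """Shared self-attention with reference-mask guidance (simplified, single-head sketch).

    q_tgt:    (B, Nt, C) queries from the target (denoising) U-Net features
    kv_tgt:   (B, Nt, C) keys/values from the target features
    kv_ref:   (B, Nr, C) keys/values from the reference (appearance encoder) features
    ref_mask: (B, Nr)    1 for reference tokens inside the chosen part's subject mask
    """
    B, Nt, C = q_tgt.shape
    k = torch.cat([kv_tgt, kv_ref], dim=1)                     # (B, Nt+Nr, C)
    v = k
    scores = q_tgt @ k.transpose(1, 2) / C ** 0.5              # (B, Nt, Nt+Nr)

    # target tokens are always visible; reference tokens outside the mask are suppressed
    keep = torch.cat([torch.ones(B, Nt, device=q_tgt.device), ref_mask.float()], dim=1)
    scores = scores.masked_fill(keep[:, None, :] == 0, float("-inf"))

    return scores.softmax(dim=-1) @ v                          # (B, Nt, C)
```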
2404.15264 Report TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, due to the difficulty in fitting steep appearance changes, the prevailing paradigm that presents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging the point-based Gaussian Splatting, facial motions can be represented in our method by applying smooth and continuous deformations to persistent Gaussian primitives, without requiring to learn the difficult appearance change like previous methods. Due to this simplification, precise facial motions can be synthesized while keeping a highly intact facial feature. Under such a deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this conflict, we decompose the model into two branches separately for the face and inside mouth areas, therefore simplifying the learning tasks to help reconstruct more accurate motion and structure of the mouth region. Extensive experiments demonstrate that our method renders high-quality lip-synchronized talking head videos, with better facial fidelity and higher efficiency compared with previous methods. TalkingGaussian, a deformation-based radiance fields framework using 3D Gaussian Splatting for high-fidelity 3D talking head synthesis. Existing NeRF-based methods struggle to synthesize accurate facial features due to difficulties in fitting abrupt appearance changes characteristic of facial movements. The method represents the talking head with Deformable Gaussian Fields, using Persistent Gaussian Fields for static head structure and Grid-based Motion Fields to predict deformations applied to Gaussian primitives, representing facial movements. A Face-Mouth Decomposition module separates face and inside-mouth regions to improve motion accuracy. Incremental sampling strategy using facial action priors smooths the deformation learning process. TalkingGaussian synthesizes high-quality, lip-synced talking head videos with superior facial fidelity compared to state-of-the-art methods. The framework achieves high generalization ability, effectively handling cross-domain audio inputs. TalkingGaussian demonstrates superior efficiency in both training and inference thanks to 3D Gaussian Splatting. Random noisy primitives can occur during 3DGS densification, impacting quality. Alignment between face and inside-mouth branches relies solely on audio features, leading to potential misalignment in cross-domain scenarios. talking head synthesis, 3d gaussian splatting, deformation-based, radiance fields, facial fidelity
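A minimal sketch of the deformation paradigm described above: appearance stays attached to persistent Gaussian primitives, while a motion field conditioned on per-frame audio features predicts offsets to each primitive's position (and optionally rotation and scale). The linear position embedding below stands in for the paper's grid-based encoding, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GaussianDeformer(nn.Module):
    """Predict per-Gaussian deformations from (canonical position, audio feature) -- illustrative only."""

    def __init__(self, pos_feat_dim=32, audio_dim=64, hidden=128):
        super().__init__()
        self.pos_embed = nn.Linear(3, pos_feat_dim)      # stand-in for a multi-resolution grid encoding
        self.mlp = nn.Sequential(
            nn.Linear(pos_feat_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),                # offsets: position, rotation (quaternion), scale
        )

    def forward(self, xyz, audio_feat):
        # xyz: (N, 3) canonical Gaussian centers; audio_feat: (audio_dim,) per-frame feature
        h = torch.cat([self.pos_embed(xyz), audio_feat.expand(xyz.shape[0], -1)], dim=-1)
        d_xyz, d_rot, d_scale = self.mlp(h).split([3, 4, 3], dim=-1)
        # the persistent Gaussians keep their colors/opacities; only geometry is deformed per frame
        return xyz + d_xyz, d_rot, d_scale
```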
2404.15263 Report Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization Lahav Lipson, Jia Deng We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at github.com/princeton-vl/MultiSlam_DiffPose This paper introduces a new system for Multi-Session SLAM, which can track camera motion across multiple disjoint videos under a single global reference. Handling disjoint videos in SLAM is important for many applications in AR and robotics where video data often consists of multiple non-continuous sessions. The system couples optical flow prediction with differentiable solver layers to estimate camera pose. It utilizes a novel differentiable solver for wide-baseline two-view pose and is trained end-to-end. The system is more accurate than prior Multi-Session SLAM approaches on EuRoC-MAV and ETH3D datasets. It is robust to catastrophic failures common in challenging scenarios. The two-view pose estimation component is competitive with transformer-based matching networks on Scannet and Megadepth datasets. The two-view pose method is less competitive in photo-tourism settings where high-volume matching is easier. Future work could explore event cameras and inertial sensors to further improve robustness and accuracy. slam, multi-session slam, visual odometry, differentiable solvers, optical flow
2404.15259 Report FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent Cameron Smith, David Charatan, Ayush Tewari, Vincent Sitzmann This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM). FlowMap, an end-to-end differentiable method that recovers accurate camera poses, intrinsics, and dense depth maps from video sequences. Enables novel view synthesis from unposed videos and paves the way for deep-learning based 3D reconstruction and scene understanding by being compatible with deep learning pipelines. Minimizes a least-squares objective comparing the optical flow induced by depth, intrinsics, and poses against correspondences obtained from off-the-shelf optical flow and point tracking. Introduces differentiable feed-forward estimations of depth (via a neural network), pose (as a solution to a least-squares problem), and intrinsics (using a differentiable selection based on optical flow consistency). FlowMap enables photorealistic novel view synthesis up to full 360° trajectories using Gaussian Splatting. Significantly outperforms prior gradient-descent based bundle adjustment methods. Performs on par with COLMAP on the downstream task of 360° novel view synthesis. Less accurate and robust than COLMAP in terms of pose and intrinsics prediction. Requires more GPU memory and slightly longer runtime compared to COLMAP. structure-from-motion, novel view synthesis, differentiable rendering, optical flow, point tracking
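The core least-squares objective described above can be sketched directly: depth, intrinsics, and relative pose induce a flow by unprojecting pixels in one frame, transforming them into the next, and reprojecting; the loss compares this induced flow against off-the-shelf correspondences. The differentiable re-parameterizations of depth, pose, and intrinsics are omitted here, and the dense formulation is a simplification.

```python
import torch

def induced_flow(depth, K, R, t):
    """Flow from frame i to frame j induced by depth (H, W), intrinsics K (3, 3), relative pose (R, t)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)   # homogeneous pixel coords

    rays = (torch.linalg.inv(K) @ pix.T).T          # unproject pixels to camera rays
    pts_i = rays * depth.reshape(-1, 1)             # 3D points in frame i's camera
    pts_j = pts_i @ R.T + t                         # transform into frame j's camera
    proj = (K @ pts_j.T).T
    uv_j = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    return (uv_j - pix[:, :2]).reshape(H, W, 2)

def flow_loss(depth, K, R, t, observed_flow):
    """Least-squares consistency against correspondences from an off-the-shelf flow / point tracker."""
    return (induced_flow(depth, K, R, t) - observed_flow).pow(2).mean()
```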
2404.15228 Report Re-Thinking Inverse Graphics With Large Language Models Peter Kulits, Haiwen Feng, Weiyang Liu, Victoria Abrevaya, Michael J. Black Inverse graphics -- the task of inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics. Disentangling an image into its constituent elements, such as the shape, color, and material properties of the objects of the 3D scene that produced it, requires a comprehensive understanding of the environment. This requirement limits the ability of existing carefully engineered approaches to generalize across domains. Inspired by the zero-shot ability of large language models (LLMs) to generalize to novel contexts, we investigate the possibility of leveraging the broad world knowledge encoded in such models in solving inverse-graphics problems. To this end, we propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM, that autoregressively decodes a visual embedding into a structured, compositional 3D-scene representation. We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training. Through our investigation, we demonstrate the potential of LLMs to facilitate inverse graphics through next-token prediction, without the use of image-space supervision. Our analysis opens up new possibilities for precise spatial reasoning about images that exploit the visual knowledge of LLMs. We will release our code and data to ensure the reproducibility of our investigation and to facilitate future research at https://ig-llm.is.tue.mpg.de/ This paper introduces IG-LLM, a novel framework leveraging Large Language Models (LLMs) for solving inverse graphics tasks, aiming to generate graphics programs from images for 3D scene reproduction. Existing inverse graphics methods struggle with generalizing to novel scenes or objects. This work explores the potential of LLMs, with their strong generalization abilities and world knowledge, to overcome these limitations. IG-LLM uses a pre-trained LLM enhanced with a visual encoder (CLIP) and a numeric head. It's trained on synthetic data (CLEVR and ShapeNet) with an instruction-tuning approach to predict graphics programs from images. IG-LLM exhibits strong compositional generalization, outperforming the baseline NS-VQA by 60% in shape recognition accuracy on out-of-distribution CLEVR data. The integration of a numeric head enables IG-LLM to perform precise spatial reasoning, showing superior performance in 2D and SO(3) parameter space generalization tasks. IG-LLM demonstrates promising results in 6-DoF pose estimation, scaling to multi-object scenes and exhibiting generalization ability in both single-object and scene-level settings. The expressiveness of IG-LLM is currently limited by the training data and code representation, potentially restricting its ability to handle complex real-world scenes. Addressing scenes with significant occlusions or complex arrangements might require a balance between the current generic approach and task-specific inductive biases. inverse graphics, large language models, 3d scene understanding, compositional generalization, spatial reasoning
2404.15141 Report CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method Mingbao Lin, Zhihang Lin, Wengyi Zhan, Liujuan Cao, Rongrong Ji Transforming large pre-trained low-resolution diffusion models to cater to higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at simplifying and accelerating the diffusion extrapolation process, making it more affordable and improving performance. CutDiffusion abides by the existing patch-wise extrapolation but cuts a standard patch diffusion process into an initial phase focused on comprehensive structure denoising and a subsequent phase dedicated to specific detail refinement. Comprehensive experiments highlight the numerous advantages of CutDiffusion: (1) simple method construction that enables a concise higher-resolution diffusion process without third-party engagement; (2) fast inference speed achieved through a single-step higher-resolution diffusion process, and fewer inference patches required; (3) cheap GPU cost resulting from patch-wise inference and fewer patches during the comprehensive structure denoising; (4) strong generation performance, stemming from the emphasis on specific detail refinement. This paper introduces CutDiffusion, a tuning-free diffusion extrapolation method for generating high-resolution images from pre-trained low-resolution diffusion models. Training high-resolution diffusion models from scratch is computationally expensive and time-consuming. Diffusion extrapolation leverages pre-trained models to generate higher-resolution images efficiently. CutDiffusion divides the image generation process into two stages: (1) Comprehensive Structure Denoising: Randomly sampled non-overlapping patches undergo denoising with pixel interaction to ensure similar content across patches. (2) Specific Detail Refinement: Structurally-enhanced patches are reassembled into a higher-resolution latent, followed by denoising with overlapping patches to refine details. CutDiffusion is simple to implement, requiring only modification to the sub-patch sampling approach. CutDiffusion achieves fast inference speeds, comparable to direct inference methods and significantly faster than existing patch-wise methods. CutDiffusion maintains low GPU memory consumption, making it more accessible than methods demanding high GPU resources. The generated high-resolution image quality relies on the quality of the pretrained diffusion model. The second stage, using overlapping patches, limits further speed improvements. image generation, high resolution, diffusion model, diffusion extrapolation, tuning-free
2404.15100 Report Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation Xun Wu, Shaohan Huang, Furu Wei Recent studies have demonstrated the exceptional potentials of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, resulting in limited applicability for instruction tuning in open-source text-to-image generative models and hinder further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. We aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and harmlessness to construct VisionPrefer. To validate the effectiveness of VisionPrefer, we train a reward model VP-Score over VisionPrefer to guide the training of text-to-image generative models and the preference prediction accuracy of VP-Score is comparable to human annotators. Furthermore, we use two reinforcement learning methods to fine-tune generative models and evaluate the performance of VisionPrefer, and extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetic, and generalizes better than previous human-preference metrics across various image distributions. Moreover, VisionPrefer indicates that the integration of AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models. This paper introduces VisionPrefer, a large-scale, high-quality, and diversified preference dataset for text-to-image generative alignment, constructed using feedback from multimodal large language models (MLLMs). Existing human preference datasets for aligning text-to-image generative models are expensive to construct and limited in scale and diversity, hindering the development of more aligned models. The authors leverage MLLMs to generate preferences for images generated by different text-to-image models based on a curated prompt set. The preferences cover four aspects: prompt-following, fidelity, aesthetic, and harmlessness, providing both numerical scores and textual explanations. The resulting dataset, VisionPrefer, is significantly larger and more fine-grained than existing human-annotated datasets. The authors train a reward model, VP-Score, on VisionPrefer and show it achieves comparable performance to reward models trained on human preferences. Fine-tuning generative models using VP-Score and VisionPrefer with PPO and DPO, respectively, leads to significant improvements in image quality and alignment with human preferences across various aspects. The textual explanations in VisionPrefer are not fully utilized. The issue of image distortion, while mitigated, is not completely solved and requires further research. text-to-image generation, preference learning, reinforcement learning from ai feedback, multimodal large language models, ai-synthesized data
2404.15014 Report OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving Guoqing Wang, Zhongdao Wang, Pin Tang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, Chao Ma Existing solutions for 3D semantic occupancy prediction typically treat the task as a one-shot 3D voxel-wise segmentation perception problem. These discriminative methods focus on learning the mapping between the inputs and occupancy map in a single step, lacking the ability to gradually refine the occupancy map and the scene-imagination capacity needed to complete certain local regions. In this paper, we introduce OccGen, a simple yet powerful generative perception model for the task of 3D semantic occupancy prediction. OccGen adopts a ''noise-to-occupancy'' generative paradigm, progressively inferring and refining the occupancy map by predicting and eliminating noise originating from a random Gaussian distribution. OccGen consists of two main components: a conditional encoder that is capable of processing multi-modal inputs, and a progressive refinement decoder that applies diffusion denoising using the multi-modal features as conditions. A key insight of this generative pipeline is that the diffusion denoising process is naturally able to model the coarse-to-fine refinement of the dense 3D occupancy map, therefore producing more detailed predictions. Extensive experiments on several occupancy benchmarks demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods. For instance, OccGen relatively enhances the mIoU by 9.5%, 6.3%, and 13.3% on nuScenes-Occupancy dataset under the multi-modal, LiDAR-only, and camera-only settings, respectively. Moreover, as a generative perception model, OccGen exhibits desirable properties that discriminative models cannot achieve, such as providing uncertainty estimates alongside its multiple-step predictions. Introduces OccGen, a generative model for 3D semantic occupancy prediction, which progressively refines the occupancy map by predicting and eliminating noise, leading to better detail and scene completion. Addresses limitations of discriminative methods that lack gradual refinement and struggle with local scene completion in 3D occupancy prediction. Uses a 'noise-to-occupancy' paradigm with a conditional encoder for multi-modal inputs and a progressive refinement decoder applying diffusion denoising using these inputs. Outperforms state-of-the-art methods on nuScenes-Occupancy and SemanticKITTI benchmarks. Offers flexible compute-accuracy trade-off through progressive inference. Provides uncertainty estimates alongside predictions. Current latency comparable to existing methods, future work aims for lightweight architecture. Potential for bias in the model based on training data, needing careful consideration for real-world deployment. occupancy prediction, generative model, diffusion model, autonomous driving, multi-modal learning
2404.14967 Report CoARF: Controllable 3D Artistic Style Transfer for Radiance Fields Deheng Zhang, Clara Fernandez-Labrador, Christopher Schroers Creating artistic 3D scenes can be time-consuming and requires specialized knowledge. To address this, recent works such as ARF, use a radiance field-based approach with style constraints to generate 3D scenes that resemble a style image provided by the user. However, these methods lack fine-grained control over the resulting scenes. In this paper, we introduce Controllable Artistic Radiance Fields (CoARF), a novel algorithm for controllable 3D scene stylization. CoARF enables style transfer for specified objects, compositional 3D style transfer and semantic-aware style transfer. We achieve controllability using segmentation masks with different label-dependent loss functions. We also propose a semantic-aware nearest neighbor matching algorithm to improve the style transfer quality. Our extensive experiments demonstrate that CoARF provides user-specified controllability of style transfer and superior style transfer quality with more precise feature matching. Introduces CoARF, an algorithm for controllable 3D scene stylization using radiance fields, enabling object-specific, compositional, and semantic-aware style transfer. Addresses the limitations of existing 3D scene stylization methods by providing fine-grained control over the style transfer process for more precise and user-specified results. Utilizes a multi-view 2D mask-based optimization framework with label-dependent loss functions and a novel semantic-aware nearest neighbor matching (SANNFM) algorithm. Enables users to selectively stylize specific objects within a scene while preserving the photorealism of other elements. Allows for the application of different styles to different parts of the 3D scene through compositional style transfer. Achieves superior style transfer quality, particularly in semantically sensitive scenarios, by leveraging both VGG and LSeg features for improved feature matching. Large scale differences between scene and style objects can lead to undesired stylization results. Freezing the density field during optimization may limit the richness of the stylization outcomes. 3d scene stylization, radiance fields, semantic-aware style transfer, controllable artistic style, neural rendering
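A minimal sketch of the semantic-aware nearest-neighbour matching idea described above: for each rendered-view feature, find the style feature minimizing a weighted combination of VGG (appearance) and LSeg (semantic) cosine distances, then pull the rendered appearance feature toward that match. The weighting, normalization, and exact loss form used in the paper may differ from this simplification.

```python
import torch
import torch.nn.functional as F

def sannfm_loss(vgg_render, lseg_render, vgg_style, lseg_style, lam=0.5):
    """Semantic-aware nearest-neighbour feature matching loss (sketch).

    vgg_render / lseg_render: (N, Cv) / (N, Cs) features of rendered-view pixels
    vgg_style  / lseg_style:  (M, Cv) / (M, Cs) features of the style image
    lam: trade-off between appearance distance and semantic distance for matching
    """
    def cos_dist(a, b):
        return 1.0 - F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T   # (N, M)

    d_vgg = cos_dist(vgg_render, vgg_style)
    d_sem = cos_dist(lseg_render, lseg_style)
    nn_idx = (lam * d_vgg + (1 - lam) * d_sem).argmin(dim=1)             # semantic-aware match

    # pull each rendered appearance feature toward its matched style feature
    return d_vgg.gather(1, nn_idx[:, None]).mean()
```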
2404.14966 Report Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model Xu Han, Yuan Tang, Zhaoxuan Wang, Xianzhi Li Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. This paper proposes \ours, a novel state space model (SSM) tailored for 3D point cloud learning that leverages Mamba's efficiency while addressing its limitations for unordered points and local feature extraction. Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, limiting their scalability. Mamba, based on SSM, offers linear complexity but lacks effective adaptation for point clouds. The paper introduces two key components: (1) Local Norm Pooling (LNP) block for local feature extraction, using K-norm for propagation and K-pooling for aggregation. (2) Bidirectional-SSM (bi-SSM) with a token forward SSM and a novel feature reverse backward SSM (C-SSM) to capture global features while mitigating pseudo-order reliance. \ours achieves state-of-the-art (SoTA) results on ScanObjectNN classification, outperforming previous methods even when trained from scratch. It demonstrates superior performance in few-shot learning on ModelNet40, highlighting its ability to learn from limited data. The model consistently outperforms Transformer-based counterparts across various tasks, including object classification and part segmentation, with reduced parameters and FLOPs. The pre-training benefits are not as significant as in Transformers, potentially due to limitations of masked point modeling for recurrent models like Mamba. Future work will focus on exploring tailored pre-training strategies and scaling up the model to further exploit its linear complexity advantage. point cloud analysis, state space model, local feature, mamba, linear complexity
2404.14768 Report Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion Hongyu Chen, Yiqi Gao, Min Zhou, Peng Wang, Xubin Li, Tiezheng Ge, Bo Zheng Recently, integrating visual controls into text-to-image (T2I) models, such as the ControlNet method, has received significant attention for finer control capabilities. While various training-free methods make efforts to enhance prompt following in T2I models, the issue with visual control is still rarely studied, especially in scenarios where visual controls are misaligned with text prompts. In this paper, we address the challenge of "Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish aligned and misaligned parts of visual controls and prompts. Meanwhile, a network, dubbed Masked ControlNet, is designed to utilize these object masks for object generation in the misaligned visual control region. Further, to improve attribute matching, a simple yet efficient loss is designed to align the attention maps of attributes with object regions constrained by ControlNet and object masks. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments. This paper introduces Mask-guided Prompt Following (MGPF), a training-free approach for improving prompt following in text-to-image synthesis models that use visual controls (like ControlNet), specifically addressing misalignment between text prompts and visual cues. Existing text-to-image models with visual controls often struggle to accurately reflect text prompts, particularly when there's a misalignment between the prompt and the visual control. This leads to inaccuracies in generated images, such as missing objects or mismatched attributes, limiting the controllability and quality of image generation. MGPF uses object masks to separate aligned and misaligned portions of the visual control. It introduces Masked ControlNet, which utilizes these masks to focus on relevant visual features, and an Attribute-matching Loss to ensure attributes in the text prompt are correctly reflected in the generated image. MGPF outperforms existing training-free methods in aligning generated images with both text prompts and visual controls, as measured by text-image similarity, VQA-based metrics, and human evaluation. The Masked ControlNet effectively addresses the 'object missing' problem by focusing on relevant visual features and allowing the model to generate objects based on the text prompt, even when misaligned with the visual control. The Attribute-matching Loss successfully tackles the 'attribute mismatch' problem, ensuring that attributes in the text prompt are accurately reflected in the generated image without disrupting the visual control. The method faces challenges in complex scenarios involving attribute matching for multiple or small objects due to the limited resolution of attention maps. Future work could explore incorporating cross-attention in higher-resolution layers to enhance localized attribute binding. text-to-image synthesis, visual control, prompt following, controlnet, attribute matching
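A minimal sketch of an attribute-matching loss in the spirit described above: gather the cross-attention map of an attribute token and encourage its mass to concentrate inside the corresponding object region. The concrete loss form, tensor layout, and how the mask is combined with the ControlNet-constrained region are assumptions for illustration.

```python
import torch

def attribute_matching_loss(attn_maps, token_idx, object_mask, eps=1e-6):
    """Encourage an attribute token's cross-attention to fall inside its object region (sketch).

    attn_maps:   (heads, H*W, num_tokens) cross-attention from one diffusion U-Net layer
    token_idx:   index of the attribute token (e.g. "red" in "a red car")
    object_mask: (H, W) binary mask of the object the attribute should bind to,
                 assumed already resized to the attention resolution
    """
    attn = attn_maps[..., token_idx].mean(dim=0)   # (H*W,), averaged over heads
    mask = object_mask.reshape(-1).float()

    inside = (attn * mask).sum()
    total = attn.sum() + eps
    return 1.0 - inside / total                    # 0 when all attention mass lies inside the mask
```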
2404.14743 Report Gradient Guidance for Diffusion Models: An Optimization Perspective Yingqing Guo, Hui Yuan, Yukang Yang, Minshuo Chen, Mengdi Wang Diffusion models have demonstrated empirical successes in various applications and can be adapted to task-specific needs via guidance. This paper introduces a form of gradient guidance for adapting or fine-tuning diffusion models towards user-specified optimization objectives. We study the theoretic aspects of a guided score-based sampling process, linking the gradient-guided diffusion model to first-order optimization. We show that adding gradient guidance to the sampling process of a pre-trained diffusion model is essentially equivalent to solving a regularized optimization problem, where the regularization term acts as a prior determined by the pre-training data. Diffusion models are able to learn data's latent subspace, however, explicitly adding the gradient of an external objective function to the sample process would jeopardize the structure in generated samples. To remedy this issue, we consider a modified form of gradient guidance based on a forward prediction loss, which leverages the pre-trained score function to preserve the latent structure in generated samples. We further consider an iteratively fine-tuned version of gradient-guided diffusion where one can query gradients at newly generated data points and update the score network using new samples. This process mimics a first-order optimization iteration in expectation, for which we proved O(1/K) convergence rate to the global optimum when the objective function is concave. The paper introduces a novel gradient-based guidance method for diffusion models, allowing them to be adapted for generating samples that optimize user-specified objectives while preserving the learned data structure. This is important because it bridges the gap between generative AI and optimization, enabling efficient optimization in complex design spaces (images, videos, proteins, etc.) where traditional methods struggle. The paper proposes a gradient guidance based on a forward prediction loss, which leverages the pre-trained score function to preserve the latent subspace structure of the data. They analyze two algorithms: one iteratively updates the guidance using newly queried gradients, and another additionally fine-tunes the score network with self-generated samples. Iteratively applying gradient guidance with a pre-trained score function generates samples whose expectation converges to a solution regularized with respect to the original data distribution. The pre-trained score function acts as a prior, limiting the extent to which the model can be adapted away from the original data distribution. Adaptively fine-tuning the score network using self-generated samples allows the model to converge to the global optimum of the objective function within the latent subspace, achieving a convergence rate comparable to classical convex optimization. Theoretical analysis focuses on the class of linear score functions, while experiments utilize a more complex U-Net architecture. The paper primarily focuses on concave objective functions. Future work could explore extensions to non-concave objectives. diffusion models, generative ai, optimization, gradient guidance, score matching
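To make the sampling-time guidance concrete, here is a minimal sketch of a gradient-guided reverse step, assuming a toy noise predictor `eps_model` and a differentiable objective `f`; the paper's forward-prediction-loss form of the guidance and its convergence analysis are not reproduced here.

```python
# Minimal sketch of generic gradient-guided sampling: each reverse step is
# nudged by the gradient of an external objective evaluated at the predicted
# clean sample. `eps_model` and `f` below are toy stand-ins (assumptions).
import torch

def guided_step(x_t, t, eps_model, f, alpha_bar_t, alpha_bar_prev, guidance_scale=1.0):
    eps = eps_model(x_t, t)
    # Predicted clean sample x0 from the noise prediction.
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    # Gradient of the external objective at the x0 prediction.
    x0_hat = x0_hat.detach().requires_grad_(True)
    grad = torch.autograd.grad(f(x0_hat).sum(), x0_hat)[0]
    x0_guided = x0_hat + guidance_scale * grad          # ascend the objective
    # Deterministic DDIM-style update toward the previous noise level.
    return alpha_bar_prev.sqrt() * x0_guided + (1 - alpha_bar_prev).sqrt() * eps

if __name__ == "__main__":
    eps_model = lambda x, t: torch.zeros_like(x)   # toy denoiser
    f = lambda x: -(x ** 2).sum(dim=-1)            # toy concave objective
    x = torch.randn(4, 8)
    x = guided_step(x, 0, eps_model, f,
                    alpha_bar_t=torch.tensor(0.5),
                    alpha_bar_prev=torch.tensor(0.8))
    print(x.shape)  # torch.Size([4, 8])
```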
2404.14676 Report DreamPBR: Text-driven Generation of High-resolution SVBRDF with Multi-modal Guidance Linxuan Xin, Zheng Zhang, Jinfu Wei, Ge Li, Duan Gao Prior material creation methods had limitations in producing diverse results mainly because reconstruction-based methods relied on real-world measurements and generation-based methods were trained on relatively small material datasets. To address these challenges, we propose DreamPBR, a novel diffusion-based generative framework designed to create spatially-varying appearance properties guided by text and multi-modal controls, providing high controllability and diversity in material generation. Key to achieving diverse and high-quality PBR material generation lies in integrating the capabilities of recent large-scale vision-language models trained on billions of text-image pairs, along with material priors derived from hundreds of PBR material samples. We utilize a novel material Latent Diffusion Model (LDM) to establish the mapping between albedo maps and the corresponding latent space. The latent representation is then decoded into full SVBRDF parameter maps using a rendering-aware PBR decoder. Our method supports tileable generation through convolution with circular padding. Furthermore, we introduce a multi-modal guidance module, which includes pixel-aligned guidance, style image guidance, and 3D shape guidance, to enhance the control capabilities of the material LDM. We demonstrate the effectiveness of DreamPBR in material creation, showcasing its versatility and user-friendliness on a wide range of controllable generation and editing applications. DreamPBR, a novel diffusion-based generative framework for creating high-resolution spatially-varying bidirectional reflectance distribution functions (SVBRDFs) guided by text and multi-modal controls. Prior material creation methods were limited in producing diverse results due to relying on real-world measurements or training on small datasets. The method integrates pre-trained text-to-image diffusion models with material priors, using a two-stage material Latent Diffusion Model (LDM) and a rendering-aware PBR decoder. It also incorporates multi-modal guidance modules for pixel control, style control, and shape control. DreamPBR generates semantically correct and detailed materials based on various textual prompts, ranging from structured to imaginative. The method supports tileable generation through convolution with circular padding. DreamPBR enables a wide range of controllable generation and editing applications, showcasing its versatility and user-friendliness. Current implementation uses normal maps without displacement maps, leading to ignoring self-occlusion during rendering. Generating detailed textures requires users to craft lengthy descriptions. physically-based rendering, svbrdf, diffusion models, text-to-image synthesis, multi-modal learning
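The tileability claim above rests on circular padding in the convolutions. A minimal sketch of that trick, assuming a toy two-layer network rather than DreamPBR's decoder:

```python
# Minimal sketch of the tileability trick: circular padding makes the
# receptive field wrap around, so a generated texture tiles without seams.
# The tiny network here is an assumption, not DreamPBR's architecture.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, padding_mode="circular"), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1, padding_mode="circular"),
)

if __name__ == "__main__":
    x = torch.rand(1, 3, 64, 64)
    y = net(x)
    # Tiling the output 2x2 should show no visible seams at the boundaries.
    tiled = torch.cat([torch.cat([y, y], dim=-1)] * 2, dim=-2)
    print(tiled.shape)  # torch.Size([1, 3, 128, 128])
```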
2404.14674 Report HOIN: High-Order Implicit Neural Representations Yang Chen, Ruituo Wu, Yipeng Liu, Ce Zhu Implicit neural representations (INR) suffer from worsening spectral bias, which results in overly smooth solutions to the inverse problem. To deal with this problem, we propose a universal framework for processing inverse problems called \textbf{High-Order Implicit Neural Representations (HOIN)}. By refining the traditional cascade structure to foster high-order interactions among features, HOIN enhances the model's expressive power and mitigates spectral bias through its neural tangent kernel's (NTK) strong diagonal properties, accelerating and optimizing inverse problem resolution. By analyzing the model's expression space, high-order derivatives, and the NTK matrix, we theoretically validate the feasibility of HOIN. HOIN realizes 1 to 3 dB improvements in most inverse problems, establishing a new state-of-the-art recovery quality and training efficiency, thus providing a new general paradigm for INR and paving the way for it to solve the inverse problem. This paper introduces HOIN (High-Order Implicit Neural Representations), a novel framework designed to enhance the performance of Implicit Neural Representations (INRs) in tackling inverse problems. Traditional INRs struggle with spectral bias, resulting in overly smooth solutions lacking crucial high-frequency details. Existing mitigation strategies are often task-specific and fail to fully restore high-frequency details, highlighting the need for a universally applicable and effective solution. HOIN integrates high-order interaction blocks into INRs, expanding their functional space to capture richer, high-frequency information. This is achieved through a combination of suitable encoding layers (e.g., Hash Table, Position Encoding, Fourier Features) and a novel High-Order (HO) block architecture facilitating complex feature interactions. HOIN significantly improves image representation abilities, achieving higher PSNR values compared to baseline models, with HO-FFN demonstrating superior performance. In image denoising, HO-Pos.Enc excels due to its moderate acceleration in high-frequency learning, outperforming models that aggressively mitigate spectral bias and blend noise with signal details. HOIN consistently enhances performance in super-resolution, CT reconstruction, and image inpainting tasks, with HO-SIREN and HO-FFN consistently achieving superior results compared to other INR-based methods. While HOIN effectively mitigates spectral bias, careful consideration is needed regarding the degree of acceleration in high-frequency learning to avoid incorporating noise in specific inverse problems. Future work could explore the adaptation of HOIN to other domains beyond image processing, such as audio or 3D model reconstruction, to further evaluate its generalizability and effectiveness. implicit neural representation, inverse problem, spectral bias, high-frequency information, neural tangent kernel
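As a rough illustration of what a high-order interaction block can look like, the sketch below multiplies two linear branches element-wise so that each layer raises the polynomial order of its features; the specific block design is an assumption, not the paper's HO block.

```python
# Assumed illustration of a high-order feature interaction layer: the
# element-wise product of two branches creates second-order terms, and the
# residual keeps lower-order information flowing.
import torch
import torch.nn as nn

class HighOrderBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear_a = nn.Linear(dim, dim)
        self.linear_b = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear_a(x)) * self.linear_b(x)

if __name__ == "__main__":
    coords = torch.rand(4096, 2) * 2 - 1   # e.g. normalized pixel coordinates
    mlp = nn.Sequential(nn.Linear(2, 128), HighOrderBlock(128),
                        HighOrderBlock(128), nn.Linear(128, 3))
    print(mlp(coords).shape)  # torch.Size([4096, 3])
```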
2404.14667 Report 3DFlowRenderer: One-shot Face Re-enactment via Dense 3D Facial Flow Estimation Siddharth Nijhawan, Takuya Yashima, Tamaki Kojima Performing facial expression transfer under one-shot setting has been increasing in popularity among research community with a focus on precise control of expressions. Existing techniques showcase compelling results in perceiving expressions, but they lack robustness with extreme head poses. They also struggle to accurately reconstruct background details, thus hindering the realism. In this paper, we propose a novel warping technology which integrates the advantages of both 2D and 3D methods to achieve robust face re-enactment. We generate dense 3D facial flow fields in feature space to warp an input image based on target expressions without depth information. This enables explicit 3D geometric control for re-enacting misaligned source and target faces. We regularize the motion estimation capability of the 3D flow prediction network through proposed "Cyclic warp loss" by converting warped 3D features back into 2D RGB space. To ensure the generation of finer facial region with natural-background, our framework only renders the facial foreground region first and learns to inpaint the blank area which needs to be filled due to source face translation, thus reconstructing the detailed background without any unwanted pixel motion. Extensive evaluation reveals that our method outperforms state-of-the-art techniques in rendering artifact-free facial images. This paper proposes 3DFlowRenderer, a novel one-shot face re-enactment framework that leverages dense 3D facial flow estimation to enhance robustness and realism, especially in extreme head pose variations. Existing methods struggle with extreme head poses and accurate background reconstruction, limiting the realism of face re-enactment. This work addresses these limitations by integrating the strengths of both 2D and 3D methods. The proposed 3DFlowRenderer employs a four-stage process: 1) Pre-processing: separates foreground and background, estimates 3DMM parameters for target motion; 2) 3D Warping: computes dense 3D facial flow fields to warp source foreground based on target expressions; 3) Image Refinement: refines the warped foreground using a TransUNet block; and 4) Image Inpainting: projects refined foreground onto the source background and inpaints the missing regions using another TransUNet block. Outperforms state-of-the-art methods in terms of realism (FID), noise reduction (PSNR), reconstruction quality (SSIM), identity preservation (CSIM), and motion transfer accuracy (AED, AKD, APD). Demonstrates robustness to extreme head pose and expression variations. Successfully renders finer facial details and preserves background information without leakage or unwanted motion. The accuracy of 3DMM parameter estimation can impact the overall performance. Future work includes extending the framework for handling occlusions and incorporating temporal consistency for video re-enactment. face re-enactment, one-shot, 3d warping, image-to-image synthesis, 3dmm
2404.14581 Report The Adversarial AI-Art: Understanding, Generation, Detection, and Benchmarking Yuying Li, Zeyan Liu, Junyi Zhao, Liangqin Ren, Fengjun Li, Jiebo Luo, Bo Luo Generative AI models can produce high-quality images based on text prompts. The generated images often appear indistinguishable from images generated by conventional optical photography devices or created by human artists (i.e., real images). While the outstanding performance of such generative models is generally well received, security concerns arise. For instance, such image generators could be used to facilitate fraud or scam schemes, generate and spread misinformation, or produce fabricated artworks. In this paper, we present a systematic attempt at understanding and detecting AI-generated images (AI-art) in adversarial scenarios. First, we collect and share a dataset of real images and their corresponding artificial counterparts generated by four popular AI image generators. The dataset, named ARIA, contains over 140K images in five categories: artworks (painting), social media images, news photos, disaster scenes, and anime pictures. This dataset can be used as a foundation to support future research on adversarial AI-art. Next, we present a user study that employs the ARIA dataset to evaluate if real-world users can distinguish with or without reference images. In a benchmarking study, we further evaluate if state-of-the-art open-source and commercial AI image detectors can effectively identify the images in the ARIA dataset. Finally, we present a ResNet-50 classifier and evaluate its accuracy and transferability on the ARIA dataset. This paper presents ARIA, a comprehensive dataset of adversarial AI-generated art, and investigates the challenges in detecting such art by both humans and AI detectors. The rise of AI-generated art poses significant risks, including social media fraud, fake news, and art style imitation, necessitating a better understanding and reliable detection methods. The authors collected a large-scale dataset (ARIA) of real and AI-generated images across five categories. They conducted a user study to assess human detection ability and benchmarked various open-source and commercial AI image detectors. Human users struggle to distinguish real from AI-generated images, even with references. Most open-source and commercial detectors exhibit unsatisfactory accuracy, especially for images generated with both text and image prompts. Supervised classifiers trained on ARIA show promise, with models trained on Midjourney data demonstrating better generalizability. The dataset, while extensive, may not encompass the full spectrum of future AI models. Budget limitations restricted the evaluation of some commercial detectors. aigc, ai-generated images, ai-art, adversarial attacks, image detection
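The supervised baseline mentioned above is a standard ResNet-50 binary classifier. A generic torchvision recipe for such a detector is sketched below; the data layout, weights, and hyperparameters are placeholders rather than the paper's settings.

```python
# Generic ResNet-50 real-vs-AI image classifier (placeholder paths and
# hyperparameters; not the paper's exact training configuration).
import torch
import torch.nn as nn
from torchvision import models, transforms, datasets

def build_detector(num_classes: int = 2) -> nn.Module:
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # real vs. AI
    return model

if __name__ == "__main__":
    tfm = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224),
                              transforms.ToTensor()])
    # Expects an ImageFolder layout such as data/real/... and data/ai/... (placeholder).
    dataset = datasets.ImageFolder("data", transform=tfm)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
    model = build_detector()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```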
2404.14507 Report Align Your Steps: Optimizing Sampling Schedules in Diffusion Models Amirmojtaba Sabour, Sanja Fidler, Karsten Kreis Diffusion models (DMs) have established themselves as the state-of-the-art generative modeling approach in the visual domain and beyond. A crucial drawback of DMs is their slow sampling speed, relying on many sequential function evaluations through large neural networks. Sampling from DMs can be seen as solving a differential equation through a discretized set of noise levels known as the sampling schedule. While past works primarily focused on deriving efficient solvers, little attention has been given to finding optimal sampling schedules, and the entire literature relies on hand-crafted heuristics. In this work, for the first time, we propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs, called $\textit{Align Your Steps}$. We leverage methods from stochastic calculus and find optimal schedules specific to different solvers, trained DMs and datasets. We evaluate our novel approach on several image, video as well as 2D toy data synthesis benchmarks, using a variety of different samplers, and observe that our optimized schedules outperform previous hand-crafted schedules in almost all experiments. Our method demonstrates the untapped potential of sampling schedule optimization, especially in the few-step synthesis regime. A novel framework, named Align Your Steps (AYS), is introduced for optimizing sampling schedules in diffusion models, particularly beneficial for generating high-quality outputs in few-step synthesis. Diffusion models (DMs) are powerful but suffer from slow sampling speed due to sequential function evaluations. Optimizing sampling schedules, a previously overlooked aspect, can significantly enhance output quality and efficiency. The methodology leverages stochastic calculus to minimize the Kullback-Leibler divergence between the true generative SDE and a solver-specific linearized SDE. This is formulated as an optimization problem over the sampling schedule, solved iteratively using Monte Carlo integration with time-based importance sampling. Optimized schedules consistently outperform hand-crafted schedules across various datasets (2D toy data, CIFAR10, FFHQ, ImageNet), models (Stable Diffusion, SDXL, DeepFloyd-IF, Stable Video Diffusion), and solvers. Significant quality improvements are observed in the low NFE (Number of Function Evaluations) regime, with optimized schedules sometimes achieving quality comparable to default schedules with 1.5x fewer steps. Optimized schedules derived for one solver often generalize well to other solvers, both stochastic and deterministic. The optimization objective is an upper bound on the discretization error, necessitating an early stopping mechanism to avoid over-optimization. Optimizing schedules for conditional diffusion models, where the optimal schedule might vary depending on the conditioning input, needs further exploration. diffusion models, sampling schedules, generative modeling, stochastic calculus, optimization
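The key idea above is that the sampling schedule itself becomes the optimization variable. The sketch below shows one way to parameterize a monotone schedule of noise levels with learnable interior knots; the surrogate objective is a placeholder, not the paper's KL-based bound.

```python
# Treating the sampling schedule as the optimization variable: interior noise
# levels are free parameters in log-sigma space, kept monotone via softplus
# gaps, with the endpoints fixed. (Assumed parameterization; the paper's
# KL-upper-bound objective is not reproduced.)
import torch

class LearnableSchedule(torch.nn.Module):
    def __init__(self, sigma_max=80.0, sigma_min=0.002, num_steps=10):
        super().__init__()
        # Positive gaps between consecutive log-sigmas -> monotone schedule.
        self.gaps = torch.nn.Parameter(torch.zeros(num_steps))
        self.log_max = torch.log(torch.tensor(sigma_max))
        self.log_min = torch.log(torch.tensor(sigma_min))

    def forward(self) -> torch.Tensor:
        w = torch.nn.functional.softplus(self.gaps)
        cum = torch.cat([torch.zeros(1), torch.cumsum(w, 0)]) / w.sum()
        return torch.exp(self.log_max + cum * (self.log_min - self.log_max))

if __name__ == "__main__":
    sched = LearnableSchedule()
    sigmas = sched()           # 11 noise levels from sigma_max down to sigma_min
    print(sigmas[0].item(), sigmas[-1].item())
    # A surrogate objective could now be backpropagated into `sched.gaps`,
    # e.g. loss = surrogate(sigmas); loss.backward().
```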
2404.14410 Report Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses Inhee Lee, Byungjun Kim, Hanbyul Joo In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, enabling to conveniently and efficiently compose and render them together. In particular, we address the scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach to optimize the 3D-GS representation in a canonical space by fusing the sparse cues in the common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping the consistency with the observed 2D appearances. We demonstrate our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot, and extremely sparse observations. After reconstruction, our method is capable of not only rendering the scene in any novel views at arbitrary time instances, but also editing the 3D scene by removing individual humans or applying different motions for each human. Through various experiments, we demonstrate the quality and efficiency of our methods over alternative existing approaches. This paper proposes a novel method for reconstructing dynamic 3D scenes with multiple humans from monocular videos, addressing the challenges of sparse and limited observations. Reconstructing 4D scenes from monocular videos is crucial for various applications, but existing methods struggle with realistic human representation, especially under sparse observations. The method leverages 3D Gaussian Splatting to represent both the static world and dynamic humans, enabling efficient composing and rendering. It introduces a novel canonical space optimization approach that fuses sparse cues and utilizes a pre-trained 2D diffusion model with Texture Inversion to synthesize unseen human body parts, ensuring consistency with observed appearances. The method successfully reconstructs high-quality animatable 3D human avatars, even with severe occlusions and limited viewpoints. It demonstrates superior performance compared to existing approaches on challenging datasets like Panoptic and Hi4D. The proposed approach offers high computational efficiency, achieving real-time novel pose rendering speed. The method currently relies on provided SMPL fitting and primarily focuses on humans as dynamic objects. Future work could explore integrating SMPL estimation within the pipeline and extending the approach to encompass various dynamic objects beyond humans. 3d scene reconstruction, monocular video, dynamic humans, sparse observations, diffusion models
2404.14409 Report CrossScore: Towards Multi-View Image Evaluation and Scoring Zirui Wang, Wenjing Bian, Omkar Parkhi, Yuheng Ren, Victor Adrian Prisacariu We introduce a novel cross-reference image quality assessment method that effectively fills the gap in the image assessment landscape, complementing the array of established evaluation schemes -- ranging from full-reference metrics like SSIM, no-reference metrics such as NIQE, to general-reference metrics including FID, and Multi-modal-reference metrics, e.g., CLIPScore. Utilising a neural network with the cross-attention mechanism and a unique data collection pipeline from NVS optimisation, our method enables accurate image quality assessment without requiring ground truth references. By comparing a query image against multiple views of the same scene, our method addresses the limitations of existing metrics in novel view synthesis (NVS) and similar tasks where direct reference images are unavailable. Experimental results show that our method is closely correlated to the full-reference metric SSIM, while not requiring ground truth references. This paper introduces CrossScore, a novel cross-reference image quality assessment (CR-IQA) method for evaluating image quality using multiple unregistered reference views of the same scene. Existing IQA methods, relying on full-reference, no-reference, general-reference, or multi-modal-reference schemes, are inadequate for tasks like novel view synthesis (NVS) where ground truth references are unavailable for true novel views. The method utilizes a neural network with a cross-attention mechanism. It predicts a score map approximating the SSIM score by comparing a query image with a set of multi-view reference images. The model is trained using a self-supervised approach, leveraging NVS algorithms to generate distorted images and their corresponding SSIM maps. CrossScore exhibits a strong correlation with the full-reference SSIM score without requiring ground truth reference images. Trained solely on the Map-free Relocalisation (MFR) dataset, CrossScore generalizes well to other datasets, demonstrating its versatility. CrossScore effectively evaluates NVS renderings from true novel trajectories without ground truth, aligning with traditional SSIM-based evaluations. The score maps generated by CrossScore lack the sharpness of full-reference SSIM, potentially due to patch-wise encoding. The method faces challenges in evaluating unconventional images, such as those from fish-eye lenses, leading to inaccurate predictions. image quality assessment, novel view synthesis, cross-reference, cross-attention, self-supervised learning
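A minimal sketch of the cross-reference idea, assuming a simple patch embedding and a single cross-attention layer in which query-image tokens attend to tokens from several reference views; CrossScore's actual architecture and training targets are not reproduced here.

```python
# Assumed architecture sketch: query-image patch tokens cross-attend to
# tokens from multiple reference views, and a small head predicts a
# per-patch quality score in [0, 1].
import torch
import torch.nn as nn

class CrossRefScorer(nn.Module):
    def __init__(self, dim=256, patch=16, heads=8):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1), nn.Sigmoid())

    def tokens(self, img):                      # (B, 3, H, W) -> (B, N, dim)
        return self.embed(img).flatten(2).transpose(1, 2)

    def forward(self, query, refs):             # refs: (B, V, 3, H, W)
        q = self.tokens(query)
        B, V = refs.shape[:2]
        r = self.tokens(refs.flatten(0, 1)).reshape(B, -1, q.shape[-1])
        out, _ = self.attn(q, r, r)             # query attends to all ref tokens
        return self.head(out).squeeze(-1)       # per-patch score

if __name__ == "__main__":
    model = CrossRefScorer()
    score = model(torch.rand(2, 3, 128, 128), torch.rand(2, 4, 3, 128, 128))
    print(score.shape)  # torch.Size([2, 64]) -> one score per 16x16 patch
```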
2404.14403 Report GeoDiffuser: Geometry-Based Image Editing with Diffusion Models Rahul Sajnani, Jeroen Vanbaar, Jie Min, Kapil Katyal, Srinath Sridhar The success of image generative models has enabled us to build methods that can edit images based on text or other user input. However, these methods are bespoke, imprecise, require additional information, or are limited to only 2D image edits. We present GeoDiffuser, a zero-shot optimization-based method that unifies common 2D and 3D image-based object editing capabilities into a single method. Our key insight is to view image editing operations as geometric transformations. We show that these transformations can be directly incorporated into the attention layers in diffusion models to implicitly perform editing operations. Our training-free optimization method uses an objective function that seeks to preserve object style but generate plausible images, for instance with accurate lighting and shadows. It also inpaints disoccluded parts of the image where the object was originally located. Given a natural image and user input, we segment the foreground object using SAM and estimate a corresponding transform which is used by our optimization approach for editing. GeoDiffuser can perform common 2D and 3D edits like object translation, 3D rotation, and removal. We present quantitative results, including a perceptual study, that show how our approach is better than existing methods. Visit https://ivl.cs.brown.edu/research/geodiffuser.html for more information. GeoDiffuser is a novel zero-shot optimization-based method for 2D and 3D image editing that leverages the power of pre-trained diffusion models. It unifies various image editing capabilities, such as object translation, rotation, scaling, and removal, into a single framework by treating these operations as geometric transformations directly incorporated into the attention layers of diffusion models. Existing image editing methods often require bespoke solutions, lack precision, demand additional information (e.g., text prompts, optical flow), or are limited to 2D edits. GeoDiffuser overcomes these limitations by providing a unified and flexible approach for realistic and style-preserving image editing in both 2D and 3D. GeoDiffuser employs a shared attention mechanism within a diffusion model's editing framework. First, it performs DDIM inversion on the input image to obtain a latent noise trajectory. Then, it applies user-specified geometric transformations to the query embeddings of the reference attention layer, guiding the edit diffusion process. An optimization procedure, incorporating losses for background preservation, object preservation, inpainting, and smoothness, refines the edited image while ensuring realism and style consistency. Qualitative results demonstrate GeoDiffuser's capability to perform a variety of realistic 2D and 3D edits, including object translation, rotation, scaling, and removal, while preserving object style, lighting, shadows, and reflections. Quantitative evaluation, including a perceptual study, shows that users significantly prefer GeoDiffuser's editing results over existing methods like LaMa and Zero123-XL for realism, adherence to the desired edit, and inpainting quality. Metrics such as Mean Distance and Warp Error confirm GeoDiffuser's superior performance in accurately transforming foreground objects and adhering to user-specified edits compared to baselines. GeoDiffuser currently struggles with foreground object disocclusions arising from significant 3D motions. The method occasionally produces artifacts due to downsampled attention masks. image editing, diffusion models, geometric transformations, shared attention, zero-shot learning
2404.14396 Report SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap through integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely, SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides the competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. The models, codes, and datasets will be released in https://github.com/AILab-CVC/SEED-X. SEED-X, a versatile multimodal foundation model that integrates image comprehension of arbitrary sizes and multi-granularity image generation for real-world applications. Existing multimodal models struggle to effectively respond to user instructions and interact with diverse visual data in real-world scenarios. The authors incorporate a visual tokenizer for unified image comprehension and generation, dynamic resolution image encoding for arbitrary image size handling, and multi-stage training including pre-training on massive data and instruction tuning on domain-specific datasets. SEED-X achieves state-of-the-art image generation results on SEED-Bench-2, outperforming previous unified comprehension and generation models. The model demonstrates strong performance in multimodal comprehension tasks, achieving competitive results on benchmarks like MMB and SEED-Bench-2. Qualitative evaluations showcase SEED-X's capabilities as a multimodal AI assistant, excelling in tasks like image editing, text-rich comprehension, and creative image generation. The paper lacks an all-in-one instruction-tuned model, focusing on domain-specific fine-tuning instead. The advantage of dynamic resolution encoding is not fully demonstrated due to limited data with unusual aspect ratios in existing benchmarks. multimodal foundation model, image comprehension, image generation, instruction tuning, real-world applications
2404.14368 Report Graphic Design with Large Multimodal Model Yutao Cheng, Zhao Zhang, Maoke Yang, Hui Nie, Chunyuan Li, Xinglong Wu, Jie Shao In the field of graphic design, automating the integration of design elements into a cohesive multi-layered artwork not only boosts productivity but also paves the way for the democratization of graphic design. One existing practice is Graphic Layout Generation (GLG), which aims to layout sequential design elements. It has been constrained by the necessity for a predefined correct sequence of layers, thus limiting creative potential and increasing user workload. In this paper, we present Hierarchical Layout Generation (HLG) as a more flexible and pragmatic setup, which creates graphic composition from unordered sets of design elements. To tackle the HLG task, we introduce Graphist, the first layout generation model based on large multimodal models. Graphist efficiently reframes the HLG as a sequence generation problem, utilizing RGB-A images as input, outputs a JSON draft protocol, indicating the coordinates, size, and order of each element. We develop new evaluation metrics for HLG. Graphist outperforms prior arts and establishes a strong baseline for this field. Project homepage: https://github.com/graphic-design-ai/graphist This paper introduces Hierarchical Layout Generation (HLG), a new task for creating graphic compositions from unordered design elements, and presents Graphist, the first large multimodal model (LMM) for this task. HLG overcomes the limitations of previous Graphic Layout Generation (GLG) methods by removing the need for predefined layer ordering, allowing for greater flexibility and practicality in AI-assisted graphic design. Graphist reframes HLG as a sequence generation problem, taking RGB-A images as input and outputting a JSON draft protocol specifying element positions, sizes, and order. Graphist outperforms existing methods on GLG tasks and establishes a strong baseline for HLG. New evaluation metrics for HLG are introduced: Inverse Order Pair Ratio (IOPR) for layer order accuracy and GPT-4V Eval for overall aesthetic quality. Ablation studies demonstrate the importance of input sequence flexibility, LLM choice, visual token length, and the use of RGB-A over RGB images. Generating complete sets of high-quality design materials and aligning designs more closely with human aesthetics require further research. Potential negative impacts include design homogeneity and the environmental cost of model training. graphic design, layout generation, lmm, mllm, hlg
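The JSON draft protocol mentioned above can be pictured as a per-element record of position, size, and layer order. The example below is illustrative only; the field names are assumptions, not Graphist's exact schema.

```python
# Illustrative "draft protocol" for a layered composition (assumed field
# names): each design element gets coordinates, a size, and a layer order.
import json

draft = {
    "canvas": {"width": 1024, "height": 768},
    "elements": [
        {"id": "background.png", "x": 0,   "y": 0,   "w": 1024, "h": 768, "order": 0},
        {"id": "product.png",    "x": 312, "y": 180, "w": 400,  "h": 400, "order": 1},
        {"id": "headline.png",   "x": 96,  "y": 48,  "w": 832,  "h": 96,  "order": 2},
    ],
}

def render_order(d):
    """Return element ids sorted bottom-to-top for compositing."""
    return [e["id"] for e in sorted(d["elements"], key=lambda e: e["order"])]

if __name__ == "__main__":
    print(json.dumps(draft, indent=2))
    print(render_order(draft))
```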
2404.14249 Report CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Jingdong Wang, Qing Li, Kanglin Liu The recent 3D Gaussian Splatting (GS) exhibits high-quality and real-time synthesis of novel views in 3D scenes. Currently, it primarily focuses on geometry and appearance modeling, while lacking the semantic understanding of scenes. To bridge this gap, we present CLIP-GS, which integrates semantics from Contrastive Language-Image Pre-Training (CLIP) into Gaussian Splatting to efficiently comprehend 3D environments without annotated semantic data. In specific, rather than straightforwardly learning and rendering high-dimensional semantic features of 3D Gaussians, which significantly diminishes the efficiency, we propose a Semantic Attribute Compactness (SAC) approach. SAC exploits the inherent unified semantics within objects to learn compact yet effective semantic representations of 3D Gaussians, enabling highly efficient rendering (>100 FPS). Additionally, to address the semantic ambiguity, caused by utilizing view-inconsistent 2D CLIP semantics to supervise Gaussians, we introduce a 3D Coherent Self-training (3DCS) strategy, resorting to the multi-view consistency originated from the 3D model. 3DCS imposes cross-view semantic consistency constraints by leveraging refined, self-predicted pseudo-labels derived from the trained 3D Gaussian model, thereby enhancing precise and view-consistent segmentation results. Extensive experiments demonstrate that our method remarkably outperforms existing state-of-the-art approaches, achieving improvements of 17.29% and 20.81% in mIoU metric on Replica and ScanNet datasets, respectively, while maintaining real-time rendering speed. Furthermore, our approach exhibits superior performance even with sparse input data, verifying the robustness of our method. This paper introduces CLIP-GS, a novel method for real-time and accurate semantic understanding of 3D scenes using Gaussian Splatting. It leverages the inherent efficiency of Gaussian Splatting and incorporates semantic information from CLIP. Existing methods for 3D scene understanding either lack semantic comprehension or suffer from slow rendering speeds, hindering real-time applications like robotics and AR/VR. CLIP-GS addresses these limitations through two key innovations: 1) **Semantic Attribute Compactness (SAC):** Efficiently represents scene semantics by learning compact embeddings for 3D Gaussians. 2) **3D Coherent Self-training (3DCS):** Enhances semantic consistency across different views by leveraging cross-view self-predicted semantics. Significantly outperforms state-of-the-art methods in both semantic segmentation accuracy and rendering efficiency on Replica and ScanNet datasets. Achieves over 17% and 20% improvement in mIoU over the second-best method on Replica and ScanNet datasets, respectively, while maintaining real-time rendering speed (>100 FPS). Exhibits superior robustness compared to existing approaches, achieving high-quality reconstruction and segmentation even with sparse input data. The current implementation primarily focuses on indoor scenes and could be extended to handle more complex outdoor environments. Exploring the integration of temporal information for dynamic scene understanding presents a promising direction for future research. 3d gaussian splatting, real-time, view-consistent, 3d scene semantic understanding, 3d scene reconstruction
2404.14239 Report MultiBooth: Towards Generating All Your Concepts in an Image from Text Chenyang Zhu, Kai Li, Yue Ma, Chunming He, Li Xiu This paper introduces MultiBooth, a novel and efficient technique for multi-concept customization in image generation from text. Despite the significant advancements in customized generation methods, particularly with the success of diffusion models, existing methods often struggle with multi-concept scenarios due to low concept fidelity and high inference cost. MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. During the single-concept learning phase, we employ a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, we use bounding boxes to define the generation area for each concept within the cross-attention map. This method enables the creation of individual concepts within their specified regions, thereby facilitating the formation of multi-concept images. This strategy not only improves concept fidelity but also reduces additional inference cost. MultiBooth surpasses various baselines in both qualitative and quantitative evaluations, showcasing its superior performance and computational efficiency. Project Page: https://multibooth.github.io/ This paper introduces MultiBooth, a novel and efficient two-phase method for multi-concept customization in text-to-image generation, addressing the limitations of existing techniques in handling multiple customized subjects. Existing customized generation methods primarily focus on single-concept customization and struggle to generate high-fidelity images with multiple customized subjects while preserving text alignment. MultiBooth employs a two-phase approach: single-concept learning using a multi-modal encoder, adaptive concept normalization, and efficient concept encoding, followed by multi-concept integration using a regional customization module within the cross-attention layers of the U-Net. MultiBooth achieves superior image quality, faithfulness to intended concepts, and alignment with text prompts compared to state-of-the-art methods. The method demonstrates high efficiency in both training and inference time due to its single-concept learning and regional customization module. The framework exhibits flexibility and can be seamlessly integrated with other techniques like LoRA-based DreamBooth and ControlNet for enhanced customization. The current method still requires training for learning new concepts. Future work will focus on exploring training-free multi-concept customization based on MultiBooth. text-to-image generation, personalized image generation, multi-concept customization, diffusion models, adaptive concept normalization
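A minimal sketch of the regional idea, assuming plain dot-product cross-attention without learned projections: each concept's token set only contributes to the image tokens inside its bounding box. This illustrates the mechanism rather than MultiBooth's exact module.

```python
# Bounding-box-restricted cross-attention (assumed simplification): each
# concept's text tokens influence only the image tokens inside its box.
import torch

def regional_cross_attention(img_tokens, concept_tokens, boxes, hw):
    """img_tokens: (B, H*W, D); concept_tokens: list of (B, L_i, D);
    boxes: list of (x0, y0, x1, y1) in [0, 1]; hw: (H, W)."""
    B, N, D = img_tokens.shape
    H, W = hw
    out = torch.zeros_like(img_tokens)
    ys = (torch.arange(N) // W).float() / H   # per-token normalized row
    xs = (torch.arange(N) % W).float() / W    # per-token normalized column
    for tok, (x0, y0, x1, y1) in zip(concept_tokens, boxes):
        attn = torch.softmax(img_tokens @ tok.transpose(1, 2) / D ** 0.5, dim=-1)
        region = ((xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)).float()
        out = out + (attn @ tok) * region.view(1, N, 1)
    return out

if __name__ == "__main__":
    img = torch.randn(1, 16 * 16, 64)
    concepts = [torch.randn(1, 8, 64), torch.randn(1, 8, 64)]
    boxes = [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)]   # left / right halves
    print(regional_cross_attention(img, concepts, boxes, (16, 16)).shape)
```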
2404.14199 Report Generalizable Neural Human Renderer Mana Masuda, Jinhyung Park, Shun Iwase, Rawal Khirodkar, Kris Kitani While recent advancements in animatable human rendering have achieved remarkable results, they require test-time optimization for each subject which can be a significant limitation for real-world applications. To address this, we tackle the challenging task of learning a Generalizable Neural Human Renderer (GNH), a novel method for rendering animatable humans from monocular video without any test-time optimization. Our core method focuses on transferring appearance information from the input video to the output image plane by utilizing explicit body priors and multi-view geometry. To render the subject in the intended pose, we utilize a straightforward CNN-based image renderer, foregoing the more common ray-sampling or rasterizing-based rendering modules. Our GNH achieves remarkable generalizable, photorealistic rendering with unseen subjects with a three-stage process. We quantitatively and qualitatively demonstrate that GNH significantly surpasses current state-of-the-art methods, notably achieving a 31.3% improvement in LPIPS. The paper introduces GNH, a generalizable neural human renderer that generates animatable humans from monocular videos without test-time optimization. Existing animatable human rendering methods require time-consuming per-subject optimization, limiting their practical application. GNH uses a three-stage process: 1) appearance feature extraction from input video frames, 2) feature transformation to the target pose and projection to 2D, 3) multi-frame feature fusion and rendering using a CNN. GNH outperforms state-of-the-art generalizable human rendering methods, achieving a 31.3% improvement in LPIPS. GNH demonstrates superior rendering quality compared to methods requiring test-time optimization or multi-view inputs. The rendering speed of GNH is 2-7 times faster than baseline generalizable human NeRF methods. GNH relies on accurate pose and mask estimations for input views, which can impact performance. The model does not account for dynamic lighting changes. neural rendering, novel view synthesis, human rendering, generalizable rendering, monocular video
2404.14162 Report FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on Chenhui Wang, Tao Chen, Zhihao Chen, Zhizhong Huang, Taoran Jiang, Qi Wang, Hongming Shan Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details. This paper proposes FLDM-VTON, a novel faithful latent diffusion model for virtual try-on that enhances the faithfulness of generated clothing details. Existing latent diffusion model-based virtual try-on methods often produce unfaithful clothing details due to the stochastic nature of diffusion models and latent supervision. FLDM-VTON incorporates warped clothes as priors, introduces a clothes flattening network for clothes-consistent supervision, and employs clothes-posterior sampling for faithful inference. FLDM-VTON outperforms state-of-the-art baselines on VITON-HD and Dress Code datasets, demonstrating superior performance in generating realistic try-on images with faithful clothing details. The proposed method effectively preserves complex style, pattern, and text on clothes, addressing limitations of previous approaches. Ablation studies validate the contribution of each proposed component to the overall performance. FLDM-VTON may struggle with preserving extremely small or complex logos and patterns due to information loss during the latent diffusion process. Future work could explore diffusion in pixel space or utilize a more robust pre-trained LDM to address this limitation. virtual try-on, diffusion models, faithful image generation, clothes-consistent supervision, posterior sampling
2404.14132 Report CRNet: A Detail-Preserving Network for Unified Image Restoration and Enhancement Task Kangzhen Yang, Tao Hu, Kexin Dai, Genggeng Chen, Yu Cao, Wei Dong, Peng Wu, Yanning Zhang, Qingsen Yan In real-world scenarios, images captured often suffer from blurring, noise, and other forms of image degradation, and due to sensor limitations, people usually can only obtain low dynamic range images. To achieve high-quality images, researchers have attempted various image restoration and enhancement operations on photographs, including denoising, deblurring, and high dynamic range imaging. However, merely performing a single type of image enhancement still cannot yield satisfactory images. In this paper, to deal with the challenge above, we propose the Composite Refinement Network (CRNet) to address this issue using multiple exposure images. By fully integrating information-rich multiple exposure inputs, CRNet can perform unified image restoration and enhancement. To improve the quality of image details, CRNet explicitly separates and strengthens high and low-frequency information through pooling layers, using specially designed Multi-Branch Blocks for effective fusion of these frequencies. To increase the receptive field and fully integrate input features, CRNet employs the High-Frequency Enhancement Module, which includes large kernel convolutions and an inverted bottleneck ConvFFN. Our model secured third place in the first track of the Bracketing Image Restoration and Enhancement Challenge, surpassing previous SOTA models in both testing metrics and visual quality. This paper proposes Composite Refinement Network (CRNet), a novel architecture for unified image restoration and enhancement using multiple exposure images, which effectively restores high-frequency details and outperforms previous state-of-the-art methods. Existing methods often focus on individual image restoration or enhancement tasks and fail to adequately enhance high-frequency details, leading to unsatisfactory results. CRNet addresses this gap by unifying these tasks and improving high-frequency detail restoration. CRNet aligns multiple exposure images using optical flow, separates high and low-frequency information using pooling layers, and employs Multi-Branch Blocks for effective fusion. It also utilizes a Convolutional Enhancement Block with large kernel convolutions and an inverted bottleneck ConvFFN to enhance feature fusion and increase the receptive field. CRNet achieves state-of-the-art performance on the Bracketing Image Restoration and Enhancement Challenge dataset, surpassing previous methods in both visual quality and evaluation metrics. Ablation studies demonstrate the effectiveness of each module in CRNet, highlighting the importance of frequency separation, Multi-Branch Blocks, and the Convolutional Enhancement Block. CRNet secured third place in track 1 of the Bracketing Image Restoration and Enhancement Challenge, exhibiting significantly lower computational costs compared to other top-ranking models. The model's performance could be further investigated on a wider range of real-world datasets with diverse degradation types. Exploring alternative frequency separation and fusion techniques may lead to further improvements in image quality. image restoration, image enhancement, high dynamic range (hdr) imaging, deep learning, multi-exposure fusion
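The frequency separation described above can be pictured with a simple pooling-based decomposition: the low-frequency branch is an average-pooled-and-upsampled copy and the high-frequency branch is the residual. The pooling factor and recombination below are assumptions, not CRNet's exact configuration.

```python
# Pooling-based frequency separation: low = blurred copy, high = residual.
# The two branches can then be processed by dedicated blocks and recombined.
import torch
import torch.nn.functional as F

def split_frequencies(x: torch.Tensor, factor: int = 4):
    low = F.avg_pool2d(x, kernel_size=factor)
    low = F.interpolate(low, size=x.shape[-2:], mode="bilinear", align_corners=False)
    high = x - low
    return low, high

if __name__ == "__main__":
    x = torch.rand(1, 3, 256, 256)
    low, high = split_frequencies(x)
    recon = low + high
    print(torch.allclose(recon, x))  # True: low + high recovers the input
```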
2404.14055 Report RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification Hai Ci, Pei Yang, Yiren Song, Mike Zheng Shou We revisit Tree-Ring Watermarking, a recent diffusion model watermarking method that demonstrates great robustness to various attacks. We conduct an in-depth study on it and reveal that the distribution shift unintentionally introduced by the watermarking process, apart from watermark pattern matching, contributes to its exceptional robustness. Our investigation further exposes inherent flaws in its original design, particularly in its ability to identify multiple distinct keys, where distribution shift offers no assistance. Based on these findings and analysis, we present RingID for enhanced multi-key identification. It consists of a novel multi-channel heterogeneous watermarking approach designed to seamlessly amalgamate distinctive advantages from diverse watermarks. Coupled with a series of suggested enhancements, RingID exhibits substantial advancements in multi-key identification. Github Page: https://github.com/showlab/RingID This paper revisits Tree-Ring Watermarking and identifies an overlooked factor contributing to its robustness: distribution shift introduced during watermark imprinting. The paper further reveals vulnerabilities in Tree-Ring's ability to identify multiple keys, particularly under image transformations like rotation and cropping/scaling, and proposes RingID, an enhanced watermarking method for improved multi-key identification. Identifying the source and authenticity of AI-generated images, especially with the rise of advanced diffusion models, is crucial for copyright protection and combating malicious uses. The authors analyze the impact of distribution shift on Tree-Ring's performance under different attacks. They propose RingID, which leverages a multi-channel heterogeneous watermarking framework, discretization, and lossless imprinting for enhanced distinguishability and robustness. Distribution shift, stemming from discarding the imaginary part during watermarking, significantly contributes to Tree-Ring's robustness in verification tasks, particularly against rotation and cropping/scaling. Tree-Ring shows limited effectiveness in identifying multiple keys, particularly under attacks. RingID significantly outperforms Tree-Ring in multi-key identification while maintaining comparable image generation quality. Both Tree-Ring and RingID remain vulnerable to cropping and scaling attacks in multi-key identification scenarios. Future work could explore different transform domains for enhanced robustness against cropping and scaling. diffusion models, tree-ring watermarking, multi-key identification, watermarking, copyright protection
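As a simplified picture of ring-style watermark imprinting in the Fourier domain (the mechanism both Tree-Ring and RingID build on), the sketch below writes key values onto concentric rings of the initial noise's spectrum; RingID's multi-channel heterogeneous design and lossless imprinting are not reproduced.

```python
# Simplified ring-watermark imprinting into the spectrum of initial latent
# noise (illustrative only; not RingID's full scheme).
import torch

def imprint_rings(latent: torch.Tensor, key: torch.Tensor, ring_width: int = 2):
    """latent: (C, H, W) Gaussian noise; key: (num_rings,) values to write."""
    C, H, W = latent.shape
    spec = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    radius = ((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float().sqrt()
    for i, value in enumerate(key):
        ring = (radius >= i * ring_width) & (radius < (i + 1) * ring_width)
        spec[:, ring] = value.to(spec.dtype)   # constant value along the ring
    # Taking .real discards the imaginary part; the paper identifies exactly
    # this step as the source of the distribution shift discussed above.
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

if __name__ == "__main__":
    noise = torch.randn(4, 64, 64)
    key = torch.tensor([3.0, -3.0, 3.0, -3.0, 3.0])
    print(imprint_rings(noise, key).shape)  # torch.Size([4, 64, 64])
```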
2404.14044 Report HashPoint: Accelerated Point Searching and Sampling for Neural Rendering Jiahao Ma, Miaomiao Liu, David Ahmedt-Aristizaba, Chuong Nguyen In this paper, we address the problem of efficient point searching and sampling for volume neural rendering. Within this realm, two typical approaches are employed: rasterization and ray tracing. The rasterization-based methods enable real-time rendering at the cost of increased memory and lower fidelity. In contrast, the ray-tracing-based methods yield superior quality but demand longer rendering time. We solve this problem by our HashPoint method combining these two strategies, leveraging rasterization for efficient point searching and sampling, and ray marching for rendering. Our method optimizes point searching by rasterizing points within the camera's view, organizing them in a hash table, and facilitating rapid searches. Notably, we accelerate the rendering process by adaptive sampling on the primary surface encountered by the ray. Our approach yields substantial speed-up for a range of state-of-the-art ray-tracing-based methods, maintaining equivalent or superior accuracy across synthetic and real test datasets. The code will be available at https://jiahao-ma.github.io/hashpoint/. Presents HashPoint, a novel method that combines rasterization and ray tracing for efficient point searching and adaptive sampling in neural rendering. Addresses the limitations of existing point cloud rendering methods that are either fast but low-fidelity (rasterization-based) or high-quality but slow (ray-tracing-based). Transforms the 3D point cloud search to a 2D image plane for efficient hash table lookup and introduces adaptive primary surface sampling based on distance to the viewpoint and point cloud distribution. Achieves up to 80x speedup compared to existing ray-tracing methods like Point-NeRF while maintaining similar visual quality. Outperforms traditional point cloud search methods (Uniform Grid, K-d tree, Octree) in efficiency for ray casting. Demonstrates robust performance on various datasets (Synthetic-NeRF, Waymo, Replica, ShapeNet). Current implementation requires multi-surface sampling during initial optimization due to gradient propagation issues. The \beta parameter, controlling sampling scope, is fixed and could be dynamically adjusted based on geometry noise and optimization progress in future work. neural rendering, point cloud, ray tracing, rasterization, adaptive sampling
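The point-searching idea above amounts to bucketing projected points by image-plane cell so candidates for a ray can be fetched in constant time. A minimal sketch, assuming a pinhole camera and a fixed cell size:

```python
# Image-plane hashing of a point cloud (assumed camera model and cell size):
# project points, bucket indices by pixel cell, and look up candidates for a
# ray via the pixel it passes through.
from collections import defaultdict
import numpy as np

def build_pixel_hash(points, fx, fy, cx, cy, cell=4):
    table = defaultdict(list)
    z = points[:, 2]
    u = np.round(fx * points[:, 0] / z + cx).astype(int)
    v = np.round(fy * points[:, 1] / z + cy).astype(int)
    for idx, (ui, vi, zi) in enumerate(zip(u, v, z)):
        if zi > 0:                            # keep points in front of the camera
            table[(ui // cell, vi // cell)].append(idx)
    return table

def candidates_for_pixel(table, u, v, cell=4):
    return table.get((u // cell, v // cell), [])

if __name__ == "__main__":
    pts = np.random.randn(10000, 3) + np.array([0.0, 0.0, 5.0])
    table = build_pixel_hash(pts, fx=500, fy=500, cx=320, cy=240)
    print(len(candidates_for_pixel(table, 320, 240)))  # points near image center
```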
2404.14037 Report GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting Hongyun Yu, Zhan Qu, Qihang Yu, Jianchuan Chen, Zhonghua Jiang, Zhiwen Chen, Shengyu Zhang, Jimin Xu, Fei Wu, Chengfei Lv, Gang Yu Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, and visual jitter and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, Speaker-specific Motion Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms. GaussianTalker, a novel audio-driven talking head synthesis framework using 3D Gaussian Splatting bound to the FLAME model, generates realistic videos with accurate lip synchronization. Existing methods struggle with unnatural lip movements, visual jitters, and artifacts due to limitations in pose and expression control with implicit representations like NeRF. GaussianTalker uses a Speaker-specific Motion Translator for natural lip movements by decoupling identity information and using personalized embeddings. It also employs a Dynamic Gaussian Renderer with Speaker-specific BlendShapes to refine facial details and enhance visual realism. Outperforms state-of-the-art methods in image quality (PSNR, SSIM, LPIPS, FID) and lip synchronization (LMD, LSE-C, LSE-D). Achieves ultra-high rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, enabling real-time performance. Demonstrates strong generalization capability across different speakers, languages, and audio inputs. The lack of teeth in the original FLAME model necessitates manual additions, which may not fully capture dental details. Further exploration is needed to extend the approach beyond talking head synthesis, capturing a wider range of body movements and expressions. talking head synthesis, 3d gaussian splatting, speaker-specific, facial animation, real-time rendering
2404.14007 Report Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting Weili Zeng, Yichao Yan, Qi Zhu, Zhuo Chen, Pengzhi Chu, Weiming Zhao, Xiaokang Yang Text-to-image (T2I) customization aims to create images that embody specific visual concepts delineated in textual descriptions. However, existing works still face a main challenge, concept overfitting. To tackle this challenge, we first analyze overfitting, categorizing it into concept-agnostic overfitting, which undermines non-customized concept knowledge, and concept-specific overfitting, which is confined to customize on limited modalities, i.e, backgrounds, layouts, styles. To evaluate the overfitting degree, we further introduce two metrics, i.e, Latent Fisher divergence and Wasserstein metric to measure the distribution changes of non-customized and customized concept respectively. Drawing from the analysis, we propose Infusion, a T2I customization method that enables the learning of target concepts to avoid being constrained by limited training modalities, while preserving non-customized knowledge. Remarkably, Infusion achieves this feat with remarkable efficiency, requiring a mere 11KB of trained parameters. Extensive experiments also demonstrate that our approach outperforms state-of-the-art methods in both single and multi-concept customized generation. This paper presents "Infusion," a text-to-image customization method that leverages the generative capabilities of foundational models while mitigating concept overfitting. Existing T2I customization methods struggle with concept overfitting, which limits their ability to generate diverse and imaginative images that incorporate specific visual concepts. Infusion decouples attention maps and value features in cross-attention modules. It preserves the foundational model's attention maps for layout and posture diversity, while learning residual value embeddings for customized concepts. Infusion demonstrates superior performance in generating imaginative and concept-faithful images compared to state-of-the-art methods. It effectively mitigates both concept-agnostic and concept-specific overfitting. Infusion offers a lightweight and plug-and-play solution for single- and multi-concept customization. Infusion might face limitations in preserving intricate textures when high fidelity is required. Future work could explore training strategies that optimize the balance between diversity and fidelity for specific customization tasks. text-to-image generation, t2i customization, concept overfitting, diffusion models, cross-attention
2404.13984 Report RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance Chengrui Wang, Pengfei Liu, Min Zhou, Ming Zeng, Xubin Li, Tiezheng Ge, Bo Zheng Although diffusion models can generate high-quality human images, their applications are limited by the instability in generating hands with correct structures. Some previous works mitigate the problem by considering hand structure yet struggle to maintain style consistency between refined malformed hands and other image regions. In this paper, we aim to solve the problem of inconsistency regarding hand structure and style. We propose a conditional diffusion-based framework RHanDS to refine the hand region with the help of decoupled structure and style guidance. Specifically, the structure guidance is the hand mesh reconstructed from the malformed hand, serving to correct the hand structure. The style guidance is a hand image, e.g., the malformed hand itself, and is employed to furnish the style reference for hand refining. In order to suppress the structure leakage when referencing hand style and effectively utilize hand data to improve the capability of the model, we build a multi-style hand dataset and introduce a two-stage training strategy. In the first stage, we use paired hand images for training to generate hands with the same style as the reference. In the second stage, various hand images generated based on the human mesh are used for training to enable the model to gain control over the hand structure. We evaluate our method and counterparts on the test dataset of the proposed multi-style hand dataset. The experimental results show that RHanDS can effectively refine hands with correct structure and style compared with previous methods. The codes and datasets will be available soon. RHanDS, a novel diffusion-based framework that refines malformed hands in generated images by leveraging decoupled structure and style guidance. Existing diffusion models struggle to generate hands with correct structures while maintaining style consistency with the rest of the image. RHanDS uses a two-stage training strategy: first learning style guidance from paired hand images and then learning structure guidance from hand-mesh pairs. It utilizes a hand mesh reconstructed from the malformed hand for structure guidance and a separate hand image for style guidance. RHanDS effectively refines hands with correct structure and consistent style compared to previous methods. A user study confirms that RHanDS produces more preferred results with better style consistency and structure quality. The two-stage training strategy is crucial for achieving both accurate structure and style preservation. RHanDS may struggle with specific styles or complex hand configurations, such as hands wearing gloves or holding objects. Automatic hand mesh reconstruction can fail in some cases, requiring manual intervention. malformed hand refining, diffusion models, conditional generation, hand structure, hand style
2404.13944 Report Gorgeous: Create Your Desired Character Facial Makeup from Any Ideas Jia Wei Sii, Chee Seng Chan Contemporary makeup transfer methods primarily focus on replicating makeup from one face to another, considerably limiting their use in creating diverse and creative character makeup essential for visual storytelling. Such methods typically fail to address the need for uniqueness and contextual relevance, specifically aligning with character and story settings as they depend heavily on existing facial makeup in reference images. This approach also presents a significant challenge when attempting to source a perfectly matched facial makeup style, further complicating the creation of makeup designs inspired by various story elements, such as theme, background, and props that do not necessarily feature faces. To address these limitations, we introduce $Gorgeous$, a novel diffusion-based makeup application method that goes beyond simple transfer by innovatively crafting unique and thematic facial makeup. Unlike traditional methods, $Gorgeous$ does not require the presence of a face in the reference images. Instead, it draws artistic inspiration from a minimal set of three to five images, which can be of any type, and transforms these elements into practical makeup applications directly on the face. Our comprehensive experiments demonstrate that $Gorgeous$ can effectively generate distinctive character facial makeup inspired by the chosen thematic reference images. This approach opens up new possibilities for integrating broader story elements into character makeup, thereby enhancing the narrative depth and visual impact in storytelling. $Gorgeous$, a novel diffusion-based makeup application method that creates unique and thematic facial makeup from a minimal set of 3-5 reference images, regardless of whether the images contain faces. Existing makeup transfer methods are limited to replicating existing makeup looks from source faces, hindering creativity and diversity in character design for visual storytelling. Gorgeous uses three components: (i) MaFor Module: learns makeup knowledge and preserves facial identity using ControlNet; (ii) CSL Module: encodes artistic elements from reference images into text embeddings using textual inversion; (iii) MaIP Pipeline: combines MaFor and CSL to apply makeup seamlessly on the face using an inpainting-like approach. Gorgeous generates more unique and diverse character facial makeups compared to traditional makeup transfer methods. Gorgeous can effectively adapt makeup styles from non-facial images, overcoming the limitations of existing methods relying on face parsing. User study (N=100) showed a strong preference for makeups generated by Gorgeous, highlighting its ability to generate appealing and relevant character makeups. Current evaluation metrics for makeup assessment are limited, focusing on global style rather than makeup-specific nuances like color accuracy and texture fidelity. Future work will focus on developing new metrics specifically designed to evaluate makeup style similarity, considering factors like color harmony, textural alignment, and contextual relevance. makeup generation, character design, diffusion models, textual inversion, image inpainting
2404.13923 Report MaterialSeg3D: Segmenting Dense Materials from 2D Priors for 3D Assets Zeyu Li, Ruitong Gan, Chuanchen Luo, Yuxi Wang, Jiaheng Liu, Ziwei Zhu, Man Zhang, Qing Li, Xucheng Yin, Zhaoxiang Zhang, Junran Peng Driven by powerful image diffusion models, recent research has achieved the automatic creation of 3D objects from textual or visual guidance. By performing score distillation sampling (SDS) iteratively across different views, these methods succeed in lifting 2D generative prior to the 3D space. However, such a 2D generative image prior bakes the effect of illumination and shadow into the texture. As a result, material maps optimized by SDS inevitably involve spurious correlated components. The absence of precise material definition makes it infeasible to relight the generated assets reasonably in novel scenes, which limits their application in downstream scenarios. In contrast, humans can effortlessly circumvent this ambiguity by deducing the material of the object from its appearance and semantics. Motivated by this insight, we propose MaterialSeg3D, a 3D asset material generation framework to infer underlying material from the 2D semantic prior. Based on such a prior model, we devise a mechanism to parse material in 3D space. We maintain a UV stack, each map of which is unprojected from a specific viewpoint. After traversing all viewpoints, we fuse the stack through a weighted voting scheme and then employ region unification to ensure the coherence of the object parts. To fuel the learning of semantics prior, we collect a material dataset, named Materialized Individual Objects (MIO), which features abundant images, diverse categories, and accurate annotations. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method. This paper introduces MaterialSeg3D, a novel workflow that leverages 2D material priors to generate accurate and realistic surface materials for 3D assets, addressing the limitations of existing methods that struggle with realistic material generation. High-quality PBR materials are crucial for 3D assets to appear realistic under various lighting conditions, but existing 3D asset generation methods often lack accurate material information or struggle to generate realistic materials. The method employs a multi-view rendering approach, generating images of the 3D asset from various angles. These renderings are then fed into a material segmentation model trained on a novel dataset called Materialized Individual Objects (MIO). This dataset contains single-object images with dense material semantic annotations. Finally, the predicted material labels from different views are projected back onto the UV map and fused using a weighted voting mechanism. MaterialSeg3D effectively generates accurate and realistic surface materials for 3D assets, outperforming existing methods. The proposed MIO dataset, with its diverse camera angles and material annotations, proves valuable for training the material segmentation model. The weighted voting mechanism effectively combines material predictions from different views, ensuring accurate material assignment on the 3D asset's surface. The current implementation relies on 3D assets with pre-existing Albedo UV maps, limiting its applicability to assets without such information. The quality of the generated surface material is influenced by the quality of the input mesh; low-quality meshes can lead to less accurate results.
3d asset generation, surface material generation, material segmentation, multi-view rendering, pbr materials
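The weighted-voting fusion step described above can be sketched as follows; this is an illustrative NumPy version under assumed inputs (per-view material label maps already unprojected to UV space, plus per-view vote weights), not the paper's implementation.

```python
import numpy as np

def fuse_uv_labels(uv_labels, uv_weights, num_materials):
    """uv_labels: (V, H, W) per-view material labels on the UV map (-1 = unseen texel);
    uv_weights: (V, H, W) per-view vote weights; returns (H, W) fused labels."""
    V, H, W = uv_labels.shape
    votes = np.zeros((num_materials, H, W), dtype=np.float32)
    for v in range(V):
        seen = uv_labels[v] >= 0
        rows, cols = np.nonzero(seen)
        np.add.at(votes, (uv_labels[v][seen], rows, cols), uv_weights[v][seen])
    fused = votes.argmax(axis=0)
    fused[votes.sum(axis=0) == 0] = -1  # texel never observed from any viewpoint
    return fused

labels = np.random.randint(-1, 5, size=(8, 64, 64))      # 8 views, 5 material classes
weights = np.random.rand(8, 64, 64).astype(np.float32)   # e.g. view/visibility confidence
print(fuse_uv_labels(labels, weights, num_materials=5).shape)  # (64, 64)
```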
2404.13903 Report Accelerating Image Generation with Sub-path Linear Approximation Model Chen Xu, Tianhui Song, Weixin Feng, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang Diffusion models have significantly advanced the state of the art in image, audio, and video generation tasks. However, their applications in practical scenarios are hindered by slow inference speed. Drawing inspiration from the approximation strategies utilized in consistency models, we propose the Sub-path Linear Approximation Model (SLAM), which accelerates diffusion models while maintaining high-quality image generation. SLAM treats the PF-ODE trajectory as a series of PF-ODE sub-paths divided by sampled points, and harnesses sub-path linear (SL) ODEs to form a progressive and continuous error estimation along each individual PF-ODE sub-path. The optimization on such SL-ODEs allows SLAM to construct denoising mappings with smaller cumulative approximated errors. An efficient distillation method is also developed to facilitate the incorporation of more advanced diffusion models, such as latent diffusion models. Our extensive experimental results demonstrate that SLAM achieves an efficient training regimen, requiring only 6 A100 GPU days to produce a high-quality generative model capable of 2 to 4-step generation with high performance. Comprehensive evaluations on LAION, MS COCO 2014, and MS COCO 2017 datasets also illustrate that SLAM surpasses existing acceleration methods in few-step generation tasks, achieving state-of-the-art performance both on FID and the quality of the generated images. This paper introduces SLAM (Sub-path Linear Approximation Model) which accelerates diffusion models while preserving high-quality image generation. Diffusion models, despite impressive results, suffer from slow inference speed, hindering practical use. SLAM addresses this by accelerating generation without compromising quality. SLAM divides the Probability Flow ODE trajectory into sub-paths and approximates them with linear ODEs. This allows for a more nuanced optimization of denoising mappings, reducing cumulative errors. SLAM outperforms existing acceleration methods in few-step generation on FID and image quality across LAION, MS COCO 2014, and MS COCO 2017 datasets. The method exhibits efficient training, needing only 6 A100 GPU days for a high-quality generative model capable of 2 to 4-step generation. SLAM consistently achieves smaller denoising mapping errors compared to methods like LCM, especially at larger timesteps, as evidenced by quantitative analysis. The paper primarily focuses on text-to-image generation, leaving exploration of other modalities for future work. While SLAM mitigates the limitations of large skipping step sizes, further investigation into optimal step size selection strategies is warranted. diffusion models, accelerating diffusion models, diffusion model distillation, consistency models, image generation
2404.13896 Report CT-NeRF: Incremental Optimizing Neural Radiance Field and Poses with Complex Trajectory Yunlong Ran, Yanxu Li, Qi Ye, Yuchi Huo, Zechun Bai, Jiahao Sun, Jiming Chen Neural radiance field (NeRF) has achieved impressive results in high-quality 3D scene reconstruction. However, NeRF heavily relies on precise camera poses. While recent works like BARF have introduced camera pose optimization within NeRF, their applicability is limited to simple trajectory scenes. Existing methods struggle while tackling complex trajectories involving large rotations. To address this limitation, we propose CT-NeRF, an incremental reconstruction optimization pipeline using only RGB images without pose and depth input. In this pipeline, we first propose a local-global bundle adjustment under a pose graph connecting neighboring frames to enforce the consistency between poses to escape the local minima caused by only pose consistency with the scene structure. Further, we instantiate the consistency between poses as a reprojected geometric image distance constraint resulting from pixel-level correspondences between input image pairs. Through the incremental reconstruction, CT-NeRF enables the recovery of both camera poses and scene structure and is capable of handling scenes with complex trajectories. We evaluate the performance of CT-NeRF on two real-world datasets, NeRFBuster and Free-Dataset, which feature complex trajectories. Results show CT-NeRF outperforms existing methods in novel view synthesis and pose estimation accuracy. This paper proposes CT-NeRF, an incremental reconstruction optimization pipeline that jointly optimizes neural radiance fields and camera poses using only RGB images, particularly addressing challenges in scenes with complex trajectories involving large rotations. Existing NeRF-based methods often struggle with complex trajectories due to reliance on precise camera poses or limitations in handling large rotations. This work aims to address this gap and enable accurate 3D scene reconstruction in such challenging scenarios. The method introduces a local-global bundle adjustment with pose graphs connecting neighboring frames, enforcing pose consistency beyond just the scene structure. A reprojected geometric image distance constraint, derived from learned correspondences between image pairs, is used to robustly optimize poses and scene geometry. CT-NeRF significantly outperforms state-of-the-art methods in pose estimation accuracy on datasets with complex trajectories, as demonstrated by lower rotation and translation errors. The method achieves high-quality novel view synthesis, even in challenging scenarios with arbitrary trajectory variations and reduced frame overlap. Ablation studies validate the importance of each component, particularly the reprojection loss and the incremental optimization strategy, in achieving accurate and robust results. The current work explores simple pose graphs, and investigating more sophisticated graph optimization techniques could be beneficial for very long trajectories. The paper highlights the need for dedicated evaluation datasets, protocols, and metrics specifically designed for complex camera trajectories to better assess reconstruction quality. neural radiance fields, pose estimation, structure from motion, incremental optimization, complex trajectories
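A hedged PyTorch sketch of the reprojected geometric image distance constraint summarized above: matched pixels in frame i are lifted with rendered depth, transformed into frame j, and the projection error against the corresponding pixel is penalized. The camera conventions (OpenCV-style intrinsics, world-to-camera poses) and the function signature are assumptions for illustration, not CT-NeRF's exact formulation.

```python
import torch

def reprojection_loss(uv_i, depth_i, uv_j, K, T_wc_i, T_wc_j):
    """uv_i, uv_j: (N, 2) matched pixels; depth_i: (N,) rendered depth in frame i;
    K: (3, 3) intrinsics; T_wc_i, T_wc_j: (4, 4) world-to-camera poses."""
    ones = torch.ones_like(depth_i)
    # Back-project pixels of frame i to camera-i coordinates, then to world space.
    pix_h = torch.cat([uv_i, ones[:, None]], dim=-1)                  # (N, 3)
    cam_i = (torch.linalg.inv(K) @ pix_h.T).T * depth_i[:, None]      # (N, 3)
    world = (torch.linalg.inv(T_wc_i) @ torch.cat([cam_i, ones[:, None]], -1).T).T
    # Transform into camera j and project with the intrinsics.
    cam_j = (T_wc_j @ world.T).T[:, :3]
    proj = (K @ cam_j.T).T
    uv_i_in_j = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    return torch.mean(torch.norm(uv_i_in_j - uv_j, dim=-1))

N = 128
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
loss = reprojection_loss(torch.rand(N, 2) * 640, torch.rand(N) * 5 + 0.5,
                         torch.rand(N, 2) * 640, K, torch.eye(4), torch.eye(4))
print(loss.item())
```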
2404.13816 Report Neural Radiance Field in Autonomous Driving: A Survey Lei He, Leheng Li, Wenchao Sun, Zeyu Han, Yichen Liu, Sifa Zheng, Jianqiang Wang, Keqiang Li Neural Radiance Field (NeRF) has garnered significant attention from both academia and industry due to its intrinsic advantages, particularly its implicit representation and novel view synthesis capabilities. With the rapid advancements in deep learning, a multitude of methods have emerged to explore the potential applications of NeRF in the domain of Autonomous Driving (AD). However, a conspicuous void is apparent within the current literature. To bridge this gap, this paper conducts a comprehensive survey of NeRF's applications in the context of AD. Our survey is structured to categorize NeRF's applications in Autonomous Driving (AD), specifically encompassing perception, 3D reconstruction, simultaneous localization and mapping (SLAM), and simulation. We delve into in-depth analysis and summarize the findings for each application category, and conclude by providing insights and discussions on future directions in this field. We hope this paper serves as a comprehensive reference for researchers in this domain. To the best of our knowledge, this is the first survey specifically focused on the applications of NeRF in the Autonomous Driving domain. This paper presents the first comprehensive survey of Neural Radiance Fields (NeRF) applications in autonomous driving, encompassing perception, 3D reconstruction, SLAM, and simulation. NeRF's implicit representation and novel view synthesis capabilities hold significant potential for enhancing autonomous driving technologies, prompting a surge of research in this area. The authors systematically categorize and analyze existing NeRF-based methods across various autonomous driving applications, summarizing key features and limitations. NeRF proves valuable for data augmentation in perception tasks, generating realistic training data and mitigating the sim-to-real gap. In 3D reconstruction, NeRF facilitates dynamic scene reconstruction, surface reconstruction, and inverse rendering, enabling applications like relighting and object insertion. NeRF-based SLAM methods demonstrate progress in pose estimation, scene representation, and handling depth uncertainty, with applications in localization and mapping. Current NeRF-based methods for autonomous driving often face computational challenges, particularly in high-dynamic scenarios and large-scale environments. Further research is needed to address limitations in reconstructing non-rigid objects, handling severe light conditions, and ensuring real-time performance in complex driving scenarios. neural radiance fields, autonomous driving, perception, 3d reconstruction, slam, simulation
2404.13784 Report Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images Ali Naseh, Katherine Thai, Mohit Iyyer, Amir Houmansadr With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual understanding to mimic the outputs of these platforms, introducing an original attack strategy. Our method leverages fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those available in marketplaces and from premium stock image providers, yet at a markedly lower expense. In presenting this strategy, we aim to spotlight a new class of economic and security considerations within the realm of digital imagery. Our findings, supported by both automated metrics and human assessment, reveal that comparable visual content can be produced for a fraction of the prevailing market prices ($0.23 - $0.27 per image), emphasizing the need for awareness and strategic discussions about the integrity of digital media in an increasingly AI-integrated landscape. Our work also contributes to the field by assembling a dataset consisting of approximately 19 million prompt-image pairs generated by the popular Midjourney platform, which we plan to release publicly. This paper introduces a novel attack strategy using multi-modal models to generate images similar to those in AI-generated image marketplaces and stock photo websites, at a fraction of the cost. This work exposes a vulnerability in the digital imagery landscape, highlighting the economic and security implications of AI-generated images and the potential for misuse. The proposed method utilizes a fine-tuned CLIP model, a multi-label classifier for extracting keywords and modifiers, and GPT-4V for generating refined prompts based on image analysis. The attack successfully generates comparable images for a significantly lower cost ($0.23 - $0.27 per image). The method outperforms baseline models like BLIP2 and CLIP Interrogator in image similarity tests. A large-scale dataset of 19 million prompt-image pairs from Midjourney was collected and will be publicly released. The success of the attack relies heavily on the performance of individual components (e.g., CLIP, GPT-4V), which can be unpredictable. Future work can explore the refinement of each component and investigate the generalization of the attack to other text-to-image models. ai-generated images, text-to-image synthesis, prompt engineering, digital image integrity, multi-modal learning
2404.13766 Report Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control Maria Mihaela Trusca, Wolf Nuyts, Jonathan Thomm, Robert Honig, Thomas Hofmann, Tinne Tuytelaars, Marie-Francine Moens Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated in state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets. Code and data will be made available upon acceptance. This paper proposes two novel training-free methods, focused cross-attention (FCA) and disentangled CLIP encoding (DisCLIP), for improving object-attribute binding in text-to-image synthesis by leveraging syntactic structure of text prompts. Existing diffusion models excel at generating photorealistic images but struggle to accurately bind attributes to objects in multi-object text prompts, leading to incorrect or nonsensical image generation. FCA utilizes syntactic dependencies to focus attribute attention within corresponding object regions during image generation. DisCLIP generates disentangled text prompt representations using a constituency tree encoding compositional information and object-attribute bindings. Both methods are seamlessly integrated into existing diffusion models without requiring retraining. FCA and DisCLIP effectively improve object-attribute binding and reduce attribute leakage as evidenced by improved performance on DAA-200, CC-500, and AE-276 benchmarks. A novel evaluation metric, EPViT, based on a ViT model trained to predict image-graph alignment, outperforms CLIP in assessing object-attribute binding accuracy. Integration of FCA and DisCLIP into various state-of-the-art diffusion models consistently enhances their performance without degrading image quality on general text prompts. Current EPViT training and FCA application focus solely on object-attribute binding, with potential for expansion to other syntactic relationships. The effectiveness of the proposed methods depends on the accuracy and expressiveness of syntactic parsers, potentially limiting performance when dealing with complex linguistic structures or languages with limited parsing capabilities. text-to-image synthesis, diffusion models, object-attribute binding, syntactic structure, image-text alignment
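The focused cross-attention idea can be pictured with a small PyTorch sketch in which an attribute token only receives attention from image locations where its syntactically linked object token attends strongly; the quantile-based object mask below is an illustrative assumption, not the paper's exact formulation.

```python
import torch

def focused_cross_attention(scores, attr_to_obj, keep_ratio=0.5):
    """scores: (N_pixels, T_tokens) pre-softmax cross-attention logits;
    attr_to_obj: attribute token index -> syntactically linked object token index."""
    probs = torch.softmax(scores, dim=-1)
    masked = scores.clone()
    for attr_idx, obj_idx in attr_to_obj.items():
        obj_attn = probs[:, obj_idx]
        # Object region = pixels whose attention to the object token is high.
        thresh = torch.quantile(obj_attn, 1.0 - keep_ratio)
        outside = obj_attn < thresh
        masked[outside, attr_idx] = float('-inf')  # block attribute leakage elsewhere
    return torch.softmax(masked, dim=-1)

scores = torch.randn(64 * 64, 77)                      # 64x64 latent grid, 77 text tokens
attn = focused_cross_attention(scores, {5: 4, 9: 8})   # e.g. "red"(5) -> "car"(4)
print(attn.shape)  # torch.Size([4096, 77])
```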
2404.13706 Report Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models Vitali Petsiuk, Kate Saenko Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backdoors in them. We use compositional property of diffusion models, which allows to leverage multiple prompts in a single image generation. This property allows us to combine other concepts, that should not have been affected by the inhibition, to reconstruct the vector, responsible for target concept generation, even though the direct computation of this vector is no longer accessible. We provide theoretical and empirical evidence why the proposed attacks are possible and discuss the implications of these findings for safe model deployment. We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary. Our work opens up the discussion about the implications of concept arithmetics and compositional inference for safety mechanisms in diffusion models. Content Advisory: This paper contains discussions and model-generated content that may be considered offensive. Reader discretion is advised. Project page: https://cs-people.bu.edu/vpetsiuk/arc This paper presents ARC (ARithmetics in Concept space) attacks, a novel method to circumvent concept inhibition in text-to-image diffusion models by exploiting the models' compositional properties. Concept inhibition is crucial for preventing the misuse of diffusion models for generating harmful or copyrighted content. This work exposes vulnerabilities in existing inhibition techniques, highlighting the need for more robust solutions. The authors leverage the linearity of conditional guidance in diffusion models. They design attacks that use compositional inference with carefully crafted prompts to reconstruct the erased concept's guidance vector, effectively bypassing the inhibition. ARC attacks significantly increase the reproduction rates of inhibited concepts, even when tested against various state-of-the-art inhibition methods. The attacks are straightforward to implement, requiring only black-box access to the model's compositional inference. The findings demonstrate that local modifications to the model's weights are insufficient for robust concept inhibition. The work primarily focuses on demonstrating the existence and effectiveness of such attacks. Further research is needed to explore optimal attack strategies and defenses. The study focuses on a limited set of concepts and inhibition methods. Evaluating the attacks on a broader range of concepts and models is important future work. diffusion models, concept inhibition, adversarial attacks, text-to-image generation, compositional inference
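The compositional property that the attack exploits amounts to linear arithmetic on noise predictions; below is a minimal sketch (with a toy stand-in denoiser so it runs) of composing several conditional predictions with chosen weights. The prompt embeddings, weights, and `toy` model are placeholders, not the paper's code.

```python
import torch

def composed_epsilon(model, x_t, t, prompt_embs, weights, uncond_emb):
    """Assemble the noise prediction as a weighted sum of guidance directions."""
    eps_uncond = model(x_t, t, uncond_emb)
    eps = eps_uncond.clone()
    for emb, w in zip(prompt_embs, weights):
        eps = eps + w * (model(x_t, t, emb) - eps_uncond)  # linear concept arithmetic
    return eps

# Toy stand-in for a denoiser so the sketch runs; a real setup would use a diffusion U-Net.
toy = lambda x, t, c: x * 0.1 + c.mean() * 0.01
x_t = torch.randn(1, 4, 64, 64)
embs = [torch.randn(1, 77, 768) for _ in range(3)]
eps = composed_epsilon(toy, x_t, 10, embs, [7.5, 3.0, -3.0], torch.zeros(1, 77, 768))
print(eps.shape)  # torch.Size([1, 4, 64, 64])
```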
2404.13696 Report Clio: Real-time Task-Driven Open-Set 3D Scene Graphs Dominic Maggio, Yun Chang, Nathan Hughes, Matthew Trang, Dan Griffith, Carlyn Dougherty, Eric Cristofalo, Lukas Schmid, Luca Carlone Modern tools for class-agnostic image segmentation (e.g., SegmentAnything) and open-set semantic understanding (e.g., CLIP) provide unprecedented opportunities for robot perception and mapping. While traditional closed-set metric-semantic maps were restricted to tens or hundreds of semantic classes, we can now build maps with a plethora of objects and countless semantic variations. This leaves us with a fundamental question: what is the right granularity for the objects (and, more generally, for the semantic concepts) the robot has to include in its map representation? While related work implicitly chooses a level of granularity by tuning thresholds for object detection, we argue that such a choice is intrinsically task-dependent. The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language and has to select the granularity and the subset of objects and scene structure to retain in its map that is sufficient to complete the tasks. We show that this problem can be naturally formulated using the Information Bottleneck (IB), an established information-theoretic framework. The second contribution is an algorithm for task-driven 3D scene understanding based on an Agglomerative IB approach, that is able to cluster 3D primitives in the environment into task-relevant objects and regions and executes incrementally. The third contribution is to integrate our task-driven clustering algorithm into a real-time pipeline, named Clio, that constructs a hierarchical 3D scene graph of the environment online using only onboard compute, as the robot explores it. Our final contribution is an extensive experimental campaign showing that Clio not only allows real-time construction of compact open-set 3D scene graphs, but also improves the accuracy of task execution by limiting the map to relevant semantic concepts. This paper presents Clio, a real-time system that builds task-driven 3D scene graphs with open-set semantics, clustering 3D primitives into task-relevant objects and regions using an Information Bottleneck approach. Current methods for building semantic maps are limited to a fixed set of concepts and don't consider the task-dependency of choosing relevant semantic concepts, which is crucial for robot perception. The paper leverages vision-language models (VLMs) like CLIP and task-agnostic segmentation (e.g., SegmentAnything) to cluster 3D primitives using an incremental Agglomerative Information Bottleneck algorithm, enabling real-time operation. Clio constructs more compact and useful scene representations compared to task-agnostic methods, retaining only task-relevant objects and regions. It achieves comparable performance to state-of-the-art methods in closed-set object detection tasks, demonstrating its efficacy in both open and closed-set settings. Clio enables real-time onboard mapping and supports mobile manipulation tasks on a Spot robot, showcasing its practicality for robotics applications. The approach inherits limitations from the foundation models used, such as vulnerability to prompt tuning. Current implementation uses simple averaging to merge semantic descriptions of primitives, and extending it to handle more complex, multi-step tasks is desirable. 
3d scene understanding, robotics, information bottleneck, vision-language models, open-set recognition
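For intuition, here is a generic agglomerative Information Bottleneck step under assumed inputs (per-cluster task distributions and cluster priors): the pair of clusters whose merge loses the least task-relevant information, measured by a prior-weighted Jensen-Shannon divergence, is merged first. This is an illustrative AIB sketch, not Clio's implementation.

```python
import numpy as np

def js_merge_cost(p_i, p_j, pi_i, pi_j):
    """Prior-weighted Jensen-Shannon divergence between two task distributions."""
    w_i, w_j = pi_i / (pi_i + pi_j), pi_j / (pi_i + pi_j)
    m = w_i * p_i + w_j * p_j
    kl = lambda p, q: float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))
    return (pi_i + pi_j) * (w_i * kl(p_i, m) + w_j * kl(p_j, m))

def agglomerative_ib_step(p_task_given_c, priors):
    """Return the pair of clusters whose merge loses the least task-relevant information."""
    n = len(priors)
    best, best_cost = None, np.inf
    for i in range(n):
        for j in range(i + 1, n):
            cost = js_merge_cost(p_task_given_c[i], p_task_given_c[j], priors[i], priors[j])
            if cost < best_cost:
                best, best_cost = (i, j), cost
    return best, best_cost

p = np.random.rand(6, 4)
p /= p.sum(axis=1, keepdims=True)          # 6 primitive clusters, 4 tasks
print(agglomerative_ib_step(p, np.full(6, 1 / 6)))
```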
2404.13686 Report Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, Xuefeng Xiao Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Current distillation techniques often dichotomize into two distinct aspects: i) ODE Trajectory Preservation; and ii) ODE Trajectory Reformulation. However, these approaches suffer from severe performance degradation or domain shifts. To address these limitations, we propose Hyper-SD, a novel framework that synergistically amalgamates the advantages of ODE Trajectory Preservation and Reformulation, while maintaining near-lossless performance during step compression. Firstly, we introduce Trajectory Segmented Consistency Distillation to progressively perform consistent distillation within pre-defined time-step segments, which facilitates the preservation of the original ODE trajectory from a higher-order perspective. Secondly, we incorporate human feedback learning to boost the performance of the model in a low-step regime and mitigate the performance loss incurred by the distillation process. Thirdly, we integrate score distillation to further improve the low-step generation capability of the model and offer the first attempt to leverage a unified LoRA to support the inference process at all steps. Extensive experiments and user studies demonstrate that Hyper-SD achieves SOTA performance from 1 to 8 inference steps for both SDXL and SD1.5. For example, Hyper-SDXL surpasses SDXL-Lightning by +0.68 in CLIP Score and +0.51 in Aes Score in the 1-step inference. Hyper-SD, a novel framework combining ODE Trajectory Preservation and Reformulation, accelerates diffusion models (SDXL and SD1.5) while maintaining near-lossless performance during step compression. Diffusion models, though powerful for Generative AI, suffer from high computational cost due to multi-step inference. Existing distillation methods for acceleration either compromise generation quality or introduce domain shifts. The framework leverages: (1) Trajectory Segmented Consistency Distillation for progressive, fine-grained distillation; (2) Human feedback learning to optimize the model for few-step inference; (3) Score distillation for enhanced one-step generation and a unified LoRA for all inference steps. Hyper-SD achieves SOTA performance for both SDXL and SD1.5 in low-step inference (1 to 8 steps) across quantitative metrics and user studies. Hyper-SD maintains better image quality and text-image alignment than competing methods, especially for SD1.5 with limited model capacity. Hyper-SD is compatible with ControlNet, various base models, and supports flexible inference with a unified LoRA. Current acceleration methods, including Hyper-SD, eliminate Classifier-Free Guidance, limiting control with negative prompts. The use of generic reward models for human feedback can be further improved by customized ones for accelerated models. diffusion models, model acceleration, distillation, human feedback learning, generative ai
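A rough sketch of trajectory-segmented consistency distillation as summarized above: timesteps are split into segments and the student is pulled toward a one-solver-step-older target within the same segment, with the segment start acting as the consistency anchor instead of t = 0. The solver step, model interface, and toy stand-ins below are assumptions so the snippet runs end to end, not the paper's training code.

```python
import torch

def segment_boundaries(num_train_steps=1000, num_segments=4):
    edges = torch.linspace(0, num_train_steps, num_segments + 1).long()
    return list(zip(edges[:-1].tolist(), edges[1:].tolist()))  # [(0, 250), (250, 500), ...]

def segmented_consistency_loss(student, teacher_step, x_t, t, segments):
    # The lower edge of the segment containing t is the consistency anchor (not t = 0).
    s_lo = next(lo for lo, hi in segments if lo <= t < hi)
    x_prev, t_prev = teacher_step(x_t, t)  # one ODE solver step toward the anchor
    target = student(x_prev, torch.tensor([t_prev]), torch.tensor([s_lo])).detach()
    pred = student(x_t, torch.tensor([t]), torch.tensor([s_lo]))
    return torch.mean((pred - target) ** 2)

# Toy stand-ins so the sketch runs; a real setup would use a diffusion U-Net and ODE solver.
student = lambda x, t, s: x * (1 - t.float() / 1000)
teacher = lambda x, t: (x * 0.99, max(t - 20, 0))
segs = segment_boundaries()
print(segmented_consistency_loss(student, teacher, torch.randn(2, 4, 8, 8), 730, segs))
```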
2404.13680 Report PoseAnimate: Zero-shot high fidelity pose controllable character animation Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Yu-Gang Jiang, Guo-Jun Qi Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity with the source image. However, existing approaches suffer from character appearance inconsistency and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) Pose-Aware Control Module (PACM) incorporates diverse pose signals into conditional embeddings, to preserve character-independent content and maintain precise alignment of actions. 2) Dual Consistency Attention Module (DCAM) enhances temporal consistency, and retains character identity and intricate background details. 3) Mask-Guided Decoupling Module (MGDM) refines distinct feature perception, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transition. Extensive experimental results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations. PoseAnimate: A zero-shot, reconstruction-based I2V framework for character animation that generates high-quality videos of arbitrary character images performing user-defined pose sequences. Existing I2V methods suffer from appearance inconsistency, poor detail preservation, and high computational cost due to training requirements. This work explores a training-free approach for efficient and high-fidelity character animation. The framework leverages a novel pose-aware control module (PACM) to optimize embeddings for pose alignment while maintaining scene consistency. It incorporates a dual consistency attention module (DCAM) for temporal coherence and identity preservation, further enhanced by a mask-guided decoupling module (MGDM) for refined detail perception. Outperforms state-of-the-art training-based methods in character consistency and detail fidelity. Demonstrates superior preservation of complex fine-grained details and temporal coherence. Achieves high-quality animation without requiring training, leading to lower computational overhead. Reliance on pre-trained models might limit generalization ability to unseen domains. Further exploration of handling complex interactions between character and background. image animation, character animation, zero-shot learning, diffusion models, pose control
2404.13679 Report GScream: Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal Yuxin Wang, Qianyi Wu, Guofeng Zhang, Dan Xu This paper tackles the intricate challenge of object removal to update the radiance field using the 3D Gaussian Splatting. The main challenges of this task lie in the preservation of geometric consistency and the maintenance of texture coherence in the presence of the substantial discrete nature of Gaussian primitives. We introduce a robust framework specifically designed to overcome these obstacles. The key insight of our approach is the enhancement of information exchange among visible and invisible areas, facilitating content restoration in terms of both geometry and texture. Our methodology begins with optimizing the positioning of Gaussian primitives to improve geometric consistency across both removed and visible areas, guided by an online registration process informed by monocular depth estimation. Following this, we employ a novel feature propagation mechanism to bolster texture coherence, leveraging a cross-attention design that bridges sampling Gaussians from both uncertain and certain areas. This innovative approach significantly refines the texture coherence within the final radiance field. Extensive experiments validate that our method not only elevates the quality of novel view synthesis for scenes undergoing object removal but also showcases notable efficiency gains in training and rendering speeds. This paper presents GScream, a novel framework for efficient and effective 3D object removal from pre-captured scenes using 3D Gaussian Splatting (3DGS). Existing methods for 3D object removal based on Neural Radiance Fields (NeRF) suffer from slow training and rendering speeds, while standard 3DGS methods lack geometric accuracy and texture coherence needed for object removal. This work addresses these limitations. GScream leverages monocular depth estimation as extra supervision for improving geometric consistency and introduces a novel feature propagation mechanism based on cross-attention between Gaussians in visible and in-painted regions to enhance texture coherence. GScream achieves comparable or superior performance to state-of-the-art NeRF-based methods in terms of visual quality while achieving significantly faster training speeds (1.5x to 4x faster). Monocular depth guidance is shown to significantly improve the geometric accuracy of 3DGS, leading to more realistic object removal. The proposed cross-attention feature regularization effectively propagates texture information from visible to in-painted regions, resulting in enhanced texture coherence and natural-looking object removal. The reliance on 2D in-painting for the reference view might introduce limitations if the in-painting results are imperfect. Future work could explore joint optimization of 2D in-painting and 3DGS for better overall consistency. 3d object removal, 3d gaussian splatting, neural radiance fields, depth completion, texture propagation
2404.13579 Report LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions Xiaoran Zhao, Tianhao Wu, Yu Lai, Zhiliang Tian, Zhen Huang, Yahui Liu, Zejiang He, Dongsheng Li Controllable text-to-image generation synthesizes visual text and objects in images with certain conditions, which are frequently applied to emoji and poster generation. Visual text rendering and layout-to-image generation tasks have been popular in controllable text-to-image generation. However, each of these tasks typically focuses on single modality generation or rendering, leaving yet-to-be-bridged gaps between the approaches correspondingly designed for each of the tasks. In this paper, we combine text rendering and layout-to-image generation tasks into a single task: layout-controllable text-object synthesis (LTOS) task, aiming at synthesizing images with object and visual text based on predefined object layout and text contents. As compliant datasets are not readily available for our LTOS task, we construct a layout-aware text-object synthesis dataset, containing elaborate well-aligned labels of visual text and object information. Based on the dataset, we propose a layout-controllable text-object adaptive fusion (TOF) framework, which generates images with clear, legible visual text and plausible objects. We construct a visual-text rendering module to synthesize text and employ an object-layout control module to generate objects while integrating the two modules to harmoniously generate and integrate text content and objects in images. To better the image-text integration, we propose a self-adaptive cross-attention fusion module that helps the image generation to attend more to important text information. Within such a fusion module, we use a self-adaptive learnable factor to learn to flexibly control the influence of cross-attention outputs on image generation. Experimental results show that our method outperforms the state-of-the-art in LTOS, text rendering, and layout-to-image tasks, enabling harmonious visual text rendering and object generation. This paper presents a novel framework, called TOF, for layout-controllable text-object synthesis (LTOS) which aims to generate images with user-controlled object placement and visual text. Existing text-to-image generation methods struggle to accurately control both object layout and visual text rendering simultaneously, creating a need for an integrated approach. The TOF framework consists of: (1) An object-layout control module for generating objects at specific locations, (2) a visual-text rendering module for synthesizing text with custom layouts, and (3) a text-object self-adaptive fusion module for balancing text and object generation using adaptive cross-attention. TOF significantly outperforms state-of-the-art methods in text rendering quality while maintaining object generation accuracy. The proposed LTOS dataset, containing aligned object and visual text annotations, proves valuable for training and evaluating LTOS tasks. Ablation studies confirm the contribution of each component, especially the self-adaptive fusion module. The current dataset focuses primarily on English text; expanding to other languages is a future goal. Future work will explore incorporating more sophisticated text layouts and styles. diffusion model, text rendering, multi-modal generation, text-object synthesis, layout-to-image generation
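The self-adaptive cross-attention fusion can be pictured as a learnable gate on the cross-attention branch; the single-scalar tanh gate in this PyTorch sketch is an illustrative assumption rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # gate starts closed, opens during training

    def forward(self, image_tokens, text_tokens):
        fused, _ = self.cross_attn(image_tokens, text_tokens, text_tokens)
        # Learnable factor controls how strongly the visual-text branch influences generation.
        return image_tokens + torch.tanh(self.alpha) * fused

x = torch.randn(2, 256, 320)   # backbone image tokens
c = torch.randn(2, 77, 320)    # visual-text condition tokens
print(AdaptiveFusion(320)(x, c).shape)  # torch.Size([2, 256, 320])
```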
2404.13573 Report Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap Bowen Qu, Xiaoyu Liang, Shangkun Sun, Wei Gao The recent advancements in Text-to-Video Artificial Intelligence Generated Content (AIGC) have been remarkable. Compared with traditional videos, the assessment of AIGC videos encounters various challenges: visual inconsistency that defies common sense, discrepancies between content and the textual prompt, and distribution gap between various generative models, etc. Targeting these challenges, in this work, we categorize the assessment of AIGC video quality into three dimensions: visual harmony, video-text consistency, and domain distribution gap. For each dimension, we design specific modules to provide a comprehensive quality assessment of AIGC videos. Furthermore, our research identifies significant variations in visual quality, fluidity, and style among videos generated by different text-to-video models. Predicting the source generative model can make the AIGC video features more discriminative, which enhances the quality assessment performance. The proposed method was used in the third-place winner of the NTIRE 2024 Quality Assessment for AI-Generated Content - Track 2 Video, demonstrating its effectiveness. Code will be available at https://github.com/Coobiw/TriVQA. This paper presents a novel framework for assessing AI-Generated Content (AIGC) video quality, addressing the unique challenges posed by this new type of video. Existing video quality assessment methods fall short in evaluating AIGC videos due to their unique characteristics, such as visual inconsistencies, discrepancies between content and textual prompts, and variations across generative models. The proposed framework decouples AIGC video quality assessment into three dimensions: visual harmony, video-text consistency, and domain distribution gap. It employs a dual-stream architecture with explicit prompt injection, implicit text guidance, caption similarity, and auxiliary inter-domain classification. The method outperforms state-of-the-art VQA methods on the NTIRE 2024 AIGC Video Quality Assessment dataset. Explicit prompt injection, implicit text guidance, and auxiliary inter-domain classification are shown to significantly improve performance. The proposed method secured the third-place position in the NTIRE 2024 Quality Assessment for AI-Generated Content - Track 2 Video Challenge. The reliance on a limited AIGC video dataset may not fully encompass the diversity of future generative models. Future work could explore expanding the dataset with samples from a wider range of T2V models. aigc, video quality assessment, text-to-video, multimodal learning, domain gap
2404.13445 Report DMesh: A Differentiable Mesh Representation Sanghyun Son, Matheus Gadelha, Yang Zhou, Zexiang Xu, Ming C. Lin, Yi Zhou We present a differentiable representation, DMesh, for general 3D triangular meshes. DMesh considers both the geometry and connectivity information of a mesh. In our design, we first get a set of convex tetrahedra that compactly tessellates the domain based on Weighted Delaunay Triangulation (WDT), and select triangular faces on the tetrahedra to define the final mesh. We formulate probability of faces to exist on the actual surface in a differentiable manner based on the WDT. This enables DMesh to represent meshes of various topology in a differentiable way, and allows us to reconstruct the mesh under various observations, such as point cloud and multi-view images using gradient-based optimization. The source code and full paper is available at: https://sonsang.github.io/dmesh-project. DMesh, a differentiable representation for general 3D triangular meshes, which considers both geometry and connectivity and enables gradient-based optimization of mesh topology and features. Existing differentiable mesh representations are limited by fixed topology or reliance on intermediate forms, leading to challenges in representing diverse geometries. Utilizes differentiable Weighted Delaunay Triangulation (WDT) to divide a convex domain into tetrahedra, selecting a subset of triangular faces from them to define the final mesh, and formulates the probability of faces existing on the actual surface in a differentiable manner. DMesh is versatile and can represent meshes of various topologies, including non-convex polyhedra, non-orientable geometries, and complex structures. A computationally efficient approach to differentiable WDT is proposed, running in approximately linear time compared to the exponential cost of previous methods. DMesh allows for efficient reconstruction of surfaces from point clouds and multi-view images, resulting in compact and accurate meshes. Current DMesh resolution is limited by computational cost, particularly due to WDT construction. While DMesh generalizes well to various mesh connectivities, it can exhibit non-manifold errors, requiring further research to guarantee manifoldness. differentiable mesh representation, weighted delaunay triangulation, mesh reconstruction, point cloud reconstruction, multi-view reconstruction
2404.13400 Report HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual/linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exists significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (Hi LoRA) paradigm. The cross-modal bridge can address the inconsistency between visual features and those required for grounding, and establish a connection between multi-level visual and text features. Hi LoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase the significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG. This paper presents HiVG, a hierarchical multimodal fine-grained modulation framework that effectively adapts a pre-trained CLIP model for visual grounding. Existing visual grounding methods suffer from task gaps between pre-training and grounding, particularly data bias and differences in learning objectives. This work aims to address these gaps by leveraging the power of multimodal pre-training. HiVG consists of two main components: (1) a multi-layer adaptive cross-modal bridge to align visual and textual features and (2) a hierarchical low-rank adaptation (Hi LoRA) paradigm for efficient fine-tuning of the pre-trained model. HiVG achieves state-of-the-art performance on five benchmark datasets, outperforming both CLIP-based and detector-based methods. The proposed Hi LoRA paradigm enables efficient adaptation with minimal trainable parameters while maintaining high performance. HiVG exhibits strong semantic comprehension capabilities, achieving superior results on grounding tasks involving complex and lengthy text descriptions. The performance of HiVG with a Beit-3 backbone, while improved by Hi LoRA, is still lower than that of CLIP, indicating potential limitations in generalizing to other pre-trained models. Future work could investigate adaptive selection of layer groups and LoRA stages for enhanced hierarchical adaptation. visual grounding, referring expression comprehension, multimodal learning, low-rank adaptation, hierarchical learning
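A hedged sketch of the hierarchical low-rank adaptation idea summarized above: each frozen linear layer gets a LoRA branch, and training proceeds in stages that progressively enable the LoRA modules of deeper layer groups once shallower ones have been tuned. The group sizes and staging schedule are assumptions for illustration, not HiVG's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T  # low-rank update on top of the base

layers = nn.ModuleList([LoRALinear(nn.Linear(64, 64)) for _ in range(12)])
groups = [layers[0:4], layers[4:8], layers[8:12]]  # shallow -> deep layer groups

def set_stage(stage):
    """Enable LoRA training only for layer groups up to the current stage."""
    for g_idx, group in enumerate(groups):
        for layer in group:
            for p in (layer.A, layer.B):
                p.requires_grad_(g_idx <= stage)

set_stage(0)  # stage 1: adapt the shallow group first, deeper groups stay frozen
print(sum(p.requires_grad for p in layers.parameters()))  # 8 trainable LoRA tensors
```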
2404.13370 Report Movie101v2: Improved Movie Narration Benchmark Zihao Yue, Yepeng Zhang, Ziheng Wang, Qin Jin Automatic movie narration targets at creating video-aligned plot descriptions to assist visually impaired audiences. It differs from standard video captioning in that it requires not only describing key visual details but also inferring the plots developed across multiple movie shots, thus posing unique and ongoing challenges. To advance the development of automatic movie narrating systems, we first revisit the limitations of existing datasets and develop a large-scale, bilingual movie narration dataset, Movie101v2. Second, taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages and tentatively focus on the initial stages featuring understanding within individual clips. We also introduce a new narration assessment to align with our staged task goals. Third, using our new dataset, we baseline several leading large vision-language models, including GPT-4V, and conduct in-depth investigations into the challenges current models face for movie narration generation. Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research. This paper introduces Movie101v2, a large-scale, bilingual dataset for movie narration generation, building upon and improving the original Movie101 dataset. Automatic movie narration is crucial for visually impaired audiences but remains challenging due to the need to describe visual details and infer plots. Existing datasets have limitations such as small scale, single language, and short, simple clips. The authors collected 102 additional movies with narrations, used ASR and LLMs for text processing, and enhanced data quality by completing and correcting character names. They defined three progressive stages for movie narration: visual fact description (L1), plot reasoning and narration (L2), and applicable AD text generation (L3). A new evaluation framework using LLMs assesses L1 and L2 separately. Movie101v2 consists of 203 movies and 46K bilingual video-narration pairs, exceeding the scale of existing datasets. Baseline models, including GPT-4V, show promising results but still struggle with L2-level plot reasoning and narration. Analysis reveals challenges in visual perception, particularly character action/emotion recognition and face matching, and text generation due to the complexity of narration language. The current work focuses on L1 and L2, leaving the more complex L3 for future exploration. The analysis mainly focuses on GPT-4V, limiting insights into other models' limitations and potential solutions. movie narration, video understanding, multi-modal, dataset, large vision-language models
2404.13320 Report Pixel is a Barrier: Diffusion Models Are More Adversarially Robust Than We Think Haotian Xue, Yongxin Chen Adversarial examples for diffusion models are widely used as solutions for safety concerns. By adding adversarial perturbations to personal images, attackers can not edit or imitate them easily. However, it is essential to note that all these protections target the latent diffusion model (LDMs), the adversarial examples for diffusion models in the pixel space (PDMs) are largely overlooked. This may mislead us to think that the diffusion models are vulnerable to adversarial attacks like most deep models. In this paper, we show novel findings that: even though gradient-based white-box attacks can be used to attack the LDMs, they fail to attack PDMs. This finding is supported by extensive experiments of almost a wide range of attacking methods on various PDMs and LDMs with different model structures, which means diffusion models are indeed much more robust against adversarial attacks. We also find that PDMs can be used as an off-the-shelf purifier to effectively remove the adversarial patterns that were generated on LDMs to protect the images, which means that most protection methods nowadays, to some extent, cannot protect our images from malicious attacks. We hope that our insights will inspire the community to rethink the adversarial samples for diffusion models as protection methods and move forward to more effective protection. Codes are available in https://github.com/xavihart/PDM-Pure. This paper reveals that Pixel Diffusion Models (PDMs) are significantly more robust against adversarial attacks than commonly believed, contrary to the vulnerability observed in Latent Diffusion Models (LDMs). This finding challenges the existing assumption that diffusion models are easily fooled by adversarial attacks and has important implications for the security and protection of these models. The authors conduct extensive experiments on various LDMs and PDMs with different architectures, datasets, and resolutions. They test existing attack methods and evaluate the robustness of PDMs. Existing adversarial attack methods designed for LDMs fail to effectively attack PDMs. PDMs exhibit strong robustness against adversarial perturbations, even with large perturbation budgets. A new purification method, PDM-Pure, leverages the robustness of PDMs to effectively remove protective perturbations from images, bypassing existing protection methods. The study primarily focuses on image-based diffusion models, and further investigation is needed for other modalities. The purification effectiveness of PDM-Pure may vary depending on the strength and type of adversarial perturbations. diffusion models, adversarial attacks, robustness, image protection, purification
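The purification finding can be illustrated with a generic SDEdit-style procedure: diffuse the protected image to a moderate noise level and denoise it with a pixel-space diffusion model, which tends to wash out small adversarial perturbations. The DDPM update and the `denoiser` interface below are simplified assumptions, not the authors' exact PDM-Pure pipeline.

```python
import torch

def purify(denoiser, x0, alphas_cumprod, t_star=300):
    """x0: (B, C, H, W) protected image in [-1, 1]; denoiser(x_t, t) predicts noise."""
    a_bar = alphas_cumprod[t_star]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * torch.randn_like(x0)  # forward diffuse
    for t in range(t_star, 0, -1):                                       # ancestral reverse
        a_t = alphas_cumprod[t] / alphas_cumprod[t - 1]
        eps = denoiser(x_t, t)
        mean = (x_t - (1 - a_t) / (1 - alphas_cumprod[t]).sqrt() * eps) / a_t.sqrt()
        noise = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
        x_t = mean + (1 - a_t).sqrt() * noise
    return x_t

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
toy_denoiser = lambda x, t: torch.zeros_like(x)   # stand-in for a real pixel diffusion model
purified = purify(toy_denoiser, torch.rand(1, 3, 64, 64) * 2 - 1, alphas_cumprod)
print(purified.shape)  # torch.Size([1, 3, 64, 64])
```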
2404.13306 Report FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models Yixuan Li, Xuelin Liu, Xiaoyang Wang, Shiqi Wang, Weisi Lin Recently, fake images generated by artificial intelligence (AI) models have become indistinguishable from real ones, posing new challenges for fake image detection models. Consequently, simple binary judgments of real or fake seem less convincing and credible due to the absence of human-understandable explanations. Fortunately, Large Multimodal Models (LMMs) bring possibilities to materialize the judgment process while their performance remains undetermined. Therefore, we propose FakeBench, a first-of-its-kind benchmark towards transparent defake, consisting of fake images with human language descriptions on forgery signs. FakeBench probes two open questions about LMMs: (1) can LMMs distinguish fake images generated by AI, and (2) how do LMMs distinguish fake images? Specifically, we construct the FakeClass dataset with 6k diverse-sourced fake and real images, each equipped with a Question&Answer pair concerning the authenticity of images, which are utilized to benchmark the detection ability. To examine the reasoning and interpretation abilities of LMMs, we present the FakeClue dataset, consisting of 15k pieces of descriptions on the telltale clues revealing the falsification of fake images. Besides, we construct FakeQA to measure the LMMs' open-question answering ability on fine-grained authenticity-relevant aspects. Our experimental results show that current LMMs possess moderate identification ability, preliminary interpretation and reasoning ability, and passable open-question answering ability for image defake. The FakeBench will be made publicly available soon. This paper introduces FakeBench, the first benchmark for evaluating the 'transparent defake' abilities of Large Multimodal Models (LMMs), focusing on whether LMMs can not only detect fake images but also provide human-understandable explanations for their judgments. With the rise of highly realistic AI-generated fake images, simple binary judgments of 'real' or 'fake' are no longer sufficient. Transparent defake, with its emphasis on human-interpretable explanations, is crucial for building trust and understanding potential model biases. FakeBench consists of three datasets: FakeClass (for evaluating detection ability), FakeClue (for evaluating reasoning and interpretation abilities), and FakeQA (for evaluating open-ended question answering on authenticity details). The researchers collected diverse fake images and created natural language annotations, including questions, answers, and detailed descriptions of forgery signs. 13 well-known LMMs were then evaluated on these datasets. Current LMMs show moderate ability in detecting fake images, but their performance varies significantly across different generation models. LMMs exhibit only preliminary abilities in interpreting and reasoning about fake images using human-understandable language. Explicit chain-of-thought reasoning, while generally beneficial in other tasks, does not significantly improve the fake image detection accuracy for most LMMs. The reasoning ability of LMMs is still limited by their understanding of the real world and their capability to describe image irrationality. Future work should focus on introducing conflict awareness and real-world knowledge to guide LMMs towards better fake image detection and explanation. large multimodal models, fake image detection, reasoning and interpretation, benchmark, transparent defake
2404.13299 Report PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition Xi Fang, Weigang Wang, Xiaoxin Lv, Jun Yan The development of Large Language Models (LLMs) and Diffusion Models has fueled the boom of Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality assessment framework to provide a quantifiable evaluation of different images or videos generated by AIGC technologies. The content generated by AIGC methods is driven by crafted prompts. Therefore, it is intuitive that the prompts can also serve as the foundation of AIGC quality assessment. This study proposes an effective AIGC quality assessment (QA) framework. First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to understand and respond to the prompt conditions. Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features. The empirical study is conducted on two datasets: AIGIQA-20K (AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video Quality Assessment DataBase), which validates the effectiveness of our proposed method: Prompt Condition Quality Assessment (PCQA). Our proposed simple and feasible framework may promote research development in the multimodal generation field. This paper introduces PCQA, a unified framework for assessing the quality of AI-generated images and videos by incorporating prompt information as a conditional factor. With the rise of AIGC, evaluating the quality of generated content, particularly in alignment with the creative intent expressed in prompts, is crucial. Existing UGC quality assessment methods fall short in addressing this need. The PCQA method leverages a hybrid CLIP text encoder to understand prompts and employs a feature mixer module to blend visual features with adapted prompt representations. The final quality score is obtained through a regression head trained on MOS values. PCQA significantly outperforms baseline methods on both AIGIQA-20K (image) and T2VQA-DB (video) datasets. Ablation studies demonstrate the benefits of using a hybrid text encoder, feature adapter, and model ensemble techniques. The proposed method secured top rankings in the NTIRE 2024 AIGC quality assessment competition. The current method resizes images, potentially losing aspect ratio information crucial for aesthetic evaluation. Future work should explore aspect-ratio-preserving techniques. The model's reliance on global average pooling for feature extraction results in a loss of spatial information, potentially limiting its performance in video quality assessment. Future research should investigate incorporating spatial-temporal information. aigc quality assessment, prompt-conditional quality assessment, clip text encoder, feature mixer, model ensemble
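As a rough illustration of the prompt-conditioned design (not the authors' code), one can extract CLIP text and image features and regress a quality score from their fusion. The feature mixer is reduced here to a small MLP, and all module sizes and the frozen-encoder choice are assumptions.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class PromptConditionQA(nn.Module):
    """Toy stand-in for PCQA: fuse prompt and vision features, regress a MOS-style score."""
    def __init__(self, dim=512):
        super().__init__()
        self.mixer = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, image, prompt):
        inputs = proc(text=[prompt], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():  # CLIP encoders kept frozen in this sketch
            txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
            img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        return self.mixer(torch.cat([txt, img], dim=-1))  # predicted quality score
```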
2404.13263 Report FilterPrompt: Guiding Image Transfer in Diffusion Models Xi Wang, Yichen Peng, Heng Fang, Haoran Xie, Xi Yang, Chuntao Li In controllable generation tasks, flexibly manipulating the generated images to attain a desired appearance or structure based on a single input image cue remains a critical and longstanding challenge. Achieving this requires the effective decoupling of key attributes within the input image data, aiming to get representations accurately. Previous research has predominantly concentrated on disentangling image attributes within feature space. However, the complex distribution present in real-world data often makes the application of such decoupling algorithms to other datasets challenging. Moreover, the granularity of control over feature encoding frequently fails to meet specific task requirements. Upon scrutinizing the characteristics of various generative models, we have observed that the input sensitivity and dynamic evolution properties of the diffusion model can be effectively fused with the explicit decomposition operation in pixel space. This integration enables the image processing operations performed in pixel space for a specific feature distribution of the input image, and can achieve the desired control effect in the generated results. Therefore, we propose FilterPrompt, an approach to enhance the model control effect. It can be universally applied to any diffusion model, allowing users to adjust the representation of specific image features in accordance with task requirements, thereby facilitating more precise and controllable generation outcomes. In particular, our designed experiments demonstrate that the FilterPrompt optimizes feature correlation, mitigates content conflicts during the generation process, and enhances the model's control capability. The paper introduces FilterPrompt, a novel approach that enhances control in diffusion models by manipulating frequency and distribution characteristics of image attributes in pixel space, influencing their representation during generation. This approach addresses the limitations of feature space manipulation, offering a more intuitive, controllable, and universally applicable method for enhancing control in diffusion models. FilterPrompt integrates filtering operations with a baseline architecture combining ControlNet and IP-Adapter. It applies filters to input images, guiding the diffusion process by modulating feature expression based on specific tasks. FilterPrompt excels in preserving structure, shape, and edge similarity, as evidenced by higher SP and lower CD scores. It effectively transfers color distribution and texture features, exhibiting lower FID and GLCM values compared to other methods. The method demonstrates strong performance in both style transfer and appearance transfer tasks across diverse domains. Designing FilterPrompt necessitates manual adjustments based on specific task requirements and data characteristics, involving a degree of trial and error. While the current study focuses on a specific baseline architecture, integrating FilterPrompt with more advanced diffusion models holds potential for further improvement. image transfer, controllable generation, diffusion models, explicit decomposition, visual prompt
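Since FilterPrompt manipulates attributes directly in pixel space before conditioning, the core operation can be sketched as applying an off-the-shelf filter to the reference image that is then fed to the structure/appearance branches (e.g., ControlNet / IP-Adapter). The specific filters and strengths below are illustrative assumptions, not the paper's exact settings.

```python
from PIL import Image, ImageFilter

def filter_prompt(image_path, mode="appearance", radius=6):
    """Pre-filter an input image in pixel space before using it as a diffusion condition."""
    img = Image.open(image_path).convert("RGB")
    if mode == "appearance":
        # Low-pass filtering keeps the coarse color distribution, suppresses fine texture/structure.
        return img.filter(ImageFilter.GaussianBlur(radius=radius))
    if mode == "structure":
        # Edge extraction keeps shape/contour cues, discards color and texture.
        return img.filter(ImageFilter.FIND_EDGES)
    return img

# cond = filter_prompt("reference.jpg", mode="appearance")  # then pass `cond` to IP-Adapter / ControlNet
```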
2404.13153 Report Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring Chengxu Liu, Xuan Wang, Xiangyu Xu, Ruhao Tian, Shuai Li, Xueming Qian, Ming-Hsuan Yang Eliminating image blur produced by various kinds of motion has been a challenging problem. Dominant approaches rely heavily on model capacity to remove blurring by reconstructing residual from blurry observation in feature space. These practices not only prevent the capture of spatially variable motion in the real world but also ignore the tailored handling of various motions in image space. In this paper, we propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative (MISC) Filter. In particular, we use a motion estimation network to capture motion information from neighborhoods, thereby adaptively estimating spatially-variant motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter. The MISC Filter first aligns the motion-induced blurring patterns to the motion middle along the predicted flow direction, and then collaboratively filters the aligned image through the predicted kernels, weights, and offsets to generate the output. This design can handle more generalized and complex motion in a spatially differentiated manner. Furthermore, we analyze the relationships between the motion estimation network and the residual reconstruction network. Extensive experiments on four widely used benchmarks demonstrate that our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance. Code is available at https://github.com/ChengxuLiu/MISCFilter This paper introduces the Motion-adaptive Separable Collaborative (MISC) Filter for blind motion deblurring. It tackles the limitations of previous methods by directly addressing motion blur in the image space instead of solely focusing on feature space. Existing deblurring methods struggle to handle the spatially varying and complex motion found in real-world scenarios. This method provides a novel approach by estimating spatially-variant motion information and applying a tailored filtering process. The MISC filter uses a motion estimation network to predict motion flow, mask, kernels, weights, and offsets. It aligns blurring patterns to the motion middle and then collaboratively filters the aligned image using predicted parameters. The paper also analyzes different coupling strategies between the motion estimation network and a residual reconstruction network. The MISC Filter significantly outperforms state-of-the-art methods on complex real-world motion blur datasets like RealBlur-R and RealBlur-J. The method demonstrates strong performance on the GoPro and HIDE datasets, achieving comparable results while requiring half the runtime of some leading methods. Ablation studies validate the contribution of each component within the MISC Filter and demonstrate the effectiveness of the shared-based network coupling strategy. The method currently shows limitations in addressing the low-light degradation often present in hardware-induced blurring (e.g., under-display cameras). Further exploration is needed to optimize the MISC filter for broader applicability in scenarios involving both motion and low-light challenges. motion deblurring, image restoration, misc filter, motion estimation, collaborative filtering
2404.13046 Report MoVA: Adapting Mixture of Vision Experts to Multimodal Context Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pretrained vision encoders such as vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of vision experts. This benefits from the powerful model function understanding ability of the large language model (LLM) equipped with expert-routing low-rank adaptation (LoRA). In the fine-grained stage, we elaborately conduct the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from various experts. This coarse-to-fine paradigm effectively leverages representations from experts based on multimodal context and model expertise, further enhancing the generalization ability. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks. Codes and models will be available at https://github.com/TempleX98/MoVA. Proposes MoVA, a multimodal large language model that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism to enhance multimodal understanding and generalization. Existing MLLMs often rely on single vision encoders (e.g., CLIP) that exhibit inconsistent performance across different tasks and domains, limiting their generalization ability. Uses a context-aware expert routing strategy to select relevant experts based on user input and model expertise, followed by fine-grained expert fusion with MoV-Adapter to extract and integrate task-specific knowledge. Achieves state-of-the-art performance on various MLLM benchmarks, including MMBench, MME, and QBench. Outperforms specialist models on text-oriented VQA benchmarks while exhibiting strong performance on general VQA and visual grounding tasks. Demonstrates robust generalization capabilities across diverse domains, including medical visual question answering and image segmentation. The number of experts used for fusion is limited to three to manage computational costs. Exploring alternative expert routing strategies and incorporating more diverse experts could further improve performance. multimodal large language models, vision encoder, mixture-of-experts, context-aware routing, expert fusion
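A minimal sketch in the spirit of MoVA's coarse-grained stage: score a set of vision experts against the multimodal context and keep the top-k for fusion. The linear scorer and expert interface here are assumptions for illustration; MoVA actually routes with an LLM equipped with expert-routing LoRA and fuses with the MoV-Adapter.

```python
import torch
import torch.nn as nn

class CoarseExpertRouter(nn.Module):
    """Select and mix the top-k vision experts whose expertise best matches the context."""
    def __init__(self, ctx_dim, num_experts, k=3):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(ctx_dim, num_experts)  # stand-in for LLM-based routing

    def forward(self, context_emb, expert_feats):
        # context_emb: (B, ctx_dim) pooled instruction+image embedding
        # expert_feats: list of per-expert feature tensors, each (B, N, D)
        scores = self.scorer(context_emb)                 # (B, num_experts)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)  # keep k most suitable experts
        weights = torch.softmax(topk_val, dim=-1)
        outputs = []
        for b in range(context_emb.size(0)):
            mix = sum(weights[b, i] * expert_feats[j][b]
                      for i, j in enumerate(topk_idx[b].tolist()))
            outputs.append(mix)
        return torch.stack(outputs)  # (B, N, D): weighted mix of selected experts' features
```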
2404.13044 Report Unified Scene Representation and Reconstruction for 3D Large Language Models Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to point clouds, which serve as inputs for LLMs. However, this solution lacks the establishment of 3D point-to-point connections, leading to a deficiency of spatial structure information. Concurrently, the absence of integration and unification between the geometric and semantic representations of the scene culminates in a diminished level of 3D scene understanding. In this paper, we demonstrate the importance of having a unified scene representation and reconstruction framework, which is essential for LLMs in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantic-aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregate 3D decoder. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that our Uni3DR^2 yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8%). When applied to LLMs, our Uni3DR^2-LLM exhibits superior performance over the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0% and +4.2% on the val set and test set, respectively). Furthermore, it outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA. This paper presents Uni3DR^2, a unified scene representation and reconstruction framework for enhancing 3D Large Language Models (LLMs). Existing methods for enabling LLMs to interact with 3D environments suffer from limitations in establishing spatial connections and integrating geometric and semantic information, hindering their performance. Uni3DR^2 leverages frozen pre-trained 2D foundation models (CLIP and SAM) and a multi-scale 3D decoder to extract 3D geometric and semantic representations. These representations are then used for both scene reconstruction and as input for the LLM. Uni3DR^2 achieves superior 3D reconstruction results on ScanNet, improving F-Score by +1.8% over the baseline. Uni3DR^2-LLM surpasses the baseline and state-of-the-art methods on 3D vision-language understanding benchmarks ScanQA and 3DMV-VQA, even without relying on ground truth point clouds. Ablation studies confirm the importance of unified representation and reconstruction, highlighting the contribution of each component to the overall performance. The current method focuses on indoor scenes and may require adaptation for diverse and complex outdoor environments. Future work will explore scaling the approach to enhance more 3D capabilities with LLMs, including 3D scene perception and generation. 3d reconstruction, 3d representation, large language models, vision-language understanding, 3d vision
2404.13040 Report Analysis of Classifier-Free Guidance Weight Schedulers Xi Wang, Nicolas Dufour, Nefeli Andreou, Marie-Paule Cani, Victoria Fernandez Abrevaya, David Picard, Vicky Kalogeiton Classifier-Free Guidance (CFG) enhances the quality and condition adherence of text-to-image diffusion models. It operates by combining the conditional and unconditional predictions using a fixed weight. However, recent works vary the weights throughout the diffusion process, reporting superior results but without providing any rationale or analysis. By conducting comprehensive experiments, this paper provides insights into CFG weight schedulers. Our findings suggest that simple, monotonically increasing weight schedulers consistently lead to improved performances, requiring merely a single line of code. In addition, more complex parametrized schedulers can be optimized for further improvement, but do not generalize across different models and tasks. This paper investigates the impact of dynamic guidance weight schedulers in Classifier-Free Guidance (CFG) for diffusion models, proposing simple yet effective schedulers to improve image generation quality. Static guidance weight in CFG often presents a trade-off between detail and sharpness in generated images. Dynamic schedulers have shown promise but lack comprehensive analysis and justification. The paper explores various heuristic (linear, cosine, etc.) and parameterized (power-cosine, clamping) dynamic schedulers. Their effects are evaluated on class-conditioned image generation (CIFAR-10, ImageNet) and text-to-image generation (Stable Diffusion 1.5 and SDXL) using FID, CLIP-Score, and Diversity metrics. Monotonically increasing schedulers (linear, cosine) consistently outperform static guidance and decreasing schedulers. A simple linear scheduler significantly improves results without additional computational cost or tuning. Parameterized schedulers, like clamp-linear, can further boost performance but require parameter tuning specific to the model and task. Optimal parameters for parameterized schedulers do not generalize across models and datasets. Further investigation is needed to understand the theoretical underpinnings of why dynamic schedulers improve performance. diffusion models, classifier-free guidance, text-to-image generation, dynamic schedulers, image generation
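The paper's headline recommendation (a monotonically increasing guidance weight) really is a one-line change to a standard classifier-free guidance loop. Below is a hedged sketch of such a guided step; the sampler interface and the weight range are assumptions, while the CFG combination and the linear schedule follow the description above.

```python
def cfg_weight(step, num_steps, w_max=7.5, schedule="linear"):
    """Guidance weight at a given sampling step (step 0 = most noisy, num_steps-1 = final)."""
    if schedule == "static":
        return w_max
    # Monotonically increasing linear schedule: weak guidance early, strong guidance late.
    return w_max * (step + 1) / num_steps

def guided_noise(eps_cond, eps_uncond, w):
    # Standard classifier-free guidance combination of conditional/unconditional predictions.
    return eps_uncond + w * (eps_cond - eps_uncond)

# inside a sampling loop (model, scheduler_step, and embeddings are assumed to exist):
# for step, t in enumerate(timesteps):
#     eps_c = model(x, t, prompt_emb)
#     eps_u = model(x, t, null_emb)
#     eps = guided_noise(eps_c, eps_u, cfg_weight(step, len(timesteps)))
#     x = scheduler_step(eps, t, x)
```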
2404.13026 Report PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y. Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, William T. Freeman Realistic object interactions are crucial for creating immersive virtual experiences, yet synthesizing realistic 3D object dynamics in response to novel interactions remains a significant challenge. Unlike unconditional or text-conditioned dynamics generation, action-conditioned dynamics requires perceiving the physical material properties of objects and grounding the 3D motion prediction on these properties, such as object stiffness. However, estimating physical material properties is an open problem due to the lack of material ground-truth data, as measuring these properties for real objects is highly difficult. We present PhysDreamer, a physics-based approach that endows static 3D objects with interactive dynamics by leveraging the object dynamics priors learned by video generation models. By distilling these priors, PhysDreamer enables the synthesis of realistic object responses to novel interactions, such as external forces or agent manipulations. We demonstrate our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study. PhysDreamer takes a step towards more engaging and realistic virtual experiences by enabling static 3D objects to dynamically respond to interactive stimuli in a physically plausible manner. See our project page at https://physdreamer.github.io/. PhysDreamer is a novel method for synthesizing interactive 3D dynamics by imbuing static 3D objects with physically-based material properties learned from video generation models. Realistic object interaction is crucial for immersive virtual experiences. However, existing methods struggle to generate convincing action-conditioned dynamics that realistically capture how objects respond to external forces. PhysDreamer leverages the object dynamics priors learned by video generation models. It generates a plausible motion sequence for a static 3D object using a video generation model, then optimizes a spatially varying material field for the object. This optimization leverages differentiable simulation (MPM) and rendering to match the rendered object motion to the generated motion. PhysDreamer successfully synthesizes realistic interactive dynamics for various elastic objects, including flowers, a plant, a telephone cord, and a beanie hat. User study results show that PhysDreamer significantly outperforms state-of-the-art methods in terms of motion realism and visual quality. The method can benefit from multi-view supervision, improving results for objects with self-occlusion. The approach requires manual object segmentation and specification of boundary conditions. The method is computationally demanding, requiring further optimization for real-time applications. 3d object interaction, physics-based simulation, video generation, material estimation, differentiable rendering
2404.13024 Report BANF: Band-limited Neural Fields for Levels of Detail Reconstruction Ahan Shabanov, Shrisudhan Govindarajan, Cody Reading, Lily Goli, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi Largely due to their implicit nature, neural fields lack a direct mechanism for filtering, as Fourier analysis from discrete signal processing is not directly applicable to these representations. Effective filtering of neural fields is critical to enable level-of-detail processing in downstream applications, and support operations that involve sampling the field on regular grids (e.g. marching cubes). Existing methods that attempt to decompose neural fields in the frequency domain either resort to heuristics or require extensive modifications to the neural field architecture. We show that via a simple modification, one can obtain neural fields that are low-pass filtered, and in turn show how this can be exploited to obtain a frequency decomposition of the entire signal. We demonstrate the validity of our technique by investigating level-of-detail reconstruction, and showing how coarser representations can be computed effectively. This paper introduces BANF, a method for band-limited frequency decomposition in neural fields using a sampling-aware training process that enables low-pass filtering. Effective filtering in neural fields is crucial for level-of-detail processing, anti-aliasing, and applications like marching cubes, but traditional Fourier analysis is not directly applicable. BANF samples the neural field on a regular grid, applies a band-limited interpolation kernel (e.g., linear, sinc), and incorporates this interpolated output into the training loss, approximating low-pass filtering during optimization. A cascaded training scheme then enables multi-scale representation. BANF successfully decomposes signals into frequency bands, enabling multi-scale reconstruction for images and signed distance fields (SDFs). It outperforms baselines in level-of-detail surface reconstruction from multi-view images, especially at coarser scales, demonstrating its anti-aliasing capabilities. The method is agnostic to the underlying neural field architecture, working with both fully-connected and hybrid representations. The current implementation is memory intensive at high resolutions. The paper primarily focuses on uniformly sampled signals, and extending it to contracted representations used in NeRFs for unbounded signals is left for future work. neural fields, frequency decomposition, anti-aliasing, level-of-detail, multi-scale representation
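A minimal sketch of the sampling-aware idea as described (not the authors' implementation): evaluate the neural field on a regular grid and reconstruct query values with a band-limited (here bilinear) interpolation kernel, so that supervising this interpolated output low-pass filters the field. The 2D setting and grid resolution are assumptions.

```python
import torch
import torch.nn.functional as F

def lowpass_field(field_mlp, query_xy, grid_res=32):
    """Band-limited evaluation of a 2D field (coords in [-1, 1]) at query points (M, 2)."""
    lin = torch.linspace(-1, 1, grid_res)
    gy, gx = torch.meshgrid(lin, lin, indexing="ij")
    grid_coords = torch.stack([gx, gy], dim=-1).reshape(-1, 2)
    # Sample the field on a regular grid; grid resolution sets the band limit.
    grid_vals = field_mlp(grid_coords).reshape(1, grid_res, grid_res, -1).permute(0, 3, 1, 2)
    # Reconstruct query values with a bilinear interpolation kernel.
    q = query_xy.reshape(1, 1, -1, 2)
    out = F.grid_sample(grid_vals, q, mode="bilinear", align_corners=True)
    return out.reshape(grid_vals.shape[1], -1).T  # (M, C); supervise this in the training loss
```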
2404.13013 Report Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: https://groma-mllm.github.io/. Introducing Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception abilities for tasks like region captioning and visual grounding. Current MLLMs lack localization capabilities, limiting their real-world applications in areas like robotics and augmented reality. Groma addresses this by enabling region-level understanding and grounding. Groma integrates localized visual tokenization: an image is decomposed into regions of interest, encoded into region tokens, and integrated into user instructions and model responses. Outperforms comparable MLLMs on referring and grounding benchmarks. Demonstrates strong image-level understanding and reasoning on conversational VQA benchmarks. Exhibits robust and precise localization capabilities, surpassing alternative methods on the LVIS-Ground benchmark by a significant margin (over 10% AR). Current implementation doesn't support free-form region inputs and pixel-level grounding. Future work involves exploring visual samplers for the region encoder and mask region proposers to address these limitations. multimodal large language models, visual grounding, region captioning, localized visual tokenization, grounded chat
2404.12940 Report Neural Flow Diffusion Models: Learnable Forward Process for Improved Diffusion Modelling Grigory Bartosh, Dmitry Vetrov, Christian A. Naesseth Conventional diffusion models typically rely on a fixed forward process, which implicitly defines complex marginal distributions over latent variables. This can often complicate the reverse process' task in learning generative trajectories, and results in costly inference for diffusion models. To address these limitations, we introduce Neural Flow Diffusion Models (NFDM), a novel framework that enhances diffusion models by supporting a broader range of forward processes beyond the fixed linear Gaussian. We also propose a novel parameterization technique for learning the forward process. Our framework provides an end-to-end, simulation-free optimization objective, effectively minimizing a variational upper bound on the negative log-likelihood. Experimental results demonstrate NFDM's strong performance, evidenced by state-of-the-art likelihood estimation. Furthermore, we investigate NFDM's capacity for learning generative dynamics with specific characteristics, such as deterministic straight-line trajectories. This exploration underscores NFDM's versatility and its potential for a wide range of applications. The paper introduces Neural Flow Diffusion Models (NFDM), a novel framework that enhances diffusion models by allowing for flexible and learnable forward processes, going beyond fixed linear Gaussian processes. The fixed forward process in conventional diffusion models limits the flexibility of the latent space and complicates the learning process for the reverse process. NFDM addresses this limitation, leading to improved performance and versatility. NFDM implicitly defines the forward process through a learnable transformation. The paper proposes an end-to-end, simulation-free optimization procedure that minimizes a variational upper bound on the negative log-likelihood. NFDM achieves state-of-the-art likelihood estimation results on CIFAR-10, ImageNet 32, and ImageNet 64 datasets. The framework allows for learning generative processes with specific characteristics, such as deterministic straight-line trajectories. NFDM with curvature penalization on trajectories (NFDM-OT) demonstrates improved computational efficiency and enhanced generation quality with fewer sampling steps. The use of neural networks for parameterizing the forward process increases computational costs compared to conventional diffusion models. The chosen Gaussian parameterization for the forward process, while effective, may not be optimal, and exploring alternative parameterizations is left for future research. diffusion models, generative models, variational inference, learnable forward process, likelihood estimation
2404.12887 Report 3D Multi-frame Fusion for Video Stabilization Zhan Peng, Xinyi Ye, Weiyue Zhao, Tianqi Liu, Huiqiang Sun, Baopu Li, Zhiguo Cao In this paper, we present RStab, a novel framework for video stabilization that integrates 3D multi-frame fusion through volume rendering. Departing from conventional methods, we introduce a 3D multi-frame perspective to generate stabilized images, addressing the challenge of full-frame generation while preserving structure. The core of our RStab framework lies in Stabilized Rendering (SR), a volume rendering module that fuses multi-frame information in 3D space and extends beyond image fusion by incorporating feature fusion. Specifically, SR involves warping features and colors from multiple frames by projection, fusing them into descriptors to render the stabilized image. However, the precision of warped information depends on the projection accuracy, a factor significantly influenced by dynamic regions. In response, we introduce the Adaptive Ray Range (ARR) module to integrate depth priors, adaptively defining the sampling range for the projection process. Additionally, we propose Color Correction (CC) assisting geometric constraints with optical flow for accurate color aggregation. Thanks to the three modules, our RStab demonstrates superior performance compared with previous stabilizers in the field of view (FOV), image quality, and video stability across various datasets. This paper introduces RStab, a novel video stabilization framework that uses 3D multi-frame fusion via volume rendering for full-frame generation and structure preservation. Existing 2D video stabilization methods struggle with either full-frame generation or preserving structure, while 3D methods often have limited field of view. RStab overcomes these limitations. RStab leverages Stabilized Rendering (SR), a 3D multi-frame fusion module based on volume rendering. It incorporates the Adaptive Ray Range (ARR) module for defining sampling ranges using depth priors and the Color Correction (CC) module for accurate color aggregation via optical flow. RStab achieves full-frame video stabilization without aggressive cropping. RStab outperforms previous state-of-the-art methods on various benchmark datasets (NUS, Selfie, DeepStab). Ablation studies confirm the importance of each module (SR, ARR, CC) for achieving superior performance. The reliance on pre-trained depth and optical flow models might impact performance if those models fail. Future work could explore joint optimization of depth/flow estimation with the proposed modules for better efficiency. video stabilization, 3d multi-frame fusion, volume rendering, structure preservation, full-frame generation
2404.12803 Report TextSquare: Scaling up Text-Centric Visual Instruction Tuning Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench(62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M. This paper introduces Square-10M, a large-scale, high-quality dataset for text-centric Visual Question Answering (VQA) instruction tuning, and TextSquare, a text-centric Multimodal Large Language Model (MLLM) trained on this dataset. Open-source MLLMs lag behind closed-source models in text-centric VQA due to the lack of extensive, high-quality instruction tuning data. This work aims to bridge this gap by providing such a dataset. The Square-10M dataset is created using a four-step process called Square: Self-Questioning, Answering, Reasoning, and Evaluation. This involves using a closed-source MLLM (Gemini Pro) to generate VQA pairs with reasoning and then filtering them for quality. TextSquare is then trained on Square-10M and a collection of in-domain datasets. TextSquare outperforms previous open-source text-centric MLLMs and achieves comparable or superior performance to state-of-the-art closed-source models on various benchmarks. The inclusion of VQA reasoning data in Square-10M is shown to improve model performance and mitigate hallucinations. Experiments reveal a scaling law: increasing the scale of instruction tuning data leads to better model performance, demonstrating the effectiveness and necessity of large, high-quality datasets like Square-10M. Training large-scale models on massive datasets requires significant computational resources. While the Square strategy enhances data quality, it still falls short of human-level performance. multimodal large language models, text-centric visual question answering, instruction tuning, dataset creation, reasoning
2404.12794 Report MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model Kang Zeng, Hao Shi, Jiacheng Lin, Siyu Li, Jintao Cheng, Kaiwei Wang, Zhiyong Li, Kailun Yang LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment moving objects in point clouds of the current scan using motion information from previous scans. Despite the promising results achieved by previous MOS methods, several key issues, such as the weak coupling of temporal and spatial information, still need further study. In this paper, we propose a novel LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model, termed MambaMOS. Firstly, we develop a novel embedding module, the Time Clue Bootstrapping Embedding (TCBE), to enhance the coupling of temporal and spatial information in point clouds and alleviate the issue of overlooked temporal clues. Secondly, we introduce the Motion-aware State Space Model (MSSM) to endow the model with the capacity to understand the temporal correlations of the same object across different time steps. Specifically, MSSM emphasizes the motion states of the same object at different time steps through two distinct temporal modeling and correlation steps. We utilize an improved state space model to represent these motion differences, significantly modeling the motion states. Finally, extensive experiments on the SemanticKITTI-MOS and KITTI-Road benchmarks demonstrate that the proposed MambaMOS achieves state-of-the-art performance. The source code of this work will be made publicly available at https://github.com/Terminal-K/MambaMOS. This paper introduces MambaMOS, a novel LiDAR-based 3D Moving Object Segmentation framework with Motion-aware State Space Model to address the weak coupling of temporal and spatial information in existing methods. Moving object segmentation is crucial for autonomous driving systems, ensuring stable operation by providing accurate dynamic scene understanding and assisting in removing ghost effects during mapping. MambaMOS leverages a U-Net architecture with Time Clue Bootstrapping Embedding (TCBE) and a Motion-aware State Space Model (MSSM). TCBE enhances temporal-spatial coupling in shallow layers, while MSSM, based on the State Space Model, achieves deep-level coupling by interacting with single-scan and multi-scan features. MambaMOS achieves state-of-the-art performance on SemanticKITTI-MOS and KITTI-Road benchmarks. It effectively segments distant moving objects even with sparse point clouds by emphasizing temporal information. The method shows strong generalization ability, achieving superior results on KITTI-Road after fine-tuning with limited data. The reliance on accurate pose information for scan alignment. Further exploration of more effective serialization techniques for better capturing spatial context. moving object segmentation, state space model, spatio-temporal fusion, lidar point cloud, autonomous driving
2404.12784 Report Contrastive Gaussian Clustering: Weakly Supervised 3D Scene Segmentation Myrna C. Silva, Mahtab Dahaghin, Matteo Toso, Alessio Del Bue We introduce Contrastive Gaussian Clustering, a novel approach capable of providing segmentation masks from any viewpoint and enabling 3D segmentation of the scene. Recent works in novel-view synthesis have shown how to model the appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate images from a given viewpoint by projecting the Gaussians onto it before alpha-blending their color. Following this example, we train a model to also include a segmentation feature vector for each Gaussian. These can then be used for 3D scene segmentation, by clustering Gaussians according to their feature vectors; and to generate 2D segmentation masks, by projecting the Gaussians on a plane and alpha-blending over their segmentation features. Using a combination of contrastive learning and spatial regularization, our method can be trained on inconsistent 2D segmentation masks, and still learn to generate segmentation masks consistent across all views. Moreover, the resulting model is extremely accurate, improving the IoU accuracy of the predicted masks by +8% over the state of the art. Code and trained models will be released soon. Introduces Contrastive Gaussian Clustering, a novel method for 3D scene segmentation using 3D Gaussian Splatting with a 3D feature field and contrastive learning. Addresses the challenge of limited annotated 3D scene datasets by leveraging readily available 2D image segmentation data and handles inconsistent 2D masks to learn consistent 3D segmentation. Augments 3D Gaussians with feature vectors, uses contrastive learning to maximize similarity within segments and minimize between, and employs spatial regularization for feature continuity. Significantly outperforms LERF, Gaussian Grouping, and LangSplat in mIoU and mBIoU on LERF-Mask and 3D-OVS datasets. Learns multi-view consistency, enabling accurate 3D segmentation from inconsistent 2D masks. Enables real-time rendering of novel segmentation masks and 3D object selection. Higher computational cost compared to standard 3DGS. Performance depends on the accuracy of initial 2D segmentations and object localization. 3d scene segmentation, contrastive learning, 3d gaussian splatting, novel view synthesis, weakly supervised learning
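The contrastive objective can be illustrated independently of the splatting renderer: given per-pixel feature vectors rendered from the Gaussians and a (possibly view-inconsistent) 2D segmentation mask, pull features within the same segment together and push different segments apart. This sketch is only an assumption about the general loss form, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_clustering_loss(feat_map, seg_mask, margin=0.5):
    """feat_map: (H, W, D) rendered per-pixel features; seg_mask: (H, W) integer segment ids."""
    feats = F.normalize(feat_map.reshape(-1, feat_map.shape[-1]), dim=-1)
    labels = seg_mask.reshape(-1)
    ids = labels.unique()
    # Mean (prototype) feature of each 2D segment in this view.
    protos = F.normalize(torch.stack([feats[labels == i].mean(0) for i in ids]), dim=-1)
    # Pull: each pixel feature toward its own segment prototype.
    own = protos[torch.searchsorted(ids, labels)]
    pull = (1 - (feats * own).sum(-1)).mean()
    # Push: prototypes of different segments apart (hinge on cosine similarity).
    sim = protos @ protos.T
    off_diag = sim[~torch.eye(len(ids), dtype=torch.bool)]
    push = F.relu(off_diag - margin).mean() if off_diag.numel() else sim.new_zeros(())
    return pull + push
```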
2404.12547 Report Does Gaussian Splatting need SFM Initialization? Yalda Foroutan, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi 3D Gaussian Splatting has recently been embraced as a versatile and effective method for scene reconstruction and novel view synthesis, owing to its high-quality results and compatibility with hardware rasterization. Despite its advantages, Gaussian Splatting's reliance on high-quality point cloud initialization by Structure-from-Motion (SFM) algorithms is a significant limitation to be overcome. To this end, we investigate various initialization strategies for Gaussian Splatting and delve into how volumetric reconstructions from Neural Radiance Fields (NeRF) can be utilized to bypass the dependency on SFM data. Our findings demonstrate that random initialization can perform much better if carefully designed and that by employing a combination of improved initialization strategies and structure distillation from low-cost NeRF models, it is possible to achieve equivalent results, or at times even superior, to those obtained from SFM initialization. This paper investigates initialization strategies for 3D Gaussian Splatting, aiming to remove the dependence on Structure-from-Motion (SfM) data by leveraging Neural Radiance Fields (NeRF). Gaussian Splatting relies on high-quality point cloud initialization from SfM, which is computationally expensive and can be unreliable in certain scenarios like SLAM or autonomous vehicle applications. The authors experiment with different initialization strategies: 1) improved random initialization within a large bounding box, 2) point cloud initialization from a pre-trained NeRF model, and 3) depth supervision from a pre-trained NeRF model during Gaussian Splatting training. Carefully designed random initialization, specifically a large uniform initialization, outperforms previous attempts and achieves competitive results. Initializing Gaussian Splatting with points sampled from a pre-trained NeRF model surpasses random initialization and, in some cases, even outperforms SfM initialization. Adding depth supervision from the pre-trained NeRF model further improves the performance of Gaussian Splatting, achieving the best overall results. The proposed method still requires camera calibration, which is often obtained from SfM. The performance of the NeRF pre-training can be sensitive to the scene, requiring further research to automate the NeRF configuration process. gaussian splatting, nerf, sfm, initialization, depth supervision
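The "carefully designed random initialization" that the paper finds competitive amounts to seeding the Gaussians uniformly inside a large bounding box around the cameras instead of from an SfM point cloud. A minimal sketch follows; the box scale, point count, and attribute defaults are assumptions.

```python
import numpy as np

def random_gaussian_init(camera_centers, num_points=100_000, scale=3.0, rng=None):
    """Uniform random point/color initialization inside an enlarged scene bounding box."""
    rng = rng or np.random.default_rng(0)
    lo, hi = camera_centers.min(0), camera_centers.max(0)
    center, extent = (lo + hi) / 2, (hi - lo).max()
    half = scale * extent / 2  # enlarge the camera bounding box by `scale`
    xyz = rng.uniform(center - half, center + half, size=(num_points, 3))
    rgb = rng.uniform(0.0, 1.0, size=(num_points, 3))  # random initial colors
    return xyz, rgb

# xyz, rgb = random_gaussian_init(np.asarray(cam_positions))  # feed to the 3DGS trainer instead of SfM points
```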
2404.12541 Report GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models Sai Sree Harsha, Ambareesh Revanur, Dhwanit Agarwal, Shradha Agrawal Video editing methods based on diffusion models that rely solely on a text prompt for the edit are hindered by the limited expressive power of text prompts. Thus, incorporating a reference target image as a visual guide becomes desirable for precise control over edit. Also, most existing methods struggle to accurately edit a video when the shape and size of the object in the target image differ from the source object. To address these challenges, we propose "GenVideo" for editing videos leveraging target-image aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit using our novel target and shape aware InvEdit masks. Further, we propose a novel target-image aware latent noise correction strategy during inference to improve the temporal consistency of the edits. Experimental analyses indicate that GenVideo can effectively handle edits with objects of varying shapes, where existing approaches fail. GenVideo, a novel framework for editing videos using target-image aware text-to-image (T2I) diffusion models. Existing video editing methods based on diffusion models struggle to make temporally consistent edits when the shape and size of the object in the target image differ from the source object. They are also often limited by the expressive power of text prompts. GenVideo leverages target-image aware T2I models and introduces two novel components: InvEdit and latent correction. InvEdit generates target-image and shape-aware masks to identify regions of interest. Latent correction improves temporal consistency by blending inter-frame latents. GenVideo can effectively handle edits with objects of varying shapes and sizes, outperforming existing methods. InvEdit masks accurately identify regions of interest, enabling localized edits. Latent correction strategy improves the temporal consistency of edits, even for objects with substantial shape differences. The quality of edits is limited by the underlying T2I model. Fine-grained inconsistencies may remain, especially for complex objects. video editing, diffusion models, target-image awareness, temporal consistency, invedit
2404.12391 Report On the Content Bias in Fréchet Video Distance Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang Fr\'echet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to conflict with human perception occasionally. In this paper, we aim to explore the extent of FVD's bias toward per-frame quality over temporal realism and identify its sources. We first quantify the FVD's sensitivity to the temporal axis by decoupling the frame and motion quality and find that the FVD increases only slightly with large temporal corruption. We then analyze the generated videos and show that via careful sampling from a large set of generated videos that do not contain motions, one can drastically decrease FVD without improving the temporal quality. Both studies suggest FVD's bias towards the quality of individual frames. We further observe that the bias can be attributed to the features extracted from a supervised video classifier trained on the content-biased dataset. We show that FVD with features extracted from the recent large-scale self-supervised video models is less biased toward image quality. Finally, we revisit a few real-world examples to validate our hypothesis. This paper presents a systematic study quantifying the bias of Fréchet Video Distance (FVD) towards per-frame quality over temporal realism in video generation. Accurately evaluating the quality and diversity of generated videos is crucial with the rapid progress in video generation, and understanding the limitations of widely used metrics like FVD is essential. The authors analyze FVD's sensitivity to temporal aspects by: (1) Distorting videos with controlled spatial and spatiotemporal corruptions, (2) Probing the perceptual null space by resampling generated videos to minimize FVD without improving temporal quality, and (3) Examining real-world examples where FVD contradicts human perception. FVD exhibits low sensitivity to temporal inconsistencies, often favoring videos with better frame quality over temporal realism. Resampling generated videos without motion can still significantly reduce FVD, indicating a large perceptual null space where temporal quality is disregarded. FVD computed with features from self-supervised video models (e.g., VideoMAE-v2) trained on diverse datasets is less biased towards frame quality and more sensitive to temporal inconsistencies. The impact of resizing high-resolution generated videos and handling non-square aspect ratios on FVD remains unexplored. Computing FVD with self-supervised features for long videos is computationally expensive and requires further investigation. video generation, evaluation metrics, fréchet video distance (fvd), content bias, self-supervised learning
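For reference, FVD itself is the Fréchet (2-Wasserstein) distance between Gaussians fitted to video features of the real and generated sets; what the paper shows to matter is the feature extractor (supervised I3D vs. self-supervised VideoMAE-v2). A generic sketch over precomputed feature matrices:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_video_distance(feats_real, feats_gen):
    """feats_*: (N, D) video-level features from some extractor (e.g., I3D or VideoMAE-v2)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```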
2404.12390 Report BLINK: Multimodal Large Language Models Can See but Not Perceive Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception. BLINK, a new benchmark for multimodal language models (LLMs), focuses on core visual perception abilities like depth estimation, correspondence, and 3D reasoning, which are often overlooked in other evaluations. Existing multimodal LLM benchmarks often conflate perception with language knowledge and reasoning, primarily evaluating perception as a dense captioning task. BLINK aims to highlight and assess the nuanced perception capabilities of LLMs, going beyond recognition-based tasks. BLINK reimagines 14 classic computer vision problems, ranging from low-level pattern matching to high-level visual understanding, into 3,807 multiple-choice questions paired with images and visual prompts. These tasks are designed to be easily solvable by humans but difficult to address through dense captioning alone. Humans achieve 95.70% average accuracy on BLINK, while even the best-performing LLMs (GPT-4V, Gemini) struggle, achieving accuracies of 51.26% and 45.72% respectively. Multimodal LLMs show relative strengths in mid-level perception tasks like spatial reasoning and counting but struggle with pixel-level tasks like relative reflectance. Specialist computer vision models significantly outperform LLMs on BLINK tasks, suggesting potential for improvement by integrating insights from specialized models. BLINK relies on existing image datasets and may not encompass all real-world visual perception abilities. Future work could explore incorporating a wider range of visual perception tasks and developing novel evaluation metrics that better capture the nuanced aspects of visual understanding in LLMs. multimodal llms, visual perception, benchmarking, computer vision, artificial intelligence
2404.12389 Report Moving Object Segmentation: All You Need Is SAM (and Flow) Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful,and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks. This paper explores adapting the Segment Anything Model (SAM) for moving object segmentation in videos, introducing two methods: FlowSAM, which uses optical flow as input, and MotionSAM, which uses optical flow as a prompt for guiding SAM on RGB inputs. Moving object segmentation is a challenging task, and SAM, despite its success in image segmentation, needs adaptation for video. This paper investigates simple yet effective ways to leverage SAM’s power for this task. The paper introduces FlowSAM, which fine-tunes SAM on optical flow inputs, and MotionSAM, which uses a trainable prompt generator to feed flow-derived prompts to SAM processing RGB frames. They further propose a sequence-level mask association method for maintaining object identity across frames. FlowSAM with flow-only inputs outperforms previous methods by a large margin (>10%) on moving object segmentation benchmarks. MotionSAM, using RGB+flow, achieves state-of-the-art performance, especially excelling at multi-object benchmarks. Combining FlowSAM and MotionSAM further boosts performance, demonstrating the complementary roles of flow and RGB modalities. The methods suffer from extended running time due to SAM’s computationally heavy image encoder. The sequence-wise association, while strong, can be improved with longer temporal context. motion segmentation, video object segmentation, segment anything model (sam), optical flow, motion-based object discovery
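A hedged sketch of the second variant's idea (flow as a prompt for SAM on RGB): instead of the paper's trainable prompt generator, simply take high-magnitude optical-flow locations as point prompts for the off-the-shelf SAM predictor. The thresholding heuristic and checkpoint path are assumptions.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # example checkpoint path
predictor = SamPredictor(sam)

def segment_moving_object(rgb, flow, num_points=5):
    """rgb: (H, W, 3) uint8 frame; flow: (H, W, 2) optical flow to a neighboring frame."""
    predictor.set_image(rgb)
    mag = np.linalg.norm(flow, axis=-1)
    # Crude stand-in for the learned prompt generator: strongest-motion pixels as positive points.
    ys, xs = np.unravel_index(np.argsort(mag.ravel())[-num_points:], mag.shape)
    points = np.stack([xs, ys], axis=-1)          # SAM expects (x, y) coordinates
    labels = np.ones(num_points, dtype=np.int64)  # all positive prompts
    masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels)
    return masks[np.argmax(scores)]               # keep the highest-scoring mask
```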
2404.12388 Report VideoGigaGAN: Towards Detail-rich Video Super-Resolution Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, Difan Liu Video super-resolution (VSR) approaches have shown impressive temporal consistency in upsampled videos. However, these approaches tend to generate blurrier results than their image counterparts as they are limited in their generative capability. This raises a fundamental question: can we extend the success of a generative image upsampler to the VSR task while preserving the temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that can produce videos with high-frequency details and temporal consistency. VideoGigaGAN builds upon a large-scale image upsampler -- GigaGAN. Simply inflating GigaGAN to a video model by adding temporal modules produces severe temporal flickering. We identify several key issues and propose techniques that significantly improve the temporal consistency of upsampled videos. Our experiments show that, unlike previous VSR methods, VideoGigaGAN generates temporally consistent videos with more fine-grained appearance details. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with 8x super-resolution. Introducing VideoGigaGAN, the first large-scale GAN-based model for video super-resolution, generating high-frequency details and temporal consistency. Existing VSR models struggle to balance temporal consistency with generating realistic high-frequency details. Building upon GigaGAN, the authors add: 1) temporal modules (convolution and attention) to the decoder, 2) a flow-guided feature propagation module, 3) anti-aliasing blocks in the encoder, and 4) a high-frequency shuttle mechanism. VideoGigaGAN generates sharper, more detailed videos than state-of-the-art VSR methods. The model successfully performs 8x video upsampling with good detail and temporal consistency. A new metric, Referenced Warping Error (RWE), is proposed for evaluating temporal consistency in VSR. Struggles with extremely long videos due to optical flow inaccuracies. Performance degrades with very small objects (e.g. text) due to information loss in the LR input. video super-resolution, generative adversarial networks, temporal consistency, high-frequency details, gigagan
2404.12387 Report Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized values for their respective compute class. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively to GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively to other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at http://chat.reka.ai . A showcase of non cherry picked qualitative examples can also be found at http://showcase.reka.ai . This paper introduces Reka Core, Flash, and Edge, a series of multimodal language models (MLLMs) trained from scratch by Reka. Reka models are important because they are state-of-the-art for their compute class, outperforming many much larger models. They can process text, images, video, and audio, achieving competitive performance to other frontier models on various benchmarks. The models use a modular encoder-decoder transformer architecture, trained on a massive dataset of text and multimodal data. They are aligned and instruction-tuned using supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Reka Core approaches the performance of GPT-4V on image question answering and outperforms Claude 3 on multimodal chat. Reka Flash outperforms GPT-3.5 Turbo and models much larger in size, like Grok-1 and Gemini Pro 1.0. Reka Edge surpasses other state-of-the-art 7B models such as Gemma 7B and Mistral 7B. Reka Core has not finished training and is still being improved. Limited details about the tool-use, function calling, and web search capabilities are provided in the report. multimodal language models, large language models, computer vision, natural language processing, benchmarking
2404.12386 Report SOHES: Self-supervised Open-world Hierarchical Entity Segmentation Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, Handong Zhao, Ani Nenkova, Liang-Yan Gui, Tong Sun, Yu-Xiong Wang Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noises in pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES.github.io. This paper introduces SOHES, a self-supervised open-world hierarchical entity segmentation approach that eliminates the reliance on human annotations. Existing open-world entity segmentation models rely heavily on costly human-annotated datasets, limiting their scalability and practicality. SOHES operates in three self-supervised phases: 1) Self-exploration: generates initial pseudo-labels by clustering visual features from a pre-trained DINO representation. 2) Self-instruction: trains a segmentation model (DINO backbone + Mask2Former) on the pseudo-labels to refine segmentation. 3) Self-correction: further refines the model using a teacher-student mutual-learning framework. SOHES achieves state-of-the-art performance in self-supervised open-world segmentation, significantly closing the gap with supervised methods. The method effectively segments both whole entities and their constituent parts, providing a hierarchical understanding of visual scenes. SOHES-trained ViT backbones demonstrate improved performance on downstream dense prediction tasks like semantic segmentation and object detection. SOHES may struggle with discontinuous or occluded entities, text overlays, and blurry backgrounds. Future work will explore improved pseudo-labeling strategies to address these limitations. self-supervised learning, open-world segmentation, hierarchical segmentation, entity segmentation, teacher-student learning
2404.12385 Report MeshLRM: Large Reconstruction Model for High-Quality Mesh Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, Zexiang Xu We propose MeshLRM, a novel LRM-based approach that can reconstruct a high-quality mesh from merely four input images in less than one second. Different from previous large reconstruction models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering within the LRM framework. This allows for end-to-end mesh reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering. Moreover, we improve the LRM architecture by simplifying several complex designs in previous LRMs. MeshLRM's NeRF initialization is sequentially trained with low- and high-resolution images; this new LRM training strategy enables significantly faster convergence and thereby leads to better quality with less compute. Our approach achieves state-of-the-art mesh reconstruction from sparse-view inputs and also allows for many downstream applications, including text-to-3D and single-image-to-3D generation. Project page: https://sarahweiii.github.io/meshlrm/ Presents MeshLRM, a novel LRM-based framework that integrates differentiable mesh extraction and rendering for end-to-end few-shot high-quality mesh reconstruction. High-quality 3D meshes are essential for various applications, and existing methods for mesh reconstruction are either time-consuming or require dense input images. The method leverages a transformer-based LRM architecture with simplified image tokenization and triplane decoding. It incorporates differentiable marching cubes and rendering for end-to-end mesh optimization and introduces a ray opacity loss to stabilize training. Achieves state-of-the-art mesh reconstruction from sparse-view inputs, outperforming existing feed-forward and optimization-based methods. Significantly faster than per-scene optimization methods, enabling mesh reconstruction in less than one second. Demonstrates strong generalization ability on real datasets and enables high-quality text-to-3D and image-to-3D generation. Limited robustness for scenes with complex materials due to the assumption of Lambertian appearance. Requires input camera poses, which can be challenging to obtain accurately for real captures. sparse-view reconstruction, high-quality mesh, large reconstruction models, differentiable rendering, 3d generation
2404.12382 Report Lazy Diffusion Transformer for Interactive Image Editing Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image. Introduces LazyDiffusion, a diffusion transformer model that efficiently generates partial image updates for interactive image editing by processing only masked regions. Existing diffusion-based inpainting methods are computationally expensive, regenerating the entire image or relying on limited local context, hindering interactivity and global consistency. LazyDiffusion employs an encoder-decoder architecture. The encoder compresses the full image and mask into a compact global context. The decoder, a diffusion transformer, then iteratively generates only the masked pixels conditioned on this context and the text prompt. Achieves up to 10x speedup over full-image inpainting methods for small masks typical in interactive editing. Maintains competitive image quality and fidelity compared to state-of-the-art inpainting models. Demonstrates the effectiveness of compressed global context in preserving semantic information for coherent inpainting. The context encoder's quadratic scaling with input size may limit scalability to very high-resolution images. Occasional color discrepancies between generated and visible regions require further investigation for more principled solutions. image inpainting, diffusion models, transformers, interactive image editing, context encoding
2404.12352 Report Point-In-Context: Understanding Point Cloud via In-Context Learning Mengyuan Liu, Zhongbin Fang, Xia Li, Joachim M. Buhmann, Xiangtai Li, Chen Change Loy With the emergence of large-scale models trained on diverse datasets, in-context learning has emerged as a promising paradigm for multitasking, notably in natural language processing and image processing. However, its application in 3D point cloud tasks remains largely unexplored. In this work, we introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning. We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module and proposing a vanilla version of PIC called Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model for various 3D point cloud tasks, with inputs and outputs modeled as coordinates. In this paradigm, the challenging segmentation task is achieved by assigning label points with XYZ coordinates for each category; the final prediction is then chosen based on the label point closest to the predictions. To overcome the limitation of the fixed label-coordinate assignment, which generalizes poorly to novel classes, we propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S), aimed at improving dynamic context labeling and model training. By utilizing dynamic in-context labels and extra in-context pairs, PIC-S achieves enhanced performance and generalization capability in and across part segmentation datasets. PIC is a general framework, so other tasks or datasets can be seamlessly introduced through a unified data format. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks and segmenting multiple datasets. Our PIC-S is capable of generalizing to unseen datasets and performing novel part segmentation by customizing prompts. This paper introduces Point-In-Context (PIC), the first in-context learning framework for 3D point cloud understanding. In-context learning enables efficient model adaptation and generalization without parameter updates, addressing resource constraints associated with large-scale model fine-tuning. PIC leverages a Joint Sampling module to overcome information leakage and data disarray in 3D point clouds. Two versions are proposed: PIC-G for multitasking and PIC-S for part segmentation. PIC-S further introduces In-Context Labeling and In-Context Enhancing strategies for dynamic context-aware segmentation. PIC-G achieves state-of-the-art results on a multitask benchmark comprising reconstruction, denoising, registration, and part segmentation tasks. PIC-S outperforms existing methods on a large-scale Human & Object Segmentation benchmark. PIC-S demonstrates strong generalization capabilities, effectively segmenting unseen datasets like AKB-48. The performance of PIC is dependent on the quality of prompts, suggesting potential improvements through better prompt selection. The random label assignment in PIC-S, while enabling generalization, can pose challenges for model training. in-context learning, point cloud analysis, multi-task learning, part segmentation, 3d vision
2404.12347 Report AniClipart: Clipart Animation with Text-to-Video Priors Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes. AniClipart is a system that transforms static clipart images into high-quality motion sequences guided by text prompts, preserving visual identity and achieving motion consistency. Automating clipart animation is crucial due to the increasing demand and the labor-intensive nature of traditional methods. Existing text-to-video models struggle to retain clipart's visual style and generate cartoon-style motions. AniClipart defines keypoints on clipart, assigns Bézier curve trajectories, and leverages Video Score Distillation Sampling (VSDS) loss to optimize trajectories based on text-to-video priors. A differentiable As-Rigid-As-Possible deformation maintains shape rigidity during animation. AniClipart outperforms existing image-to-video generation models in text-video alignment, visual identity preservation, and motion consistency. Ablation studies confirm the importance of ARAP deformation, Bézier-driven animation, skeleton loss, and VSDS loss for high-quality results. AniClipart is extended to handle layered animation, accommodating topological changes for more complex animations. AniClipart's animation diversity is limited by the capabilities of current text-to-video models. Generating motions that significantly deviate from the initial clipart pose remains challenging due to limitations in video model capacity. clipart animation, text-to-video generation, score distillation sampling, as-rigid-as-possible deformation, bézier curves
2404.12333 Report Customizing Text-to-Image Diffusion with Camera Viewpoint Control Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu Model customization introduces new concepts to existing text-to-image models, enabling the generation of the new concept in novel contexts. However, such methods lack accurate camera view control w.r.t the object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of camera viewpoint for model customization. This allows us to modify object properties amongst various background scenes via text prompts, all while incorporating the target camera pose as additional control. This new task presents significant challenges in merging a 3D representation from the multi-view images of the new concept with a general, 2D text-to-image model. To bridge this gap, we propose to condition the 2D diffusion process on rendered, view-dependent features of the new object. During training, we jointly adapt the 2D diffusion modules and 3D feature predictions to reconstruct the object's appearance and geometry while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model personalization baselines in preserving the custom object's identity while following the input text prompt and the object's camera pose. This paper introduces CustomDiffusion360, a method for customizing text-to-image diffusion models with explicit control over the camera viewpoint of newly introduced objects. Existing model customization techniques lack precise control over the camera viewpoint of the generated objects, hindering users' ability to generate diverse and specific outputs. CustomDiffusion360 bridges the gap between 3D neural representations of custom objects and 2D text-to-image diffusion models by leveraging a novel pose-conditioned transformer block. This block uses FeatureNeRF, a module that learns to predict 3D features from multi-view images of the custom object and renders them into 2D features conditioned on the target camera pose. These rendered features are then fused with the diffusion model's internal features to guide the generation process. CustomDiffusion360 outperforms existing image editing and model personalization techniques in generating high-quality images that accurately reflect the target object's identity, camera pose, and input text prompt. The method generalizes well to novel camera viewpoints, even those outside the training distribution. CustomDiffusion360 can be combined with other image editing techniques for tasks like object in-painting and panorama generation, enabling more creative applications. The method may struggle to generalize to extreme camera poses significantly different from the training data. Generating scenes with multiple custom objects and ensuring their accurate pose control remains an open challenge. text-to-image synthesis, model customization, camera pose control, diffusion models, nerf
2404.12168 Report Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization Insoo Kim, Jae Seok Choi, Geonseok Seo, Kinam Kwon, Jinwoo Shin, Hyong-Euk Lee As recent advances in mobile camera technology have enabled the capability to capture high-resolution images, such as 4K images, the demand for an efficient deblurring model handling large motion has increased. In this paper, we discover that the image residual errors, i.e., blur-sharp pixel differences, can be grouped into some categories according to their motion blur type and how complex their neighboring pixels are. Inspired by this, we decompose the deblurring (regression) task into blur pixel discretization (pixel-level blur classification) and discrete-to-continuous conversion (regression with blur class map) tasks. Specifically, we generate the discretized image residual errors by identifying the blur pixels and then transform them to a continuous form, which is computationally more efficient than naively solving the original regression problem with continuous values. Here, we found that the discretization result, i.e., blur segmentation map, remarkably exhibits visual similarity with the image residual errors. As a result, our efficient model shows comparable performance to state-of-the-art methods in realistic benchmarks, while our method is up to 10 times computationally more efficient. This paper presents a novel deblurring scheme that decomposes the regression task into two simpler tasks: blur pixel discretization (classifying blur at the pixel level) and discrete-to-continuous conversion (regression guided by a blur class map). This approach is more computationally efficient than directly solving the regression problem. With the increasing demand for efficient deblurring models that can handle large motion in high-resolution images, particularly on resource-constrained devices, this paper addresses the need for efficient and effective deblurring solutions. The authors propose a two-stage model. First, a blur pixel discretizer generates a blur segmentation map reflecting image residual errors. Second, a discrete-to-continuous (D2C) converter transforms this map into a continuous form to refine the deblurred image. The method leverages the logarithmic Fourier space to simplify the relationship between blurred and sharp images during training. The proposed method achieves competitive deblurring results compared to state-of-the-art methods while being up to 10 times more computationally efficient. The generated blur segmentation map, acting as a form of ground truth, significantly improves deblurring performance, especially for efficient models. The method shows promising results in both objective evaluations on standard benchmarks and visual comparisons against commercial deblurring applications. The model's performance might be affected by using different datasets for training the blur pixel discretizer and the D2C converter due to variations in image characteristics and blur types. Further acceleration is possible by deploying the model on NPUs instead of GPUs to enhance its real-time applicability on mobile devices. image deblurring, motion blur, efficient deep learning, blur segmentation, discrete-to-continuous conversion
2404.12154 Report StyleBooth: Image Style Editing with Multimodal Instruction Zhen Han, Chaojie Mao, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang Given an original image, image editing aims to generate an image that aligns with the provided instruction. The challenges are accepting multimodal inputs as instructions and the scarcity of high-quality training data, including crucial triplets of source/target image pairs and multimodal (text and image) instructions. In this paper, we focus on image style editing and present StyleBooth, a method that proposes a comprehensive framework for image editing and a feasible strategy for building a high-quality style editing dataset. We integrate the encoded textual instruction and image exemplar as a unified condition for the diffusion model, enabling the editing of the original image following multimodal instructions. Furthermore, by iterative style-destyle tuning and editing and usability filtering, the StyleBooth dataset provides content-consistent stylized/plain image pairs in various categories of styles. To show the flexibility of StyleBooth, we conduct experiments on diverse tasks, such as text-based style editing, exemplar-based style editing and compositional style editing. The results demonstrate that the quality and variety of training data significantly enhance the ability to preserve content and improve the overall quality of generated images in editing tasks. Project page can be found at https://ali-vilab.github.io/stylebooth-page/. This paper introduces StyleBooth, a novel approach for image style editing that leverages multimodal instructions, encompassing both textual descriptions and exemplar images. Existing image editing methods often struggle to handle both text and image-based instructions effectively or lack sufficient training data with diverse and high-quality examples. This work addresses these limitations. StyleBooth employs a unified conditioning scheme for diffusion models, enabling the integration of text and image exemplars as instructions. It uses a novel dataset construction pipeline based on iterative style-destyle tuning and usability filtering to ensure high-quality training data. StyleBooth achieves state-of-the-art performance in text-based style editing, outperforming baselines in terms of accuracy and user preference. It excels in exemplar-based style editing, accurately transferring styles from exemplars while preserving content fidelity better than competing methods. The multimodal instruction mechanism allows for compositional style editing, enabling users to blend and interpolate styles from different sources. The current dataset, although diverse, is primarily built upon textual descriptions of styles, potentially limiting the range of styles covered. Future work will focus on expanding the dataset with a broader spectrum of styles and exploring additional image editing tasks beyond style editing. image style editing, multimodal instruction, diffusion models, dataset generation, style composition
2404.11958 Report Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation Song Wang, Jiawei Yu, Wentong Li, Wenyu Liu, Xiaolu Liu, Junbo Chen, Jianke Zhu Semantic scene completion, also known as semantic occupancy prediction, can provide dense geometric and semantic information for autonomous vehicles, which attracts increasing attention from both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat each voxel equally in 3D space during training. Because hard voxels have not received enough attention, performance in some challenging regions is limited. The 3D dense space typically contains a large number of empty voxels, which are easy to learn but require large amounts of computation because existing models handle all voxels uniformly. Furthermore, the voxels in the boundary region are more challenging to differentiate than those in the interior. In this paper, we propose the HASSC approach to train semantic scene completion models with a hardness-aware design. The global hardness from the network optimization process is defined for dynamic hard voxel selection. Then, the local hardness with geometric anisotropy is adopted for voxel-wise refinement. In addition, a self-distillation strategy is introduced to make the training process stable and consistent. Extensive experiments show that our HASSC scheme can effectively improve the accuracy of the baseline model without incurring extra inference cost. Source code is available at: https://github.com/songw-zju/HASSC. This paper introduces HASSC, a hardness-aware semantic scene completion scheme designed to enhance the performance of existing methods in challenging regions. Existing semantic scene completion methods treat all voxels equally during training, neglecting the varying difficulty in predicting different voxels. This leads to suboptimal performance in challenging regions, especially for vision-centric methods. HASSC utilizes a hard voxel mining (HVM) head that identifies hard voxels based on global hardness (prediction uncertainty) and local hardness (geometric anisotropy). A refinement module then focuses on these hard voxels, improving their prediction accuracy. Additionally, a self-distillation strategy enhances training stability and consistency. HASSC consistently improves the accuracy of various baseline models, including VoxFormer and StereoScene, on the SemanticKITTI benchmark. The method shows significant improvements at closer ranges, crucial for autonomous driving safety. HASSC achieves these gains without incurring additional computational costs during inference. The performance gap between camera-based and LiDAR-based methods remains significant in the full range. The method's performance is limited by inaccurate geometry estimation and the long-tail distribution of certain object categories. Future work will explore incorporating neural radiance fields (NeRFs) to improve geometric and semantic understanding from image sequences. semantic scene completion, hard voxel mining, self-distillation, autonomous driving, 3d vision
2404.11949 Report Sketch-guided Image Inpainting with Partial Discrete Diffusion Process Nakul Sharma, Aditay Tripathi, Anirban Chakraborty, Anand Mishra In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the object's shape and pose to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed novel transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby, enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from the MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at https://github.com/vl2g/Sketch-Inpainting . This paper introduces a novel method for sketch-guided image inpainting using a partial discrete diffusion process (PDDP), allowing users to control the shape and pose of inpainted objects. Existing image inpainting methods often lack control over the semantic details and visual attributes of the inpainted regions. This work provides a solution by incorporating sketch guidance, offering greater user control and addressing a gap in the field. The method involves training a two-stage model. The first stage learns a discrete latent space of images. The second stage utilizes this latent space to perform sketch-guided inpainting using a novel PDDP and a sketch-guided bi-directional transformer. The proposed method outperforms existing image inpainting approaches adapted for sketch guidance, achieving state-of-the-art results on a curated MS-COCO dataset. The model effectively utilizes visual information from hand-drawn sketches, resulting in inpainted images with high visual fidelity and faithfulness to the query sketches. User studies confirm the superiority of the proposed method, with participants preferring its generated inpainted images for their naturalness, visual fidelity, and alignment with the input sketches. The quality of inpainted images can be further improved, particularly in capturing intricate object details. The current sketch embedding method could be enhanced to better represent stroke-level details and improve conditioning mechanisms. image inpainting, sketch guidance, discrete diffusion models, bidirectional transformer, generative models
2404.11936 Report LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights Thibault Castells, Hyoung-Kyu Song, Bo-Kyeong Kim, Shinkook Choi Latent Diffusion Models (LDMs) have emerged as powerful generative models, known for delivering remarkable results under constrained computational resources. However, deploying LDMs on resource-limited devices remains a complex issue, presenting challenges such as memory consumption and inference speed. To address this issue, we introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing LDMs. Traditional pruning methods for deep neural networks are not tailored to the unique characteristics of LDMs, such as the high computational cost of training and the absence of a fast, straightforward and task-agnostic method for evaluating model performance. Our method tackles these challenges by leveraging the latent space during the pruning process, enabling us to effectively quantify the impact of pruning on model performance, independently of the task at hand. This targeted pruning of components with minimal impact on the output allows for faster convergence during training, as the model has less information to re-learn, thereby addressing the high computational cost of training. Consequently, our approach achieves a compressed model that offers improved inference speed and reduced parameter count, while maintaining minimal performance degradation. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG). Notably, we reduce the inference time of Stable Diffusion (SD) by 34.9% while simultaneously improving its FID by 5.2% on MS-COCO T2I benchmark. This work paves the way for more efficient pruning methods for LDMs, enhancing their applicability. This paper introduces LD-Pruner, a novel structured pruning method for compressing Latent Diffusion Models (LDMs) while preserving performance. Deploying LDMs on resource-limited devices is challenging due to memory consumption and inference speed. Existing pruning methods are not tailored to LDMs and lack efficient, task-agnostic performance evaluation. LD-Pruner leverages the latent space to evaluate the impact of pruning individual operators (e.g., convolutional layers) on model performance. It modifies each operator, generates latent representations, and quantifies the divergence from the original representations using a novel scoring formula. Achieves 34.9% inference speedup with 5.2% FID improvement on text-to-image generation (Stable Diffusion) compared to the original model. Demonstrates successful compression and performance preservation for unconditional image generation (LDM-4) and unconditional audio generation (AudioDiffusion). Highlights the importance of weight preservation during pruning for faster and better fine-tuning. The current method does not prune the decoder part of LDMs. It does not explicitly account for potential dependencies between operators during pruning. latent diffusion models, model compression, pruning, task-agnostic, latent space
2404.11925 Report EdgeFusion: On-Device Text-to-Image Generation Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices. This paper presents EdgeFusion, an optimized Stable Diffusion model for fast text-to-image generation on resource-limited devices, achieving under one second latency on Samsung Exynos NPU. Stable Diffusion models are computationally expensive, hindering their deployment on edge devices. EdgeFusion addresses this by reducing sampling steps, optimizing architecture, and employing efficient deployment strategies. The study leverages Block-removed Knowledge-distilled SDM and Latent Consistency Model for model compression and step reduction. It utilizes high-quality synthetic image-text pairs for improved training and employs model-level tiling and quantization for efficient NPU deployment. EdgeFusion generates high-quality images in just two steps with latency under one second on edge devices. Using high-quality synthetic data significantly improves generation quality and text-image alignment compared to using solely LAION datasets. The proposed advanced distillation process, including fine-tuning the student model with a superior teacher and using the original large model during LCM training, significantly enhances few-step generation quality. The study primarily focuses on the Samsung Exynos NPU, potentially limiting the generalizability of findings to other edge devices. Further investigation into the trade-off between dataset size and quality, particularly for manual curation, is needed. text-to-image generation, stable diffusion, edge computing, model compression, knowledge distillation
2404.11895 Report FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models Wei Wu, Qingnan Fan, Shuai Qin, Hong Gu, Ruoyu Zhao, Antoni B. Chan Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Despite excellent methods leveraging attention mechanisms that have been developed to refine the editing guidance, these approaches necessitate modifications through complex network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus brings excessive low-frequency signals for editing. Leveraging this insight, we introduce a novel fine-tuning free approach that employs progressive Frequency truncation to refine the guidance of Diffusion models for universal editing tasks (FreeDiff). Our method achieves comparable results with state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications. This paper introduces FreeDiff, a novel fine-tuning free approach that refines the guidance of diffusion models for universal editing tasks by employing progressive frequency truncation. Existing text-guided image editing methods often struggle with misalignment between the intended precise editing target regions and the broader area impacted by the guidance, while attention manipulation methods lack versatility and generality. FreeDiff leverages the observation that the denoising network in diffusion models prioritizes learning frequency components in correlation with the noise level across timesteps. It then employs progressive frequency truncation on the guidance in the frequency space during the image generation process. FreeDiff achieves comparable results with state-of-the-art methods across various editing tasks on a diverse set of images. Analysis of intermediate features during diffusion confirms that the network prioritizes low-frequency components, explaining the misalignment in editing. Ablation studies confirm the effectiveness of progressive frequency truncation and its sensitivity to editing prompts. FreeDiff's performance relies on successful image reconstruction and can be sensitive to editing prompts, especially those describing non-target regions. Future work includes exploring the combination of FreeDiff with attention manipulation techniques for enhanced control. diffusion models, image editing, frequency truncation, text-guided image editing, guidance refinement
2404.11824 Report TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation Tianyi Liang, Jiangqi Liu, Sicheng Song, Shiqi Jiang, Yifei Huang, Changbo Wang, Chenhui Li Recent advancements in Text-to-image (T2I) generation have witnessed a shift from adapting text to fixed backgrounds to creating images around text. Traditional approaches are often limited to generating layouts within static images for effective text placement. Our proposed approach, TextCenGen, introduces a dynamic adaptation of the blank region for text-friendly image generation, emphasizing text-centric design and visual harmony generation. Our method employs force-directed attention guidance in T2I models to generate images that strategically reserve whitespace for pre-defined text areas, even for text or icons at the golden ratio. Observing how cross-attention maps affect object placement, we detect and repel conflicting objects using a force-directed graph approach, combined with a Spatial Excluding Cross-Attention Constraint for smooth attention in whitespace areas. On this novel graphic design task, experiments indicate that TextCenGen outperforms existing methods, producing more harmonious compositions. Furthermore, our method significantly enhances T2I model outcomes on our specially collected prompt datasets, catering to varied text positions. These results demonstrate the efficacy of TextCenGen in creating more harmonious and integrated text-image compositions. TextCenGen is a novel, training-free framework for text-centric text-to-image generation. It dynamically adapts image composition around predefined text regions for visually harmonious text integration, addressing a gap in existing methods that struggle with text-background conflicts. Effective text-image synergy is crucial in graphic design. Traditional methods often result in text-background competition. TextCenGen addresses this by prioritizing text placement in image generation, ensuring clear communication and aesthetic appeal. TextCenGen utilizes cross-attention maps and force-directed graphs to guide object placement during the denoising process of text-to-image generation. It identifies and relocates objects conflicting with designated text regions and applies a spatial constraint for smooth attention in those areas. TextCenGen outperforms existing state-of-the-art methods in quantitative metrics, demonstrating superior performance in background smoothness, saliency harmony, and semantic fidelity. Qualitative analysis highlights TextCenGen's ability to create more natural and harmonious text layouts while preserving image content and quality. The ablation study confirms the significant contribution of both the Force-Directed Cross-Attention Guidance and Spatial Excluding Cross-Attention Constraint in achieving these results. The assumption of convex object shapes in the force-directed guidance may not be suitable for all scenarios. The generation of unintended objects in blank areas requires further investigation and refinement. text-to-image generation, text-centric design, force-directed attention, cross-attention maps, graphic design
2404.11778 Report CU-Mamba: Selective State Space Models with Channel Learning for Image Restoration Rui Deng, Tianpei Gu Reconstructing degraded images is a critical task in image processing. Although CNN and Transformer-based models are prevalent in this field, they exhibit inherent limitations, such as inadequate long-range dependency modeling and high computational costs. To overcome these issues, we introduce the Channel-Aware U-Shaped Mamba (CU-Mamba) model, which incorporates a dual State Space Model (SSM) framework into the U-Net architecture. CU-Mamba employs a Spatial SSM module for global context encoding and a Channel SSM component to preserve channel correlation features, both in linear computational complexity relative to the feature map size. Extensive experimental results validate CU-Mamba's superiority over existing state-of-the-art methods, underscoring the importance of integrating both spatial and channel contexts in image restoration. Introduced Channel-Aware U-Shaped Mamba (CU-Mamba), integrating dual State Space Models (SSM) within a U-Net for image restoration, capturing long-range dependencies and preserving channel correlations. Addresses limitations of CNNs (limited receptive fields) and Transformers (high computational cost) in image restoration by efficiently encoding global and channel-specific features. Employs Spatial SSM for global context encoding in linear complexity and Channel SSM to enhance feature mixing across channels within the U-Net architecture. Outperforms state-of-the-art methods on image denoising (SIDD, DND) and deblurring (GoPro, HIDE, RealBlur-R, RealBlur-J) benchmarks. Demonstrates faster inference speed compared to Transformer-based methods while achieving superior restoration quality. Ablation studies validate the effectiveness of both Spatial and Channel SSM modules in enhancing model performance. Exploration of alternative SSM discretization techniques for potential performance improvement. Investigating the application of CU-Mamba to other image restoration tasks beyond denoising and deblurring. image restoration, state space models, u-net, channel learning, deep learning
2404.11615 Report Factorized Diffusion: Perceptual Illusions by Noise Decomposition Daniel Geng, Inbum Park, Andrew Owens Given a factorization of an image into a sum of linear components, we present a zero-shot method to control each individual component through diffusion model sampling. For example, we can decompose an image into low and high spatial frequencies and condition these components on different text prompts. This produces hybrid images, which change appearance depending on viewing distance. By decomposing an image into three frequency subbands, we can generate hybrid images with three prompts. We also use a decomposition into grayscale and color components to produce images whose appearance changes when they are viewed in grayscale, a phenomenon that naturally occurs under dim lighting. And we explore a decomposition by a motion blur kernel, which produces images that change appearance under motion blurring. Our method works by denoising with a composite noise estimate, built from the components of noise estimates conditioned on different prompts. We also show that for certain decompositions, our method recovers prior approaches to compositional generation and spatial control. Finally, we show that we can extend our approach to generate hybrid images from real images. We do this by holding one component fixed and generating the remaining components, effectively solving an inverse problem. This paper introduces Factorized Diffusion, a zero-shot method to control individual components of an image during generation with diffusion models by manipulating the noise estimates for different image decompositions. This method enables the creation of various perceptual illusions, like hybrid images that change with viewing distance, color hybrids that change under different lighting, and motion hybrids that change when blurred, offering insights into human and machine perception. The method decomposes an image into components (e.g., frequency bands, color spaces) and generates separate noise estimates for each component conditioned on different prompts. These noise estimates are then combined to guide the denoising process, resulting in an image where each component reflects its corresponding prompt. Factorized Diffusion successfully synthesizes hybrid images that outperform traditional methods in quality and alignment with prompts, as evaluated through human studies and CLIP score. The method generalizes to other decompositions, generating color hybrids and motion hybrids, showcasing its ability to create new classes of perceptual illusions. The technique can be extended to solve inverse problems by fixing one component and generating others, demonstrated by creating hybrid images from real images and performing text-guided colorization. The success rate of generating high-quality illusions can be low due to the out-of-distribution nature of the generated images and the lack of control over prompt interactions. Future work includes improving robustness, exploring other decompositions, and addressing ethical considerations related to the generation of potentially deceptive content. diffusion models, perceptual illusions, hybrid images, image decomposition, text-conditional image generation
2404.11614 Report Dynamic Typography: Bringing Text to Life via Video Diffusion Prior Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, Huamin Qu Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which combines two challenging tasks. It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. Our technique harnesses vector graphics representations and an end-to-end optimization-based framework. This framework employs neural displacement fields to convert letters into base shapes and applies per-frame motion, encouraging coherence with the intended textual concept. Shape preservation techniques and perceptual loss regularization are employed to maintain legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our end-to-end methodology over baseline methods, which might comprise separate tasks. Through quantitative and qualitative evaluations, we demonstrate the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability. Our code is available at: https://animate-your-word.github.io/demo/. This paper introduces "Dynamic Typography," an automated system that animates individual letters within words based on user prompts, deforming them to embody semantic meaning while maintaining legibility. This technique addresses the challenge of creating semantically aware and visually engaging text animations, a task typically requiring significant design and animation expertise. The system uses an end-to-end optimization framework with two neural displacement fields: one for shaping the letter to reflect the prompt's meaning and another for applying per-frame motion. It leverages score-distillation sampling with a text-to-video model, incorporates legibility regularization using LPIPS, and employs mesh-based structure preservation to maintain visual consistency. The generated animations accurately and aesthetically interpret text prompts while preserving letter readability. Quantitative evaluation demonstrates superiority over baseline methods in maintaining legibility and prompt-video alignment. The framework is generalizable across different text-to-video models, allowing for improvements with future advancements in video generation. Motion quality is limited by the capabilities of the video foundation model. Balancing semantic accuracy with legibility becomes challenging when prompts significantly deviate from original letter forms. text animation, kinetic typography, video diffusion prior, svg, text-to-video generation
2404.11613 Report InFusion: Inpainting 3D Gaussians via Learning Depth Completion from Diffusion Prior Zhiheng Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jie Xiao, Kai Zhu, Nan Xue, Yu Liu, Yujun Shen, Yang Cao 3D Gaussians have recently emerged as an efficient representation for novel view synthesis. This work studies its editability with a particular focus on the inpainting task, which aims to supplement an incomplete set of 3D Gaussians with additional points for visually harmonious rendering. Compared to 2D inpainting, the crux of inpainting 3D Gaussians is to figure out the rendering-relevant properties of the introduced points, whose optimization largely benefits from their initial 3D positions. To this end, we propose to guide the point initialization with an image-conditioned depth completion model, which learns to directly restore the depth map based on the observed image. Such a design allows our model to fill in depth values at an aligned scale with the original depth, and also to harness strong generalizability from large-scale diffusion prior. Thanks to the more accurate depth completion, our approach, dubbed InFusion, surpasses existing alternatives with sufficiently better fidelity and efficiency under various complex scenarios. We further demonstrate the effectiveness of InFusion with several practical applications, such as inpainting with user-specific texture or with novel object insertion. Presents InFusion, a novel approach for inpainting 3D Gaussian representations by leveraging depth completion learned from diffusion priors, enabling efficient and photorealistic editing of 3D scenes. Addresses the limitations of existing 3D Gaussian inpainting methods that often produce blurry textures or misaligned depth, hindering the seamless integration of edited elements. Inpaints the reference image and depth map, unprojects them to initialize 3D points, and fine-tunes the Gaussian model using a diffusion-based depth completion model trained on a large-scale dataset. Achieves superior image quality with sharper textures and better 3D consistency compared to baseline methods. Demonstrates significant speed improvements, being up to 20 times faster than existing techniques. Enables practical applications such as user-interactive texture editing and object insertion. Faces challenges in scenarios with significant lighting variations across different views, leading to inconsistencies in the inpainted regions. Limited in text-guided inpainting of highly complex objects within 360-degree scenes due to the current constraints of inpainting models. gaussian splatting, 3d inpainting, depth completion, diffusion models, novel view synthesis
2404.11593 Report IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination Xi Chen, Sida Peng, Dongchen Yang, Yuan Liu, Bowen Pan, Chengfei Lv, Xiaowei Zhou This paper aims to recover object materials from posed images captured under an unknown static lighting condition. Recent methods solve this task by optimizing material parameters through differentiable physically based rendering. However, due to the coupling between object geometry, materials, and environment lighting, there is inherent ambiguity during the inverse rendering process, preventing previous methods from obtaining accurate results. To overcome this ill-posed problem, our key idea is to learn the material prior with a generative model for regularizing the optimization process. We observe that the general rendering equation can be split into diffuse and specular shading terms, and thus formulate the material prior as diffusion models of albedo and specular. Thanks to this design, our model can be trained using the existing abundant 3D object data, and naturally acts as a versatile tool to resolve the ambiguity when recovering material representations from RGB images. In addition, we develop a coarse-to-fine training strategy that leverages estimated materials to guide diffusion models to satisfy multi-view consistent constraints, leading to more stable and accurate results. Extensive experiments on real-world and synthetic datasets demonstrate that our approach achieves state-of-the-art performance on material recovery. The code will be available at https://zju3dv.github.io/IntrinsicAnything. This paper introduces IntrinsicAnything, a novel method that leverages diffusion models to learn material priors for single-view inverse rendering under unknown lighting, effectively addressing the inherent ambiguities in material and lighting decomposition. Inverse rendering under unknown lighting is crucial for various applications like VR/AR and video games but suffers from inherent ambiguities that hinder accurate material recovery. IntrinsicAnything utilizes conditional diffusion models to learn priors for albedo and specular shading. It employs a two-stage optimization process: first recovering coarse material and lighting, then using them to guide the diffusion model for refined, multi-view consistent results. IntrinsicAnything achieves state-of-the-art performance on both synthetic and real-world datasets, outperforming existing optimization-based and data-driven methods. The method effectively disentangles materials and lighting, avoiding common issues like baking shadows or shading into the albedo. IntrinsicAnything demonstrates strong generalization capabilities, enabling high-quality single-view intrinsic image decomposition for diverse objects and scenes, including challenging in-the-wild images. The current method doesn't handle transparent objects, necessitating further exploration of geometry representations and joint optimization. Performance relies on the accuracy of reconstructed geometry, suggesting future research on using diffusion models for improved geometry priors in 3D reconstruction. inverse rendering, diffusion models, material prior, generative models, single-view reconstruction
2404.11589 Report Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding Zezhong Fan, Xiaohan Li, Chenhao Fang, Topojoy Biswas, Kaushiki Nag, Jianpeng Xu, Kannan Achan The rapid evolution of text-to-image diffusion models has opened the door of generative AI, enabling the translation of textual descriptions into visually compelling images with remarkable quality. However, a persistent challenge within this domain is the optimization of prompts to effectively convey abstract concepts into concrete objects. For example, text encoders can hardly express "peace", while they can easily illustrate olive branches and white doves. This paper introduces a novel approach named Prompt Optimizer for Abstract Concepts (POAC) specifically designed to enhance the performance of text-to-image diffusion models in interpreting and generating images from abstract concepts. We propose a Prompt Language Model (PLM), which is initialized from a pre-trained language model, and then fine-tuned with a curated dataset of abstract concept prompts. The dataset is created with GPT-4 to extend the abstract concept to a scene and concrete objects. Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between the generated images by a stable diffusion model and optimized prompts. Through extensive experiments, we demonstrate that our proposed POAC significantly improves the accuracy and aesthetic quality of generated images, particularly in the description of abstract concepts and alignment with optimized prompts. We also present a comprehensive analysis of our model's performance across diffusion models under different settings, showcasing its versatility and effectiveness in enhancing abstract concept representation. This paper introduces POAC (Prompt Optimizer for Abstract Concepts) to improve how text-to-image models understand and generate images from abstract concepts. Existing text-to-image models struggle to depict abstract ideas because they are trained mainly on concrete objects and lack a mapping between abstract and concrete representations. POAC uses a two-stage approach: 1) It fine-tunes a Prompt Language Model (PLM) to rewrite prompts containing abstract concepts into prompts with concrete objects using GPT-4 and a curated dataset. 2) It uses Reward Feedback Learning (ReFL) to fine-tune a Stable Diffusion XL model to align with the optimized prompts and improve image quality. POAC enables the generation of images that are more faithful to abstract concept prompts, including relevant concrete details. Fine-tuning with ReFL further improves the alignment between optimized prompts and generated images, leading to more accurate depictions. Quantitative evaluation shows improvements in both relevance and aesthetic scores of generated images compared to baseline SDXL. Future work will address broader alignment challenges beyond abstract concepts, such as mitigating biases in generated images. The authors will explore improving the prompt language model to optimize for balanced and fair representations across different demographic groups. image generation, diffusion models, prompt optimization, abstract concepts, reinforcement learning
2404.11554 Report Predicting Long-horizon Futures by Conditioning on Geometry and Time Tarasha Khurana, Deva Ramanan Our work explores the task of generating future sensor observations conditioned on the past. We are motivated by "predictive coding" concepts from neuroscience as well as robotic applications such as self-driving vehicles. Predictive video modeling is challenging because the future may be multi-modal and learning at scale remains computationally expensive for video processing. To address both challenges, our key insight is to leverage the large-scale pretraining of image diffusion models which can handle multi-modality. We repurpose image models for video prediction by conditioning on new frame timestamps. Such models can be trained with videos of both static and dynamic scenes. To allow them to be trained with modestly-sized datasets, we introduce invariances by factoring out illumination and texture by forcing the model to predict (pseudo) depth, readily obtained for in-the-wild videos via off-the-shelf monocular depth networks. In fact, we show that simply modifying networks to predict grayscale pixels already improves the accuracy of video prediction. Given the extra controllability with timestamp conditioning, we propose sampling schedules that work better than the traditional autoregressive and hierarchical sampling strategies. Motivated by probabilistic metrics from the object forecasting literature, we create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes and a large vocabulary of objects. Our experiments illustrate the effectiveness of learning to condition on timestamps, and show the importance of predicting the future with invariant modalities. This paper presents a video prediction model that leverages pre-trained 2D image diffusion models and incorporates timestamp conditioning to generate future frames from past observations. Predicting future sensor observations is crucial for robotics applications like self-driving, and this work offers an efficient solution by repurposing readily available image diffusion models for video prediction. The method involves fine-tuning pre-trained image diffusion models by adding (1) conditioning on input context frames using a two-stream approach with CLIP embeddings and (2) conditioning on frame timestamps using positional encoding. The model is trained to predict frames at random timestamps, enabling flexible sampling strategies at inference time. The model outperforms state-of-the-art video prediction methods on short-horizon forecasting tasks. Introducing invariances in the data, such as using pseudo-depth or luminance instead of RGB, significantly improves performance. The proposed mixed sampling strategy, enabled by timestamp conditioning, outperforms traditional autoregressive and hierarchical sampling for long-horizon forecasting. The model exhibits bias towards hallucinating commonly seen object categories like people and cars due to dataset bias. The generated pseudo-depth lacks high-frequency details, potentially due to the limitations of neural networks in modeling such functions. video prediction, diffusion models, timestamp conditioning, pseudo-depth, forecasting
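Timestamp conditioning can be illustrated with a small sketch: the target frame's (relative) timestamp is encoded sinusoidally and appended to the context conditioning passed to the denoiser. The tensor shapes, the assumption of an even embedding dimension, and the concatenation point are illustrative choices, not the paper's exact interface.

```python
import math
import torch

def timestamp_embedding(timestamps, dim=256, max_period=10000.0):
    """Sinusoidal encoding of the target-frame timestamp (dim assumed even)."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = timestamps.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)       # (B, dim)

def build_condition(context_features, target_timestamps):
    """Append the timestamp code to context-frame features (e.g. CLIP embeddings)
    so the denoiser knows which future time it is being asked to synthesize."""
    ts_emb = timestamp_embedding(target_timestamps, dim=context_features.shape[-1])
    return torch.cat([context_features, ts_emb[:, None, :]], dim=1)    # (B, tokens + 1, D)
```

Because the timestamp is an explicit input, the same model can be queried for any future offset, which is what enables the mixed sampling schedules described above.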
2404.11475 Report AdaIR: Exploiting Underlying Similarities of Image Restoration Tasks with Adapters Hao-Wei Chen, Yu-Syuan Xu, Kelvin C. K. Chan, Hsien-Kai Kuo, Chun-Yi Lee, Ming-Hsuan Yang Existing image restoration approaches typically employ extensive networks specifically trained for designated degradations. Despite being effective, such methods inevitably entail considerable storage costs and computational overheads due to the reliance on task-specific networks. In this work, we go beyond this well-established framework and exploit the inherent commonalities among image restoration tasks. The primary objective is to identify components that are shareable across restoration tasks and augment the shared components with modules specifically trained for individual tasks. Towards this goal, we propose AdaIR, a novel framework that enables low storage cost and efficient training without sacrificing performance. Specifically, a generic restoration network is first constructed through self-supervised pre-training using synthetic degradations. Subsequent to the pre-training phase, adapters are trained to adapt the pre-trained network to specific degradations. AdaIR requires solely the training of lightweight, task-specific modules, ensuring a more efficient storage and training regimen. We have conducted extensive experiments to validate the effectiveness of AdaIR and analyze the influence of the pre-training strategy on discovering shareable components. Extensive experimental results show that AdaIR achieves outstanding results on multi-task restoration while utilizing significantly fewer parameters (1.9 MB) and less training time (7 hours) for each restoration task. The source codes and trained models will be released. This paper proposes AdaIR, a novel framework for multi-task image restoration that leverages adapters for efficient adaptation to unseen degradations. Existing image restoration methods often rely on separate, computationally expensive models for each degradation type. AdaIR aims to improve efficiency by exploiting shareable components across restoration tasks. AdaIR uses a two-phase training strategy. First, a generic restoration network (using Restormer architecture) is pre-trained with synthetic degradations. Second, lightweight adapters are fine-tuned to adapt the pre-trained model to specific degradation tasks. AdaIR achieves comparable performance to state-of-the-art multi-task restoration methods like Restormer and PromptIR. AdaIR demonstrates significant reduction in training time (7 hours per task) and trainable parameters (1.9MB) compared to training from scratch. Analysis of pre-training strategies shows that using diverse degradations during pre-training improves performance on downstream tasks. The performance gap between AdaIR and other methods is smaller on simpler tasks, suggesting potential limitations in handling complex degradation types. Further research could explore different adapter architectures and pre-training schemes to improve performance on highly challenging degradations. image restoration, multi-task learning, adapter, parameter-efficient tuning, low-level vision
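The adapter idea, a frozen shared backbone with tiny trainable per-task modules, looks roughly like the sketch below. This is a generic bottleneck adapter for illustration, not AdaIR's exact module or its placement inside Restormer.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter: only these parameters are trained per task."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)
        nn.init.zeros_(self.up.weight)      # start as identity, preserving pre-trained behavior
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                   # x: (..., dim) token features
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """A frozen pre-trained block followed by a task-specific adapter."""
    def __init__(self, pretrained_block, dim):
        super().__init__()
        self.block = pretrained_block
        for p in self.block.parameters():
            p.requires_grad_(False)         # the shared component stays frozen
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))
```

Only the adapter weights are stored per degradation type, which is how the per-task footprint stays in the megabyte range while the shared restoration backbone is reused across tasks.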
2404.11419 Report SLAIM: Robust Dense Neural SLAM for Online Tracking and Mapping Vincent Cartillier, Grant Schindler, Irfan Essa We present SLAIM - Simultaneous Localization and Implicit Mapping. We propose a novel coarse-to-fine tracking model tailored for Neural Radiance Field SLAM (NeRF-SLAM) to achieve state-of-the-art tracking performance. Notably, existing NeRF-SLAM systems consistently exhibit inferior tracking performance compared to traditional SLAM algorithms. NeRF-SLAM methods solve camera tracking via image alignment and photometric bundle-adjustment. Such optimization processes are difficult to optimize due to the narrow basin of attraction of the optimization loss in image space (local minima) and the lack of initial correspondences. We mitigate these limitations by implementing a Gaussian pyramid filter on top of NeRF, facilitating a coarse-to-fine tracking optimization strategy. Furthermore, NeRF systems encounter challenges in converging to the right geometry with limited input views. While prior approaches use a Signed-Distance Function (SDF)-based NeRF and directly supervise SDF values by approximating ground truth SDF through depth measurements, this often results in suboptimal geometry. In contrast, our method employs a volume density representation and introduces a novel KL regularizer on the ray termination distribution, constraining scene geometry to consist of empty space and opaque surfaces. Our solution implements both local and global bundle-adjustment to produce a robust (coarse-to-fine) and accurate (KL regularizer) SLAM solution. We conduct experiments on multiple datasets (ScanNet, TUM, Replica) showing state-of-the-art results in tracking and in reconstruction accuracy. SLAIM, a novel coarse-to-fine tracking model for NeRF-SLAM achieving state-of-the-art tracking performance. Existing NeRF-SLAM systems have inferior tracking performance compared to traditional SLAM algorithms due to the narrow basin of attraction in image alignment and lack of initial correspondences. Implements a Gaussian pyramid filter on top of NeRF for coarse-to-fine tracking, and introduces a KL regularizer on the ray termination distribution to constrain scene geometry. Achieves state-of-the-art tracking performance on ScanNet and TUM datasets. Shows superior reconstruction accuracy on Replica dataset compared to previous NeRF-SLAM methods. Demonstrates the effectiveness of the Gaussian Pyramid filter and the custom KL regularizer through ablation studies. Struggles with reconstructing completely unobserved regions. Performance slightly degrades with Gaussian Pyramid levels beyond a certain threshold. slam, nerf, 3d reconstruction, tracking, computer vision
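The KL regularizer on the ray termination distribution can be sketched as follows: compute the standard volume-rendering weights along each ray and penalize their divergence from a distribution peaked at the observed depth, i.e. empty space followed by a single opaque surface. The exact target distribution used by SLAIM is an assumption here; a narrow Gaussian around the depth measurement is a plausible stand-in.

```python
import torch

def ray_termination_weights(sigmas, deltas):
    """Volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                         # (R, S)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    return trans * alphas

def termination_kl(sigmas, deltas, z_vals, depth_gt, band=0.05, eps=1e-8):
    """KL(target || w): push each ray to terminate near its measured depth."""
    w = ray_termination_weights(sigmas, deltas)
    w = w / w.sum(dim=-1, keepdim=True).clamp_min(eps) + eps
    target = torch.exp(-0.5 * ((z_vals - depth_gt[:, None]) / band) ** 2)
    target = target / target.sum(dim=-1, keepdim=True).clamp_min(eps) + eps
    return (target * (target.log() - w.log())).sum(dim=-1).mean()
```

Constraining the termination distribution in this way discourages semi-transparent "fog" along the ray, which is the failure mode the paper attributes to naive density-based NeRF-SLAM geometry.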
2404.11375 Report Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion Xinghan Wang, Zixi Kang, Yadong Mu Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments corresponding to given textual descriptions within untrimmed motion sequences. Capturing global temporal information is crucial for the THMG task. However, transformer-based models that rely on global temporal self-attention face challenges when handling long untrimmed sequences due to the quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost. The core of the model is a text-controlled selection mechanism which dynamically incorporates global temporal information based on text query. The model is further enhanced to be topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding. This paper introduces a new task called text-based human motion grounding (THMG) and proposes TM-Mamba, a novel state-space model with linear memory cost to address this task. THMG seeks to locate temporal segments in untrimmed motion sequences matching textual descriptions, which is crucial for real-world applications where actions occur sparsely within long sequences. Existing methods struggle with this due to quadratic memory requirements for handling long sequences. The authors propose TM-Mamba, which incorporates a text-controlled selection mechanism into the Mamba algorithm, allowing dynamic information propagation based on text queries to extract relevant global context. Additionally, relational embeddings are integrated to model the human skeleton's graph topology. A new dataset, BABEL-Grounding, is also introduced for evaluation. TM-Mamba outperforms baseline methods, including those adapted from video moment retrieval and those based on SSMs or graph convolutions, demonstrating its effectiveness for THMG. Ablation studies confirm the benefits of the text-controlled selection mechanism, bidirectional modeling, and relational embeddings. Analysis of memory consumption shows that TM-Mamba maintains linear memory usage with increasing sequence length, unlike transformer-based models which quickly run out of memory. The performance of TM-Mamba, though superior, degrades with increasing sequence length, suggesting further research on handling very long sequences. The current work focuses on single-person motion; future work could explore extending TM-Mamba to multi-person scenarios for grounding actions involving interactions. human motion analysis, temporal grounding, state space models, mamba, text-motion multi-modal learning
2404.11358 Report DeblurGS: Gaussian Splatting for Camera Motion Blur Jeongtaek Oh, Jaeyoung Chung, Dongwoo Lee, Kyoung Mu Lee Although significant progress has been made in reconstructing sharp 3D scenes from motion-blurred images, a transition to real-world applications remains challenging. The primary obstacle stems from the severe blur which leads to inaccuracies in the acquisition of initial camera poses through Structure-from-Motion, a critical aspect often overlooked by previous approaches. To address this challenge, we propose DeblurGS, a method to optimize sharp 3D Gaussian Splatting from motion-blurred images, even with the noisy camera pose initialization. We restore a fine-grained sharp scene by leveraging the remarkable reconstruction capability of 3D Gaussian Splatting. Our approach estimates the 6-Degree-of-Freedom camera motion for each blurry observation and synthesizes corresponding blurry renderings for the optimization process. Furthermore, we propose Gaussian Densification Annealing strategy to prevent the generation of inaccurate Gaussians at erroneous locations during the early training stages when camera motion is still imprecise. Comprehensive experiments demonstrate that our DeblurGS achieves state-of-the-art performance in deblurring and novel view synthesis for real-world and synthetic benchmark datasets, as well as field-captured blurry smartphone videos. DeblurGS: a novel method to reconstruct sharp 3D scenes from motion-blurred images using Gaussian Splatting, addressing the challenge of inaccurate camera pose initialization from SfM. Existing NeRF-based methods struggle with inaccurate camera poses common in real-world blurry images, limiting their practical application. Jointly optimizes 3D Gaussian Splatting and camera motion (trajectory and sub-frame alignment) from blurry inputs, utilizing a Gaussian Densification Annealing strategy for robust optimization under noisy pose initialization. Outperforms state-of-the-art methods in novel view synthesis and deblurring on benchmark datasets. Achieves high-quality deblurring even with noisy camera poses from SfM, unlike previous methods. Demonstrates successful application on real-world blurry videos captured by smartphones. Assumes a constant amount of blur throughout the exposure time. Future work includes investigating varying blur kernels within a single exposure. 3d gaussian splatting, camera motion deblurring, novel view synthesis, structure-from-motion, blurry image restoration
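The core objective, synthesizing the blur and comparing it with the blurry observation, can be sketched in a few lines. Averaging sharp renders along the estimated intra-exposure camera motion is a plausible stand-in for the paper's blur formation; `render` and `interpolate_pose` are placeholders for the 3DGS rasterizer and SE(3) pose interpolation, which are not reproduced here.

```python
import torch
import torch.nn.functional as F

def render(gaussians, pose):                 # placeholder: 3D Gaussian Splatting rasterizer
    raise NotImplementedError

def interpolate_pose(pose_start, pose_end, s):
    """Placeholder: in practice the 6-DoF motion is interpolated on SE(3)."""
    raise NotImplementedError

def blurry_render_loss(gaussians, pose_start, pose_end, blurry_image, n_sub=9):
    """Average sharp renders along the estimated exposure trajectory, then
    match the synthesized blur against the observed blurry frame."""
    renders = [render(gaussians, interpolate_pose(pose_start, pose_end, s))
               for s in torch.linspace(0.0, 1.0, n_sub)]
    synthesized_blur = torch.stack(renders).mean(dim=0)
    return F.l1_loss(synthesized_blur, blurry_image)
```

Both the Gaussians and the per-frame start/end poses receive gradients through this loss, which is how the method jointly refines the noisy SfM initialization and the sharp scene.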
2404.11207 Report Exploring the Transferability of Visual Prompting for Multimodal Large Language Models Yichi Zhang, Yinpeng Dong, Siyuan Zhang, Tianzan Min, Hang Su, Jun Zhu Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting where we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts, including 1) Feature Consistency Alignment: which imposes constraints to the prompted feature changes to maintain task-agnostic knowledge; 2) Task Semantics Enrichment: which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction. The paper proposes Transferable Visual Prompting (TVP), a method for adapting Multimodal Large Language Models (MLLMs) to downstream tasks using transferable visual prompts. This approach aims to improve the performance of various MLLMs with a single set of shared parameters, reducing computation and storage overheads compared to fine-tuning. Current adaptation methods for MLLMs require individual fine-tuning for each model, leading to significant resource demands. This work aims to develop a more efficient and flexible solution for adapting multiple MLLMs simultaneously. TVP integrates two key strategies: 1) Feature Consistency Alignment (FCA) to mitigate cross-model feature corruption by aligning prompted features with original features, preserving general knowledge; 2) Task Semantics Enrichment (TSE) to enhance task-specific information in visual prompts by leveraging CLIP's image-text alignment. TVP effectively improves the performance of 6 diverse MLLMs on 10 datasets across various tasks, including recognition, counting, reasoning, and hallucination correction. TVP demonstrates superior performance compared to existing visual prompting methods (VP and EVP), especially when transferring prompts to unseen models. Model ensembling further enhances the transferability of visual prompts, leading to even greater performance improvements. Transferability to models with significantly different architectures (e.g., different language models) remains challenging. TVP introduces additional computation overheads for forward passes through vision encoders compared to baseline visual prompting methods, but the increase is relatively small. multimodal large language models, visual prompting, transferability, parameter-efficient fine-tuning, model adaptation
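A minimal sketch of the visual-prompting setup and the Feature Consistency Alignment term: a learnable border is added to the input image and the prompted features are kept close to the un-prompted ones. The border parameterization, the MSE form of the constraint, the loss weight, and `task_loss_fn` are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BorderPrompt(nn.Module):
    """Learnable pixels on a border of width `pad` around the input image."""
    def __init__(self, image_size=224, pad=16):
        super().__init__()
        mask = torch.zeros(1, 1, image_size, image_size)
        mask[..., :pad, :] = 1
        mask[..., -pad:, :] = 1
        mask[..., :, :pad] = 1
        mask[..., :, -pad:] = 1
        self.register_buffer("mask", mask)
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, images):                     # images: (B, 3, H, W)
        return images + self.mask * self.delta

def tvp_losses(vision_encoder, prompt, images, task_loss_fn, w_fca=1.0):
    prompted = prompt(images)
    feat_prompted = vision_encoder(prompted)
    with torch.no_grad():
        feat_clean = vision_encoder(images)
    fca = F.mse_loss(feat_prompted, feat_clean)    # Feature Consistency Alignment
    return task_loss_fn(prompted) + w_fca * fca    # task term from the single training MLLM
```

Limiting how far the prompt pushes the encoder's features away from their original values is what keeps the learned prompt from overfitting to one model's feature space, so it transfers to other MLLMs.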
2404.11151 Report REACTO: Reconstructing Articulated Objects from a Single Video Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo, Guosheng Lin, Fayao Liu In this paper, we address the challenge of reconstructing general articulated 3D objects from a single video. Existing works employing dynamic neural radiance fields have advanced the modeling of articulated objects like humans and animals from videos, but face challenges with piece-wise rigid general articulated objects due to limitations in their deformation models. To tackle this, we propose Quasi-Rigid Blend Skinning, a novel deformation model that enhances the rigidity of each part while maintaining flexible deformation of the joints. Our primary insight combines three distinct approaches: 1) an enhanced bone rigging system for improved component modeling, 2) the use of quasi-sparse skinning weights to boost part rigidity and reconstruction fidelity, and 3) the application of geodesic point assignment for precise motion and seamless deformation. Our method outperforms previous works in producing higher-fidelity 3D reconstructions of general articulated objects, as demonstrated on both real and synthetic datasets. Project page: https://chaoyuesong.github.io/REACTO. This paper proposes REACTO, a novel method for reconstructing general articulated 3D objects from single casual videos by employing Quasi-Rigid Blend Skinning (QRBS) and a new rigging system defined on bones. Existing methods struggle to model the piece-wise rigidity and complex motion of general articulated objects in casual videos, often leading to artifacts and inaccuracies. REACTO defines a rig on bones for each rigid part and utilizes QRBS to combine the rigidity of Rigid Skinning with the flexibility of Dual Quaternion Blend Skinning. Geodesic distance is employed for precise point assignment to bones or joints. REACTO outperforms state-of-the-art methods in reconstructing detailed shapes and motions of articulated objects. QRBS effectively models the piece-wise rigidity and smooth deformation on the joints. Defining rig on bones enhances the rigidity and motion integrity of each component compared to defining rig on joints. Reconstruction quality may degrade on the unseen sides of objects due to partial views in casual videos. Future work could explore extending REACTO to handle more complex object interactions and occlusions. 3d reconstruction, articulated objects, single-view reconstruction, deformation modeling, quasi-rigid blend skinning
2404.11120 Report TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing Sherry X. Chen, Yaron Vaxman, Elad Ben Baruch, David Asulin, Aviad Moreshet, Kuo-Chin Lien, Misha Sra, Pradeep Sen Despite many attempts to leverage pre-trained text-to-image models (T2I) like Stable Diffusion (SD) for controllable image editing, producing good predictable results remains a challenge. Previous approaches have focused on either fine-tuning pre-trained T2I models on specific datasets to generate certain kinds of images (e.g., with a specific object or person), or on optimizing the weights, text prompts, and/or learning features for each input image in an attempt to coax the image generator to produce the desired result. However, these approaches all have shortcomings and fail to produce good results in a predictable and controllable manner. To address this problem, we present TiNO-Edit, an SD-based method that focuses on optimizing the noise patterns and diffusion timesteps during editing, something previously unexplored in the literature. With this simple change, we are able to generate results that both better align with the original images and reflect the desired result. Furthermore, we propose a set of new loss functions that operate in the latent domain of SD, greatly speeding up the optimization when compared to prior approaches, which operate in the pixel domain. Our method can be easily applied to variations of SD including Textual Inversion and DreamBooth that encode new concepts and incorporate them into the edited results. We present a host of image-editing capabilities enabled by our approach. Our code is publicly available at https://github.com/SherryXTChen/TiNO-Edit. This paper presents TiNO-Edit, a novel Stable Diffusion-based image editing method that optimizes noise patterns and diffusion timesteps for improved controllability and predictability. Controllable and predictable image editing with pre-trained text-to-image models remains a challenge, and this method aims to address this by exploring a previously unexplored area. The method optimizes the noise and timesteps used in the Stable Diffusion denoising process by minimizing a set of loss functions operating in the latent domain. TiNO-Edit demonstrates robust performance across various image editing tasks, including object replacement, addition, style transfer, stroke-based editing, and image composition. The method outperforms existing baselines in both qualitative and quantitative comparisons, showing better alignment with user intent and image content. By operating in the latent domain, the method offers significant computational advantages over pixel-domain optimization approaches. The reliance on CLIP for semantic guidance might limit the method's ability to capture complex or nuanced semantic relationships. Further exploration of different optimization strategies and loss functions could potentially enhance the method's performance further. image editing, stable diffusion, diffusion models, text-to-image synthesis, latent space optimization
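A hedged sketch of the core idea: optimize the injected noise and select the starting timestep against latent-domain losses. `denoise_from` and `latent_losses` are placeholders for the frozen Stable Diffusion sampler and the paper's latent objectives, and the discrete grid search over timesteps is a simplification of whatever schedule optimization TiNO-Edit actually uses.

```python
import torch

def denoise_from(z_t, t, cond):              # placeholder: frozen SD denoising from step t
    raise NotImplementedError
def latent_losses(z_edit, z_src, cond):      # placeholder: latent-domain alignment losses
    raise NotImplementedError

def tino_style_edit(z_src, cond, alphas_cumprod, candidate_ts=(300, 500, 700), steps=50, lr=1e-2):
    """Optimize the noise pattern (and pick a starting timestep) so the edited
    latent follows the prompt while staying close to the source latent."""
    best = None
    for t in candidate_ts:                                    # coarse search over timesteps
        noise = torch.randn_like(z_src, requires_grad=True)   # learnable noise pattern
        opt = torch.optim.Adam([noise], lr=lr)
        a_t = alphas_cumprod[t]
        for _ in range(steps):
            z_t = a_t.sqrt() * z_src + (1 - a_t).sqrt() * noise
            z_edit = denoise_from(z_t, t, cond)               # must stay differentiable w.r.t. noise
            loss = latent_losses(z_edit, z_src, cond)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if best is None or loss.item() < best[0]:
            best = (loss.item(), t, noise.detach())
    return best
```

Keeping the whole objective in the latent space (rather than decoding to pixels each step) is what gives the reported speed advantage over pixel-domain optimization.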
2404.11098 Report LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models Dingkun Zhang, Sijia Li, Chen Chen, Qingsong Xie, Haonan Lu In the era of AIGC, the demand for low-budget or even on-device applications of diffusion models emerged. In terms of compressing the Stable Diffusion models (SDMs), several approaches have been proposed, and most of them leveraged the handcrafted layer removal methods to obtain smaller U-Nets, along with knowledge distillation to recover the network performance. However, such a handcrafting manner of layer removal is inefficient and lacks scalability and generalization, and the feature distillation employed in the retraining phase faces an imbalance issue that a few numerically significant feature loss terms dominate over others throughout the retraining process. To this end, we proposed the layer pruning and normalized distillation for compressing diffusion models (LAPTOP-Diff). We, 1) introduced the layer pruning method to compress SDM's U-Net automatically and proposed an effective one-shot pruning criterion whose one-shot performance is guaranteed by its good additivity property, surpassing other layer pruning and handcrafted layer removal methods, 2) proposed the normalized feature distillation for retraining, alleviated the imbalance issue. Using the proposed LAPTOP-Diff, we compressed the U-Nets of SDXL and SDM-v1.5 for the most advanced performance, achieving a minimal 4.0% decline in PickScore at a pruning ratio of 50% while the comparative methods' minimal PickScore decline is 8.2%. We will release our code. Presents LAPTOP-Diff, a method for compressing Stable Diffusion Models (SDMs) using layer pruning and normalized distillation. SDMs, while powerful, have high memory consumption and latency, limiting their deployment on resource-constrained devices. Existing compression methods are often handcrafted, inefficient, and lack scalability. 1. Formulates layer pruning as a combinatorial optimization problem and solves it using a one-shot approach with an output loss based pruning criterion. 2. Introduces normalized feature distillation during retraining to alleviate the imbalance issue in feature loss terms. Achieves state-of-the-art performance, outperforming handcrafted layer removal methods. Demonstrates the effectiveness of the output loss criterion, attributed to its strong additivity property. Shows that normalized feature distillation significantly improves performance compared to vanilla distillation. The additivity assumption might not hold for other downstream tasks or datasets. Exploring alternative pruning criteria beyond output loss, task loss, and CLIP score. model compression, diffusion models, layer pruning, knowledge distillation, stable diffusion
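The normalized feature distillation can be sketched as dividing each layer's feature loss by its own detached magnitude, so that no single numerically large term dominates retraining. The exact normalization used in LAPTOP-Diff may differ; this is an illustrative form.

```python
import torch
import torch.nn.functional as F

def normalized_feature_distillation(student_feats, teacher_feats, eps=1e-6):
    """Sum of per-layer feature losses, each rescaled by its own detached value
    so every layer contributes a comparably sized gradient signal."""
    total = torch.zeros(())
    for f_s, f_t in zip(student_feats, teacher_feats):
        layer_loss = F.mse_loss(f_s, f_t)
        total = total + layer_loss / (layer_loss.detach() + eps)
    return total
```

Each rescaled term evaluates to roughly 1, but its gradient is the original gradient divided by the layer's current loss magnitude, which is exactly the rebalancing the imbalance issue calls for.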
2404.10947 Report Residual Connections Harm Self-Supervised Abstract Feature Learning Xiao Zhang, Ruoxi Jiang, William Gao, Rebecca Willett, Michael Maire We demonstrate that adding a weighting factor to decay the strength of identity shortcuts within residual networks substantially improves semantic feature learning in the state-of-the-art self-supervised masked autoencoding (MAE) paradigm. Our modification to the identity shortcuts within a ViT-B/16 backbone of an MAE boosts linear probing accuracy on ImageNet from 67.3% to 72.3%. This significant gap suggests that, while residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing capacity for abstract learning by virtue of injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula for monotonically decreasing the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions, without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find correlation between low effective feature rank and downstream task performance. This paper proposes decayed identity shortcuts for residual networks, improving semantic feature learning in self-supervised masked autoencoding. Residual connections, while good for gradient propagation, can hinder abstract feature learning by injecting shallow representations into deeper layers. The authors introduce a depth-dependent scaling factor to gradually decrease the weight of identity shortcuts as layer depth increases. Boosting linear probing accuracy on ImageNet from 67.3% to 72.3% for a ViT-B/16 backbone in an MAE framework. Smaller models with decayed identity shortcuts outperform larger models with standard residual connections (ViT-S/16 outperforms baseline ViT-B/16). Correlation between low effective feature rank and improved downstream task performance is observed. The optimal decay rate might require tuning for different architectures and datasets. Further theoretical analysis on the relationship between low effective rank and abstract representation learning is needed. self-supervised learning, masked autoencoding, residual networks, representation learning, low-rank features
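The modification is easy to sketch: scale the identity branch by a factor that decreases monotonically with depth. The linear schedule below is an assumption (the paper specifies its own fixed formula), and `block` is assumed to return only the branch output, without its own internal shortcut.

```python
import torch.nn as nn

class DecayedResidualBlock(nn.Module):
    """Residual block whose identity shortcut is scaled by a depth-dependent factor."""
    def __init__(self, block, layer_idx, num_layers, alpha_min=0.5):
        super().__init__()
        self.block = block
        # Monotonically decreasing shortcut weight: 1 at the first layer,
        # alpha_min at the last (the paper's exact schedule may differ).
        self.alpha = 1.0 - (1.0 - alpha_min) * layer_idx / max(num_layers - 1, 1)

    def forward(self, x):
        return self.alpha * x + self.block(x)

def wrap_blocks(blocks, alpha_min=0.5):
    """Wrap a stack of branch modules (e.g. the 12 blocks of a ViT-B/16 encoder)."""
    n = len(blocks)
    return nn.ModuleList(DecayedResidualBlock(b, i, n, alpha_min) for i, b in enumerate(blocks))
```

Because shallow activations are attenuated rather than removed, gradients still propagate through the scaled shortcut, which is why trainability is reported to be unaffected.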
2404.10864 Report Vocabulary-free Image Classification and Semantic Segmentation Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other more complex vision-language models, on classification and semantic segmentation benchmarks, while using much fewer parameters. The paper introduces two novel tasks: Vocabulary-free Image Classification (VIC) and Vocabulary-free Semantic Segmentation (VSS), aiming to classify and segment images without predefined categories. These tasks are crucial for handling scenarios with unknown or evolving semantic contexts, common in real-world applications like autonomous agents in unconstrained environments. The proposed method, Category Search from External Databases (CaSED), leverages a pre-trained vision-language model (VLM) and an external database of image captions to extract candidate categories and score them based on multimodal similarity. CaSED is extended to VSS through various strategies, including DenseCaSED, which processes multi-scale image patches with the VLM and performs local category retrieval and scoring. CaSED and its variants outperform other VLMs in classification benchmarks, achieving higher cluster accuracy, semantic similarity, and semantic IoU. For VSS, CaSED combined with an open-vocabulary segmentation model performs best, while DenseCaSED shows promise despite lacking a dedicated segmentation component. Prompt ensembling consistently improves performance across datasets and tasks. The effectiveness of CaSED depends on the quality and coverage of the retrieval database. Future work includes addressing label inconsistencies, handling class granularity, and improving the computational efficiency of DenseCaSED. vision and language, vocabulary-free classification, vocabulary-free segmentation, cased, densecased
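The retrieve-then-score recipe can be sketched with a CLIP-style model: gather candidate nouns from the most similar captions in the database, then rank them by a combination of image and text similarity. The database access, noun extraction, embedding calls, and score weighting are placeholders, not CaSED's exact implementation.

```python
import torch

def cased_classify(image_emb, caption_embs, captions, text_encoder, extract_nouns, k=10, alpha=0.7):
    """Vocabulary-free classification: candidates come from retrieved captions,
    the winner from multimodal scoring (a simplified stand-in for CaSED).

    image_emb:    (D,) L2-normalized image embedding
    caption_embs: (N, D) L2-normalized caption embeddings of the external database
    """
    # 1) Retrieve the k captions most similar to the image in the shared space.
    sims = caption_embs @ image_emb                               # (N,) cosine similarities
    top = sims.topk(k).indices
    candidates = sorted({noun for i in top for noun in extract_nouns(captions[i])})

    # 2) Score each candidate against both the image and the retrieved captions.
    cand_embs = text_encoder(candidates)                          # (C, D), assumed normalized
    img_score = cand_embs @ image_emb                             # visual match
    txt_score = (cand_embs @ caption_embs[top].T).mean(dim=-1)    # textual match
    score = alpha * img_score + (1 - alpha) * txt_score
    return candidates[int(score.argmax())]
```

Applying the same routine to local crops or patches, as the paper does for Vocabulary-free Semantic Segmentation, amounts to calling this function with patch embeddings instead of a single global image embedding.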
2404.10772 Report Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes Zehao Yu, Torsten Sattler, Andreas Geiger Recently, 3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis results, while allowing the rendering of high-resolution images in real-time. However, leveraging 3D Gaussians for surface reconstruction poses significant challenges due to the explicit and disconnected nature of 3D Gaussians. In this work, we present Gaussian Opacity Fields (GOF), a novel approach for efficient, high-quality, and compact surface reconstruction in unbounded scenes. Our GOF is derived from ray-tracing-based volume rendering of 3D Gaussians, enabling direct geometry extraction from 3D Gaussians by identifying its levelset, without resorting to Poisson reconstruction or TSDF fusion as in previous work. We approximate the surface normal of Gaussians as the normal of the ray-Gaussian intersection plane, enabling the application of regularization that significantly enhances geometry. Furthermore, we develop an efficient geometry extraction method utilizing marching tetrahedra, where the tetrahedral grids are induced from 3D Gaussians and thus adapt to the scene's complexity. Our evaluations reveal that GOF surpasses existing 3DGS-based methods in surface reconstruction and novel view synthesis. Further, it compares favorably to, or even outperforms, neural implicit methods in both quality and speed. Presents Gaussian Opacity Fields (GOF), a novel approach for efficient, high-quality, and compact surface reconstruction in unbounded scenes using 3D Gaussians. Addresses limitations of existing 3D Gaussian surface reconstruction methods that struggle with fine-grained geometry, background reconstruction, and rely on computationally expensive or inconsistent post-processing techniques like Poisson reconstruction or TSDF fusion. 1. Establishes a Gaussian opacity field consistent with volume rendering, enabling direct surface extraction via level set identification. 2. Employs ray-Gaussian intersection normals for regularization, enhancing geometry reconstruction. 3. Develops an efficient tetrahedra-based mesh extraction method using 3D Gaussian positions and scales, resulting in compact and adaptive meshes. GOF outperforms existing 3DGS-based methods in surface reconstruction and novel view synthesis on Tanks and Temples, DTU, and Mip-NeRF 360 datasets. GOF achieves competitive surface reconstruction quality compared to SOTA neural implicit methods while being significantly faster. Ablation studies confirm the effectiveness of GOF's mesh extraction, regularization, and decoupled appearance modeling. Delaunay triangulation for tetrahedral grid generation poses a computational bottleneck. Opacity evaluation during marching tetrahedra binary search could be optimized. 3d gaussian splatting, surface reconstruction, novel view synthesis, unbounded scenes, mesh extraction
2404.10765 Report RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting Ashkan Mirzaei, Riccardo De Lutio, Seung Wook Kim, David Acuna, Jonathan Kelly, Sanja Fidler, Igor Gilitschenski, Zan Gojcic Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability is still posing a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore not only enable high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in a lower variance of score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse view reconstruction. Introduces RefFusion, a novel 3D scene inpainting method using multi-scale personalization of an image inpainting diffusion model, achieving high-quality, controllable inpaintings. 3D scene inpainting is crucial for editing neural scene representations, but existing methods struggle with balancing controllability, detail, and multi-view consistency. RefFusion adapts an inpainting diffusion model to a reference view, then distills its priors to the 3D scene using a multi-scale score distillation objective. It further leverages Gaussian splatting to isolate masked regions and applies depth and adversarial regularization. Outperforms previous 3D inpainting methods on the SPIn-NeRF dataset in both quantitative metrics and user studies. Demonstrates superior performance on scenes with large camera motion compared to single-view reference-based approaches. Shows generalization capabilities for object insertion, sparse view reconstruction, and scene outpainting. Removing large objects covering significant portions of the reference image remains challenging. Personalizing the diffusion model can be time-consuming. 3d inpainting, diffusion models, score distillation sampling, gaussian splatting, neural scene representation
2404.10763 Report LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation. Introduces LaDiC, a novel diffusion-based image captioning model utilizing a split BERT for a dedicated text latent space and a regularization module for variable text lengths, outperforming previous diffusion methods. Addresses the limitations of auto-regressive models in image captioning, such as slow inference speed, error propagation, and unidirectional constraints, while overcoming the shortcomings of existing diffusion-based methods. Employs a split BERT to create a text latent space, trains a diffuser to map image representations to this latent space, and uses a Non-Auto-Regressive (NAR) decoder to generate captions. Introduces Back&Refine for improved token interactivity during inference. Achieves state-of-the-art performance for diffusion-based methods on MS COCO, with 38.2 BLEU@4 and 126.2 CIDEr. Demonstrates faster inference speed compared to auto-regressive models, especially for longer captions. Exhibits flexibility in caption generation, enabling custom generation based on tokens in nearly any position. Limited exploration of other modalities and pure text generation. Reliance on relatively small model parameters and datasets compared to large-scale autoregressive models. image captioning, diffusion models, non-autoregressive generation, vision-language models, bert
2404.10716 Report MOWA: Multiple-in-One Image Warping Model Kang Liao, Zongsheng Yue, Zhonghua Wu, Chen Change Loy While recent image warping approaches achieved remarkable success on existing benchmarks, they still require training separate models for each specific task and cannot generalize well to different camera models or customized manipulations. To address diverse types of warping in practice, we propose a Multiple-in-One image WArping model (named MOWA) in this work. Specifically, we mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level. To further enable dynamic task-aware image warping, we introduce a lightweight point-based classifier that predicts the task type, serving as prompts to modulate the feature maps for better estimation. To our knowledge, this is the first work that solves multiple practical warping tasks in one single model. Extensive experiments demonstrate that our MOWA, which is trained on six tasks for multiple-in-one image warping, outperforms state-of-the-art task-specific models across most tasks. Moreover, MOWA also exhibits promising potential to generalize into unseen scenes, as evidenced by cross-domain and zero-shot evaluations. The code will be made publicly available. This paper proposes MOWA, the first practical multiple-in-one image warping framework that can address various warping tasks within a single model. Existing image warping approaches require training separate models for each task and lack generalization ability. MOWA tackles these limitations by enabling a single model to handle diverse warping tasks. MOWA disentangles motion estimation at region and pixel levels using TPS transformation and residual flow. It employs a lightweight point-based classifier for task-type prediction and a prompt learning module for task-aware warping. MOWA outperforms state-of-the-art task-specific models on most of the six evaluated warping tasks. It exhibits promising generalization to unseen scenes, as demonstrated by cross-domain and zero-shot evaluations. The hierarchical motion estimation and task-aware prompt learning strategy contribute to MOWA's effectiveness in multi-task image warping. MOWA may struggle with extremely complex image boundaries due to the limited number of control points. Scaling up the input resolution could potentially improve the warping performance, which is left for future work. image warping, multiple-in-one model, prompt learning, tps transformation, computational photography
2404.10700 Report Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs Georgy Perevozchikov, Nancy Mehta, Mahmoud Afifi, Radu Timofte Modern smartphone camera quality heavily relies on the image signal processor (ISP) to enhance captured raw images, utilizing carefully designed modules to produce final output images encoded in a standard color space (e.g., sRGB). Neural-based end-to-end learnable ISPs offer promising advancements, potentially replacing traditional ISPs with their ability to adapt without requiring extensive tuning for each new camera model, as is often the case for nearly every module in traditional ISPs. However, the key challenge with the recent learning-based ISPs is the urge to collect large paired datasets for each distinct camera model due to the influence of intrinsic camera characteristics on the formation of input raw images. This paper tackles this challenge by introducing a novel method for unpaired learning of raw-to-raw translation across diverse cameras. Specifically, we propose Rawformer, an unsupervised Transformer-based encoder-decoder method for raw-to-raw translation. It accurately maps raw images captured by a certain camera to the target camera, facilitating the generalization of learnable ISPs to new unseen cameras. Our method demonstrates superior performance on real camera datasets, achieving higher accuracy compared to previous state-of-the-art techniques, and preserving a more robust correlation between the original and translated raw images. This paper introduces Rawformer, a novel unsupervised Transformer-based method for unpaired raw-to-raw image translation across diverse cameras, enabling generalization of learnable ISPs to unseen cameras without retraining. Modern smartphone camera ISPs require extensive tuning for each new camera model. Rawformer addresses this by enabling the use of pre-trained neural-based ISPs on new cameras without the need for paired datasets from each new model, simplifying ISP development and reducing costs. Rawformer utilizes an unsupervised encoder-decoder Transformer architecture with contextual-scale aware downsampler and upsampler blocks for efficient encoding of global and local image information. It also introduces a cross-domain attention-driven discriminator for stable training. Rawformer achieves state-of-the-art results on raw-to-raw translation benchmarks, significantly outperforming previous methods. The method effectively maps raw images to the target camera's raw space, enabling accurate rendering using neural-based ISPs trained on different camera models. Rawformer demonstrates robust improvement in cross-camera ISP rendering, with only a marginal reduction in accuracy compared to camera-specific ISP models. The model's inference time, while feasible on GPUs, may be impractical for real-time rendering on devices with limited computational power. Future work will focus on developing lighter models for real-time performance on CPUs, broadening its applicability. raw image processing, image signal processor (isp), unsupervised learning, domain adaptation, transformers
2404.10690 Report MathWriting: A Dataset For Handwritten Mathematical Expression Recognition Philippe Gervais, Asya Fadeeva, Andrii Maksai We introduce MathWriting, the largest online handwritten mathematical expression dataset to date. It consists of 230k human-written samples and an additional 400k synthetic ones. MathWriting can also be used for offline HME recognition and is larger than all existing offline HME datasets like IM2LATEX-100K. We introduce a benchmark based on MathWriting data in order to advance research on both online and offline HME recognition. Introduces MathWriting, the largest online handwritten mathematical expression dataset to date, containing 230k human-written and 400k synthetic samples, along with a benchmark for online and offline HME recognition. Addresses the lack of large, diverse datasets for handwritten mathematical expression recognition, crucial for advancing research and development in this area. Collected human-written expressions using an Android app, synthesized expressions by stitching together isolated symbols, normalized LaTeX labels, and split data into train/validation/test sets. MathWriting significantly expands the size and symbol coverage compared to existing datasets like CROHME23. Benchmark results show superior performance of online recognition models (CTC Transformer, PaLI) over offline methods (OCR). Analysis reveals common recognition errors include character confusion and incorrect subexpression nesting. Label normalization, while improving model performance, could be further refined for specific applications. Inherent ambiguities in handwritten expressions pose challenges for achieving human-level recognition accuracy. handwriting recognition, mathematical expressions, dataset, benchmark, latex
2404.10685 Report Generating Human Interaction Motions in Scenes with Text Control Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, Davis Rempe We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at https://research.nvidia.com/labs/toronto-ai/tesmo. This paper introduces TeSMo, a novel text-controlled and scene-aware method for generating human-scene interaction motions based on denoising diffusion models. Generating realistic and controllable human motion within 3D scenes is crucial for various applications, from gaming to embodied AI. Previous methods struggle to simultaneously offer text controllability and scene awareness with high motion quality, especially for complex interactions. The approach decomposes the task into navigation and interaction stages, each using a diffusion model with an augmented scene-aware branch. First, a scene-agnostic text-to-motion model is trained on large-scale motion capture data. Then, a separate branch is fine-tuned with scene information (2D floor maps for navigation and 3D object geometry for interaction) using data augmented with realistic interactions in scenes. The method generates plausible motions that navigate through scenes, avoid obstacles, and interact realistically with objects, all while adhering to textual descriptions. Experiments demonstrate superior goal-reaching accuracy and fewer object penetrations compared to state-of-the-art methods. A user study reveals a preference for interactions generated by TeSMo over a leading reinforcement learning approach. The two-stage navigation approach can lead to inconsistencies between the generated pelvis trajectory and full-body poses. The current method is limited to static objects and a fixed set of actions. motion synthesis, human-scene interaction, diffusion models, text-to-motion, scene-aware motion generation
2404.10667 Report VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip. Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors. This paper introduces VASA, a framework that generates realistic talking face videos from a single static image and a speech audio clip, featuring accurate lip synchronization, expressive facial dynamics, and natural head movements. Realistic AI-generated talking faces have broad applications in communication, education, healthcare, and beyond, enhancing human-computer interaction and accessibility. The method constructs an expressive and disentangled face latent space from videos. It then uses a Diffusion Transformer model to generate holistic facial dynamics and head movements in this latent space, conditioned on audio and optional control signals like gaze direction and emotion offset. A face decoder then generates video frames based on these latent motions and the input image. The method achieves superior audio-lip synchronization, significantly outperforming existing methods. It generates more natural and varied head movements synchronized with the audio compared to previous approaches. VASA produces high-quality videos with realistic facial expressions and subtle nuances like eye blinks and gaze shifts, achieving state-of-the-art video quality scores (FVD). The method currently only models human regions up to the torso and lacks explicit modeling of non-rigid elements like hair. Incorporating more diverse talking styles and emotions in the training data could further enhance expressiveness and control. talking face generation, audio-driven animation, diffusion models, latent space representation, visual affective skills
2404.10625 Report Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks Florian Barthel, Arian Beckmann, Wieland Morgenstern, Anna Hilsmann, Peter Eisert NeRF-based 3D-aware Generative Adversarial Networks (GANs) like EG3D or GIRAFFE have shown very high rendering quality under large representational variety. However, rendering with Neural Radiance Fields poses challenges for 3D applications: First, the significant computational demands of NeRF rendering preclude its use on low-power devices, such as mobiles and VR/AR headsets. Second, implicit representations based on neural networks are difficult to incorporate into explicit 3D scenes, such as VR environments or video games. 3D Gaussian Splatting (3DGS) overcomes these limitations by providing an explicit 3D representation that can be rendered efficiently at high frame rates. In this work, we present a novel approach that combines the high rendering quality of NeRF-based 3D-aware GANs with the flexibility and computational advantages of 3DGS. By training a decoder that maps implicit NeRF representations to explicit 3D Gaussian Splatting attributes, we can integrate the representational diversity and quality of 3D GANs into the ecosystem of 3D Gaussian Splatting for the first time. Additionally, our approach allows for a high resolution GAN inversion and real-time GAN editing with 3D Gaussian Splatting scenes. This paper presents a novel method for synthesizing explicit 3D scenes of human heads from a latent space by combining the advantages of 3D-aware GANs (high quality and representational variety) and 3D Gaussian Splatting (efficient rendering and flexibility). This approach addresses the limitations of NeRF-based GANs, which are difficult to integrate into 3D modeling environments due to their implicit representations and slow rendering speeds. The method involves training a sequential decoder network that maps implicit NeRF representations from a pre-trained 3D GAN to explicit 3D Gaussian Splatting attributes (position, color, rotation, scale, and opacity). The decoder leverages the geometric information from the GAN's tri-plane features for position initialization and employs a combination of loss functions for training. The proposed method achieves high visual similarity between the decoded Gaussian Splatting scenes and the target GAN renderings. It achieves rendering speeds up to 5 times faster than the target GANs with the flexibility of arbitrary rendering resolutions. It enables the application of GAN editing and inversion methods to explicit 3D Gaussian Splatting scenes. The output fidelity of the method is currently limited by the fidelity of the pre-trained 3D GAN used. The lack of view-dependent spherical harmonics in the decoder can lead to uncanny or blurry eye renderings. 3d gaussian splatting, 3d-aware gans, neural radiance fields, 3d head synthesis, real-time rendering
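The decoder idea above reduces to a head that turns per-point features into the five explicit 3DGS attributes with the usual per-attribute activations. A minimal PyTorch sketch follows; the feature dimension, hidden size, offset scale, and the `GaussianHead` name are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch: map per-point tri-plane features to explicit 3D Gaussian Splatting
# attributes (position offset, color, rotation, scale, opacity). Assumed dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    def __init__(self, feat_dim=96, hidden=128):
        super().__init__()
        # 3 xyz-offset + 3 color + 4 quaternion + 3 scale + 1 opacity = 14 outputs
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 14))

    def forward(self, feats, init_xyz):
        out = self.mlp(feats)
        xyz = init_xyz + 0.01 * torch.tanh(out[:, 0:3])   # offset from geometry-based init
        color = torch.sigmoid(out[:, 3:6])
        rotation = F.normalize(out[:, 6:10], dim=-1)       # unit quaternion
        scale = torch.exp(out[:, 10:13])
        opacity = torch.sigmoid(out[:, 13:14])
        return xyz, color, rotation, scale, opacity

head = GaussianHead()
attrs = head(torch.randn(1000, 96), torch.rand(1000, 3))
```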
2404.10618 Report Private Attribute Inference from Images with Vision-Language Models Batuhan Tömekçe, Mark Vero, Robin Staab, Martin Vechev As large language models (LLMs) become ubiquitous in our daily tasks and digital interactions, associated privacy risks are increasingly in focus. While LLM privacy research has primarily focused on the leakage of model training data, it has recently been shown that the increase in models' capabilities has enabled LLMs to make accurate privacy-infringing inferences from previously unseen texts. With the rise of multimodal vision-language models (VLMs), capable of understanding both images and text, a pertinent question is whether such results transfer to the previously unexplored domain of benign images posted online. To investigate the risks associated with the image reasoning capabilities of newly emerging VLMs, we compile an image dataset with human-annotated labels of the image owner's personal attributes. In order to understand the additional privacy risk posed by VLMs beyond traditional human attribute recognition, our dataset consists of images where the inferable private attributes do not stem from direct depictions of humans. On this dataset, we evaluate the inferential capabilities of 7 state-of-the-art VLMs, finding that they can infer various personal attributes at up to 77.6% accuracy. Concerningly, we observe that accuracy scales with the general capabilities of the models, implying that future models can be misused as stronger adversaries, establishing an imperative for the development of adequate defenses. This paper presents the first investigation into the privacy risks posed by Vision-Language Models (VLMs) inferring personal information from images posted on pseudonymized platforms. With the increasing adoption of VLMs, their ability to deduce private information from seemingly innocuous images raises significant privacy concerns that challenge current online privacy understandings. The authors created a dataset of images and annotated them with personal attributes, then tested 7 state-of-the-art VLMs on their ability to infer these attributes. They also developed methods to circumvent safety filters and enhance inference accuracy. Both proprietary and open-source VLMs demonstrated the ability to infer private attributes from images with high accuracy (up to 77.6%). Current safety filters in VLMs are easily circumvented, even with simple prompt engineering techniques. Inference accuracy is strongly correlated with a model's general capabilities, suggesting future, more powerful models will pose a greater privacy risk. The dataset used for evaluation, while reflecting real-world data, was not released publicly due to privacy concerns. Future work could focus on developing robust defenses against VLM-based privacy inferences, potentially through user-side and model provider-side mitigations. privacy, vision-language models, personal attribute inference, safety filters, online privacy
2404.10603 Report Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences Seungwook Kim, Kejie Li, Xueqing Deng, Yichun Shi, Minsu Cho, Peng Wang Leveraging multi-view diffusion models as priors for 3D optimization have alleviated the problem of 3D consistency, e.g., the Janus face problem or the content drift problem, in zero-shot text-to-3D models. However, the 3D geometric fidelity of the output remains an unresolved issue; albeit the rendered 2D views are realistic, the underlying geometry may contain errors such as unreasonable concavities. In this work, we propose CorrespondentDream, an effective method to leverage annotation-free, cross-view correspondences yielded from the diffusion U-Net to provide additional 3D prior to the NeRF optimization process. We find that these correspondences are strongly consistent with human perception, and by adopting it in our loss design, we are able to produce NeRF models with geometries that are more coherent with common sense, e.g., more smoothed object surface, yielding higher 3D fidelity. We demonstrate the efficacy of our approach through various comparative qualitative results and a solid user study. This paper introduces CorrespondentDream, a method that improves the 3D geometric fidelity of text-to-3D generation by leveraging cross-view correspondences from multi-view diffusion models. Existing text-to-3D methods often produce models with geometric inconsistencies despite generating realistic 2D views. CorrespondentDream addresses this issue by incorporating 3D geometric priors during the optimization process. CorrespondentDream extracts features from a pre-trained multi-view diffusion model and computes cross-view correspondences between adjacent NeRF-rendered views. These correspondences are then used to guide the NeRF optimization process and correct geometric inconsistencies. CorrespondentDream effectively removes 3D geometric errors such as unnatural concavities and missing surfaces, as demonstrated through qualitative results. The method outperforms baseline models in a user study, with participants preferring its output in terms of 3D fidelity and overall quality. Analysis shows that alternating optimization using both Score Distillation Sampling (SDS) loss and cross-view correspondence loss is crucial for the method's effectiveness. The alternating optimization strategy increases the number of optimization iterations, potentially affecting computational efficiency. The method may struggle with objects containing shiny homogeneous surfaces or repetitive patterns, as it becomes challenging to establish robust correspondences in such cases. text-to-3d generation, diffusion models, nerf, cross-view correspondence, 3d geometric fidelity
2404.10518 Report MobileNetV4 -- Universal Models for the Mobile Ecosystem Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, Vaibhav Aggarwal, Tenghui Zhu, Daniele Moro, Andrew Howard We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe is also introduced which improves MNv4 search effectiveness. The integration of UIB, Mobile MQA and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, as well as specialized accelerators like Apple Neural Engine and Google Pixel EdgeTPU - a characteristic not found in any other models tested. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8ms. Introduces MobileNetV4 (MNv4), a series of universally efficient architecture designs for mobile devices, featuring the Universal Inverted Bottleneck (UIB) and Mobile MQA, achieving mostly Pareto optimal performance across diverse hardware platforms. Efficient on-device neural networks are crucial for fast, real-time, and interactive experiences on mobile devices while addressing privacy concerns by avoiding streaming of private data. Develops UIB block unifying prominent micro-architectures and Mobile MQA optimized for mobile accelerators. Employs a refined two-phase NAS approach for efficient architecture search and introduces a novel distillation technique mixing datasets with different augmentations. MNv4 models demonstrate mostly Pareto-optimal performance across CPUs, DSPs, GPUs, and specialized accelerators. MNv4-Conv-M achieves over 50% speedup compared to MobileOne-S4 and FastViT-S12 on EdgeTPUs at a similar accuracy level. MNv4-Hybrid-L achieves 87% top-1 accuracy on ImageNet-1K, only a 0.5% drop compared to its teacher model, EfficientNet-L2, despite having 39x less MACs. MNv4-Hybrid models lack compatibility with DSPs. Future work can explore integrating other state-of-the-art techniques and exploring model scaling for even higher accuracy. mobilenet, neural architecture search, model efficiency, on-device ai, computer vision
2404.10484 Report AbsGS: Recovering Fine Details for 3D Gaussian Splatting Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou 3D Gaussian Splatting (3D-GS) technique couples 3D Gaussian primitives with differentiable rasterization to achieve high-quality novel view synthesis results while providing advanced real-time rendering performance. However, due to the flaw of its adaptive density control strategy in 3D-GS, it frequently suffers from over-reconstruction issue in intricate scenes containing high-frequency details, leading to blurry rendered images. The underlying reason for the flaw has still been under-explored. In this work, we present a comprehensive analysis of the cause of aforementioned artifacts, namely gradient collision, which prevents large Gaussians in over-reconstructed regions from splitting. To address this issue, we propose the novel homodirectional view-space positional gradient as the criterion for densification. Our strategy efficiently identifies large Gaussians in over-reconstructed regions, and recovers fine details by splitting. We evaluate our proposed method on various challenging datasets. The experimental results indicate that our approach achieves the best rendering quality with reduced or similar memory consumption. Our method is easy to implement and can be incorporated into a wide variety of most recent Gaussian Splatting-based methods. We will open source our codes upon formal publication. Our project page is available at: https://ty424.github.io/AbsGS.github.io/ This paper proposes AbsGS, a novel method to recover fine details in 3D Gaussian Splatting by addressing the issue of over-reconstruction, where large Gaussians inadequately represent high-frequency details, leading to blurry rendering. 3D Gaussian Splatting (3D-GS) is a powerful technique for novel view synthesis, but its adaptive density control strategy struggles to accurately represent intricate scenes due to over-reconstruction artifacts. AbsGS introduces the use of a "homodirectional view-space positional gradient" to guide the densification process. By taking the absolute value of each pixel-wise sub-gradient before summation, this method mitigates "gradient collision," allowing for accurate identification and splitting of large Gaussians in over-reconstructed regions. AbsGS consistently outperforms baselines like Mip-NeRF360 and Instant-NGP in novel view synthesis quality, as evidenced by higher SSIM and PSNR and lower LPIPS scores on datasets like Mip-NeRF 360, Tanks & Temples, and Deep Blending. The method effectively eliminates large Gaussians in over-reconstructed areas, resulting in sharper details and less blurriness compared to 3D-GS, as visualized through point cloud and ellipsoid representations. AbsGS achieves superior results with similar or even reduced memory consumption compared to 3D-GS, demonstrating its efficiency in addressing over-reconstruction without relying on a significantly larger number of Gaussians. The paper primarily focuses on improving the split operation for densification, leaving the exploration of applying the homodirectional gradient to the clone operation for future work. Further investigation into the impact of scale threshold and gradient threshold on the performance of AbsGS, particularly in relation to different scene complexities, is warranted. novel view synthesis, 3d gaussian splatting, point-based radiance field, 3d reconstruction, densification strategy
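The homodirectional criterion described above amounts to taking the absolute value of each per-pixel view-space sub-gradient before accumulation, so sub-gradients with opposite signs no longer cancel. A minimal sketch under assumed tensor shapes and a hypothetical threshold (not the released code):

```python
# Compare the vanilla summed-gradient criterion with an AbsGS-style homodirectional one.
import torch

def densification_scores(pixel_grads: torch.Tensor):
    """pixel_grads: (num_gaussians, num_pixels, 2) per-pixel view-space positional
    gradients accumulated for each Gaussian (illustrative layout)."""
    # Vanilla 3D-GS: opposing sub-gradients cancel ("gradient collision").
    summed = pixel_grads.sum(dim=1).norm(dim=-1)
    # Homodirectional variant: absolute values first, so large Gaussians covering
    # high-frequency detail still receive a large densification score.
    homodirectional = pixel_grads.abs().sum(dim=1).norm(dim=-1)
    return summed, homodirectional

grads = torch.randn(4, 128, 2)
vanilla, homo = densification_scores(grads)
split_mask = homo > 0.0002   # hypothetical gradient threshold for splitting
```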
2404.10441 Report 1st Place Solution for ICCV 2023 OmniObject3D Challenge: Sparse-View Reconstruction Hang Du, Yaping Xue, Weidong Dai, Xuejun Yan, Jingjing Wang In this report, we present the 1st place solution for ICCV 2023 OmniObject3D Challenge: Sparse-View Reconstruction. The challenge aims to evaluate approaches for novel view synthesis and surface reconstruction using only a few posed images of each object. We utilize Pixel-NeRF as the basic model, and apply depth supervision as well as coarse-to-fine positional encoding. The experiments demonstrate the effectiveness of our approach in improving sparse-view reconstruction quality. We ranked first in the final test with a PSNR of 25.44614. The paper presents the 1st place solution for the ICCV 2023 OmniObject3D Challenge Track-1 Sparse-View Reconstruction, achieving a PSNR of 25.44614 on the final test set. The challenge addresses the difficult task of novel view synthesis and surface reconstruction from a limited number of input images (1-3), which has significant implications for various applications. The solution utilizes a Pixel-NeRF model pre-trained on a curated subset of the OmniObject3D dataset and fine-tuned on each test scene. It incorporates depth supervision and coarse-to-fine positional encoding to improve reconstruction quality. Training on a representative subset of 48 object categories selected from the OmniObject3D dataset outperforms training on a smaller, less diverse subset. Depth supervision and coarse-to-fine positional encoding further improve the fidelity of surface reconstruction. Fine-tuning the pre-trained model on each test scene significantly boosts performance, highlighting the importance of adapting to scene-specific characteristics. The study is limited by computational resources, particularly when evaluating different test-time optimization strategies. Future work could explore more advanced techniques for pre-training and fine-tuning NeRF models, as well as investigate alternative network architectures. sparse-view reconstruction, novel view synthesis, nerf, pixel-nerf, omniobject3d challenge
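One common way to realize the coarse-to-fine positional encoding mentioned above is to anneal the frequency bands over training so that low frequencies are used first. The sketch below assumes a BARF-style linear schedule; the shapes and schedule are illustrative, not the challenge entry's exact recipe.

```python
# Annealed (coarse-to-fine) positional encoding: frequency bands are gradually unmasked.
import torch

def annealed_positional_encoding(x, num_freqs, progress):
    """x: (N, 3) points; progress in [0, 1] is the fraction of training completed."""
    freqs = 2.0 ** torch.arange(num_freqs).float()
    # Each band's weight ramps from 0 to 1 as training proceeds (low frequencies first).
    alpha = progress * num_freqs
    weights = torch.clamp(alpha - torch.arange(num_freqs).float(), 0.0, 1.0)
    scaled = x[..., None, :] * freqs[:, None]                      # (N, num_freqs, 3)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return (weights[:, None] * enc).flatten(start_dim=1)           # (N, num_freqs * 6)

features = annealed_positional_encoding(torch.rand(1024, 3), num_freqs=10, progress=0.3)
```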
2404.10438 Report The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement Gabriele Trivigno, Carlo Masone, Barbara Caputo, Torsten Sattler Pose refinement is an interesting and practically relevant research direction. Pose refinement can be used to (1) obtain a more accurate pose estimate from an initial prior (e.g., from retrieval), (2) as pre-processing, i.e., to provide a better starting point to a more expensive pose estimator, (3) as post-processing of a more accurate localizer. Existing approaches focus on learning features / scene representations for the pose refinement task. This involves training an implicit scene representation or learning features while optimizing a camera pose-based loss. A natural question is whether training specific features / representations is truly necessary or whether similar results can be already achieved with more generic features. In this work, we present a simple approach that combines pre-trained features with a particle filter and a renderable representation of the scene. Despite its simplicity, it achieves state-of-the-art results, demonstrating that one can easily build a pose refiner without the need for specific training. The code is at https://github.com/ga1i13o/mcloc_poseref This paper introduces a novel pose refinement approach that leverages pre-trained, generic deep features for visual localization, challenging the necessity of specialized features in pose refinement. Existing pose refinement techniques often rely on computationally expensive and scene-specific feature learning. This work explores the use of readily available pre-trained features for a more efficient and generalizable solution. The method integrates a pre-trained CNN with a particle filter optimizer within a render-and-compare framework. It utilizes a coarse-to-fine strategy, progressively refining pose estimates by comparing rendered views with query images using features from deeper to shallower layers. Despite its simplicity, the approach achieves state-of-the-art results on benchmark datasets, demonstrating the effectiveness of generic features for pose refinement. The method proves robust to rendering domain shifts, indicating its applicability across diverse scene representations. It can be seamlessly integrated with existing localization pipelines, either as pre-processing for coarse pose estimation or post-processing for refinement, further enhancing their performance. The method's performance faces challenges in indoor environments with repetitive, texture-less surfaces, where perceptual similarity cues are limited. Future work can explore incorporating task-specific fine-tuning during test time to potentially further improve accuracy, capitalizing on the strengths of both generic and specialized features. visual localization, pose refinement, deep features, particle filter, render-and-compare
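The render-and-compare refinement above can be sketched as a particle filter over pose hypotheses scored by similarity of pre-trained features. In the sketch, `render_view` and `extract_features` are placeholders for a scene renderer and a frozen CNN; the 6-DoF parameterization, noise schedule, and particle count are assumptions.

```python
# Schematic particle-filter loop for render-and-compare pose refinement.
import torch

def refine_pose(initial_pose, query_feat, render_view, extract_features,
                num_particles=64, iterations=5, noise=0.05):
    particles = initial_pose.unsqueeze(0) + noise * torch.randn(num_particles, 6)
    best = initial_pose
    for _ in range(iterations):
        scores = torch.stack([
            torch.nn.functional.cosine_similarity(
                extract_features(render_view(p)).flatten(),
                query_feat.flatten(), dim=0)
            for p in particles])
        best = particles[scores.argmax()]                 # keep the best hypothesis so far
        weights = torch.softmax(scores / 0.1, dim=0)
        idx = torch.multinomial(weights, num_particles, replacement=True)
        noise *= 0.5                                      # coarse-to-fine: shrink perturbations
        particles = particles[idx] + noise * torch.randn_like(particles)
    return best

# Dummy usage with stand-in renderer/feature extractor.
pose = refine_pose(torch.zeros(6), torch.randn(128),
                   render_view=lambda p: torch.randn(1, 3, 64, 64),
                   extract_features=lambda img: torch.randn(128))
```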
2404.10394 Report Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior Yiqian Wu, Hao Xu, Xiangjun Tang, Xien Chen, Siyu Tang, Zhebin Zhang, Chen Li, Xiaogang Jin Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry prior and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a novel neural rendering-based framework with a novel joint geometry-appearance prior to achieve text-to-3D-portrait generation that overcomes the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360° canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic color and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt. This paper proposes Portrait3D, a text-guided 3D portrait generation framework that leverages 3D-aware GANs to provide robust joint geometry-appearance prior information. Existing neural rendering-based text-to-3D-portrait generation methods often lead to issues like inconsistent textures, over-saturation, and over-smoothing due to relying solely on geometry priors. The method trains a 3D portrait generator (3DPortraitGAN-Pyramid) with a novel pyramid tri-grid 3D representation to alleviate artifacts. For text-to-3D generation, it first projects a randomly generated image (aligned with the text prompt) to 3DPortraitGAN-Pyramid's latent space. The resulting latent code synthesizes a pyramid tri-grid, which is then refined via score distillation sampling and a diffusion model. Generates high-quality, realistic 3D portraits consistent with text prompts. Successfully mitigates issues like over-saturation and the Janus problem. Demonstrates superior performance compared to state-of-the-art methods in both qualitative and quantitative evaluations. Generated portraits may exhibit distortions if the initial inversion from the 3D portrait generator is not perfectly canonical. Semantic attributes of the background in the input prompt can sometimes influence the final 3D portrait results. 3d portrait generation, 3d-aware gans, diffusion models, neural rendering, text-to-3d synthesis
2404.10342 Report Referring Flexible Image Restoration Runwei Guan, Rongsheng Hu, Zhuhao Zhou, Tianlang Xue, Ka Lok Man, Jeremy Smith, Eng Gee Lim, Weiping Ding, Yutao Yue In reality, images often exhibit multiple degradations, such as rain and fog at night (triple degradations). However, in many cases, individuals may not want to remove all degradations, for instance, a blurry lens revealing a beautiful snowy landscape (double degradations). In such scenarios, people may only desire to deblur. These situations and requirements shed light on a new challenge in image restoration, where a model must perceive and remove specific degradation types specified by human commands in images with multiple degradations. We term this task Referring Flexible Image Restoration (RFIR). To address this, we first construct a large-scale synthetic dataset called RFIR, comprising 153,423 samples with the degraded image, text prompt for specific degradation removal and restored image. RFIR consists of five basic degradation types: blur, rain, haze, low light and snow while six main sub-categories are included for varying degrees of degradation removal. To tackle the challenge, we propose a novel transformer-based multi-task model named TransRFIR, which simultaneously perceives degradation types in the degraded image and removes specific degradation upon text prompt. TransRFIR is based on two devised attention modules, Multi-Head Agent Self-Attention (MHASA) and Multi-Head Agent Cross Attention (MHACA), where MHASA and MHACA introduce the agent token and reach the linear complexity, achieving lower computation cost than vanilla self-attention and cross-attention and obtaining competitive performances. Our TransRFIR achieves state-of-the-art performances compared with other counterparts and is proven as an effective architecture for image restoration. We release our project at https://github.com/GuanRunwei/FIR-CP. This paper introduces Referring Flexible Image Restoration (RFIR), a novel task focused on removing specific image degradations based on user-provided text prompts. Current image restoration models struggle to selectively remove degradations in images with multiple degradation types. RFIR aims to address this limitation by enabling user-controlled, flexible restoration according to specific preferences. The authors create a large-scale synthetic dataset, RFIR, containing images with single, double, and triple degradations along with text prompts specifying which degradation(s) to remove. They also propose TransRFIR, a multi-task transformer-based model, that simultaneously predicts degradation types and performs text-guided image restoration. TransRFIR utilizes novel, computationally efficient attention modules, MHASA and MHACA. TransRFIR achieves state-of-the-art performance on the RFIR dataset, outperforming adapted task-agnostic, all-in-one, and text-driven models. The proposed MHASA and MHACA modules demonstrate both computational efficiency and effectiveness for self-attention and feature fusion, respectively. The TransRFIR pipeline exhibits good generalization capabilities, achieving competitive results on benchmark datasets for deblurring, deraining, low-light enhancement, and dehazing. The current pipeline is primarily designed for U-Net-based architectures and may not be directly applicable to GAN-based or diffusion-based models. The RFIR dataset is synthetic, potentially limiting its ability to fully represent real-world degradation complexities. referring flexible image restoration, multi-modal learning, cross attention, prompt learning, image restoration
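The agent-attention modules above (MHASA/MHACA) reach linear complexity by letting a small set of agent tokens mediate between queries and keys/values. A single-head sketch of the general idea, with assumed token counts and dimensions (not the paper's exact modules):

```python
# Agent-token attention: agents first pool the keys/values, then queries attend to agents,
# so cost scales linearly with the numbers of query and key tokens for a fixed agent count.
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, agents):
    """q: (N, d) queries, k/v: (M, d) keys/values, agents: (A, d) with A << N, M."""
    d = q.shape[-1]
    agent_summary = F.softmax(agents @ k.t() / d**0.5, dim=-1) @ v    # (A, d)
    out = F.softmax(q @ agents.t() / d**0.5, dim=-1) @ agent_summary  # (N, d)
    return out

q = torch.randn(4096, 64)      # image tokens
k = v = torch.randn(77, 64)    # text-prompt tokens (cross-attention case)
agents = torch.randn(16, 64)   # learned agent tokens
fused = agent_attention(q, k, v, agents)
```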
2404.10318 Report SRGS: Super-Resolution 3D Gaussian Splatting Xiang Feng, Yongbo He, Yubo Wang, Yan Yang, Zhenzhong Kuang, Yu Jun, Jianping Fan, Jiajun ding Recently, 3D Gaussian Splatting (3DGS) has gained popularity as a novel explicit 3D representation. This approach relies on the representation power of Gaussian primitives to provide a high-quality rendering. However, primitives optimized at low resolution inevitably exhibit sparsity and texture deficiency, posing a challenge for achieving high-resolution novel view synthesis (HRNVS). To address this problem, we propose Super-Resolution 3D Gaussian Splatting (SRGS) to perform the optimization in a high-resolution (HR) space. The sub-pixel constraint is introduced for the increased viewpoints in HR space, exploiting the sub-pixel cross-view information of the multiple low-resolution (LR) views. The gradient accumulated from more viewpoints will facilitate the densification of primitives. Furthermore, a pre-trained 2D super-resolution model is integrated with the sub-pixel constraint, enabling these dense primitives to learn faithful texture features. In general, our method focuses on densification and texture learning to effectively enhance the representation ability of primitives. Experimentally, our method achieves high rendering quality on HRNVS only with LR inputs, outperforming state-of-the-art methods on challenging datasets such as Mip-NeRF 360 and Tanks & Temples. Related codes will be released upon acceptance. Introduces Super-Resolution 3D Gaussian Splatting (SRGS) for high-resolution novel view synthesis (HRNVS) using only low-resolution (LR) inputs. Addresses the limitations of existing 3D Gaussian Splatting (3DGS) methods, which struggle to render high-resolution details and textures when trained on LR images. Employs a two-pronged strategy: (1) Super-Resolution Gaussian Densification increases the density of Gaussian primitives in HR space through super-splatting and sub-pixel constraints. (2) Texture-Guided Gaussian Learning leverages a pre-trained 2D super-resolution model to guide Gaussian primitives in learning faithful texture features, while sub-pixel constraints ensure spatial consistency. Significantly improves rendering quality on HRNVS tasks compared to baseline 3DGS and other state-of-the-art methods. Achieves high PSNR, SSIM, and LPIPS scores on benchmark datasets, including Synthetic NeRF, Tanks & Temples, and Mip-NeRF 360. Effectively reconstructs fine-grained details and textures, even with large super-resolution factors (e.g., 4x and 8x). Reliance on a 2D super-resolution model, which might introduce limitations based on the model's capabilities. Future work could explore HRNVS without relying on a 2D super-resolution model. 3d gaussian splatting, super-resolution, novel view synthesis, texture synthesis, gaussian densification
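The two constraints above can be sketched as a loss that downsamples the HR rendering to match the LR ground-truth view (sub-pixel constraint) and compares the HR rendering against a 2D super-resolved reference (texture guidance). The weights, scale factor, and function name below are assumptions, not the paper's exact losses.

```python
# Loss sketch for super-resolution Gaussian splatting training in HR space.
import torch
import torch.nn.functional as F

def srgs_losses(hr_render, lr_gt, sr_reference, scale=4, texture_weight=0.1):
    """hr_render: (1, 3, H*scale, W*scale) splatting render in HR space,
    lr_gt: (1, 3, H, W) captured LR view, sr_reference: same size as hr_render."""
    downsampled = F.interpolate(hr_render, scale_factor=1.0 / scale,
                                mode="bilinear", align_corners=False)
    subpixel_loss = F.l1_loss(downsampled, lr_gt)        # sub-pixel cross-view constraint
    texture_loss = F.l1_loss(hr_render, sr_reference)    # guidance from a 2D SR model
    return subpixel_loss + texture_weight * texture_loss

loss = srgs_losses(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 64, 64),
                   torch.rand(1, 3, 256, 256))
```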
2404.10267 Report OneActor: Consistent Character Generation via Cluster-Conditioned Guidance Jiahao Wang, Caixia Yan, Haonan Lin, Weizhan Zhang Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature prevents artists from creating consistent images of the same character. Existing methods try to tackle this challenge and generate consistent content in various ways. However, they either depend on external data or require expensive tuning of the diffusion model. For this issue, we argue that a lightweight but intricate guidance is enough to function. Aiming at this, we lead the way to formalize the objective of consistent generation, derive a clustering-based score function and propose a novel paradigm, OneActor. We design a cluster-conditioned model which incorporates posterior samples to guide the denoising trajectories towards the target cluster. To overcome the overfitting challenge shared by one-shot tuning pipelines, we devise auxiliary components to simultaneously augment the tuning and regulate the inference. This technique is later verified to significantly enhance the content diversity of generated images. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory character consistency, superior prompt conformity as well as high image quality. And our method is at least 4 times faster than tuning-based baselines. Furthermore, to our best knowledge, we first prove that the semantic space has the same interpolation property as the latent space does. This property can serve as another promising tool for fine generation control. This paper proposes OneActor, a novel one-shot tuning paradigm for consistent character generation in text-to-image diffusion models, achieving faster and more efficient results compared to existing methods. Existing methods for consistent character generation in text-to-image synthesis either rely on external data or require time-consuming fine-tuning of the entire diffusion model, limiting their practicality. This paper addresses these limitations. The authors formalize consistent generation mathematically, derive a cluster-based score function, and introduce a cluster-conditioned model. They utilize semantic representations of target and auxiliary images to guide the denoising process towards a desired character cluster while maintaining diversity. OneActor achieves superior character consistency and prompt conformity compared to baseline methods, establishing a new Pareto front. The method maintains high image quality and diversity without compromising the original diffusion model's capabilities. OneActor significantly reduces tuning time, requiring only 3-8 minutes compared to 20-60 minutes for existing methods. The study primarily focuses on character-centric generation, and further research is needed to extend its applicability to other domains. Future work could explore alternative clustering methods or incorporate user feedback for improved control over character generation. text-to-image synthesis, diffusion models, consistent character generation, semantic control, one-shot tuning
2404.10157 Report Salient Object-Aware Background Generation using Text-Guided Diffusion Models Amir Erfan Eshratifar, Joao V. B. Soares, Kapil Thadani, Shaunak Mishra, Mikhail Kuznetsov, Yueh-Ning Ku, Paloma de Juan Generating background scenes for salient objects plays a crucial role across various domains including creative design and e-commerce, as it enhances the presentation and context of subjects by integrating them into tailored environments. Background generation can be framed as a task of text-conditioned outpainting, where the goal is to extend image content beyond a salient object's boundaries on a blank background. Although popular diffusion models for text-guided inpainting can also be used for outpainting by mask inversion, they are trained to fill in missing parts of an image rather than to place an object into a scene. Consequently, when used for background creation, inpainting models frequently extend the salient object's boundaries and thereby change the object's identity, which is a phenomenon we call "object expansion." This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task using Stable Diffusion and ControlNet architectures. We present a series of qualitative and quantitative results across models and datasets, including a newly proposed metric to measure object expansion that does not require any human labeling. Compared to Stable Diffusion 2.0 Inpainting, our proposed approach reduces object expansion by 3.6x on average with no degradation in standard visual metrics across multiple datasets. This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task, focusing on generating natural backgrounds while preserving the object's identity and boundaries. Salient object outpainting is crucial for applications like e-commerce and design, enabling personalized backgrounds and enhancing visual presentation. The proposed model leverages Stable Diffusion and ControlNet architectures. It utilizes a salient object mask as input to ControlNet, guiding the inpainting process to maintain object boundaries and prevent unwanted modifications. Reduces object expansion by 3.6x compared to Stable Diffusion 2.0 Inpainting. Achieves comparable or superior performance on standard visual metrics (FID, LPIPS). Demonstrates effectiveness across different datasets and types of text prompts. Reliance on synthetic captions for some training data may impact prompt alignment. Background diversity can be further improved. image outpainting, salient objects, diffusion models, controlnet, object expansion
2404.09995 Report Taming Latent Diffusion Model for Neural Radiance Field Inpainting Chieh Hubert Lin, Changil Kim, Jia-Bin Huang, Qinbo Li, Chih-Yao Ma, Johannes Kopf, Ming-Hsuan Yang, Hung-Yu Tseng Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Despite some recent work showing preliminary success in editing a reconstructed NeRF with diffusion prior, they remain struggling to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of synthetic contents from the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models on real data often yields a textural shift incoherent to the image condition due to auto-encoding errors. These two problems are further reinforced with the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. During the analyses, we also found the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes. Project page: https://hubert0527.github.io/MALD-NeRF This paper introduces MALD-NeRF, a novel framework for NeRF inpainting that utilizes latent diffusion models and masked adversarial training to generate high-quality novel views with realistic inpainted regions. Existing NeRF inpainting methods struggle to synthesize reasonable geometry and textures in completely uncovered regions due to the high diversity of synthetic content from diffusion models and textural inconsistencies with real data. The proposed method employs a per-scene customized latent diffusion model for 2D image inpainting and a masked adversarial training scheme during NeRF optimization to address 3D and textural inconsistencies. It also utilizes iterative dataset updates and partial DDIM for improved convergence. MALD-NeRF achieves state-of-the-art NeRF inpainting performance on SPIn-NeRF and LLFF datasets, outperforming existing methods in both qualitative and quantitative evaluations. The per-scene customization effectively guides the latent diffusion model to generate consistent and in-context contents across different viewpoints. The masked adversarial training scheme proves crucial in enhancing the visual quality and reducing texture discrepancies between the reconstructed and inpainted regions. The performance of MALD-NeRF can be unstable due to the inherent stochasticity of adversarial training. The method may not generalize well to low-shot NeRF reconstructions or scenarios with large inpainting masks. Future work could explore strategies to further reduce the blurriness of the inpainted textures compared to real-world textures. nerf, neural radiance fields, 3d inpainting, latent diffusion model, adversarial training
2404.09990 Report HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, Cihang Xie This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches relying on attribute guidance or human feedback on building datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs featuring input and output images with detailed text prompts, followed by precise alignment ensured through post-processing. In addition, we propose two evaluation metrics, Alignment and Coherence, to quantitatively assess the quality of image edit pairs using GPT-4V. HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models. For example, an HQ-Edit finetuned InstructPix2Pix can attain state-of-the-art image editing performance, even surpassing those models fine-tuned with human-annotated data. The project page is https://thefllood.github.io/HQEdit_web. Introduces HQ-Edit, a high-quality dataset for instruction-based image editing, featuring ~200,000 edits with high-resolution images and detailed prompts, generated using GPT-4V and DALL-E 3. Existing datasets for instruction-based image editing lack high-quality, high-resolution images paired with detailed editing instructions, hindering the training of robust editing models. A three-stage pipeline: 1) **Expansion:** Seed image/instruction triplets are expanded using GPT-4. 2) **Generation:** Expanded triplets are refined by GPT-4 into prompts for DALL-E 3 to generate image diptychs. 3) **Post-processing:** Diptychs are split, aligned, and instructions are refined with GPT-4V. HQ-Edit significantly outperforms existing datasets in Alignment and Coherence, two proposed metrics for evaluating image-edit instruction alignment. Fine-tuning InstructPix2Pix on HQ-Edit surpasses models trained on human-annotated datasets, demonstrating its high quality. The proposed Alignment metric shows stronger correlation with human preference than CLIP Directional Similarity. Reliance on DALL-E 3 API limits prompt control and potential diversity. Future work could explore user-interactive editing and generating edits across multiple images. image editing, generative models, dataset, gpt-4v, dall-e 3
2404.09977 Report MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M Patel Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation as well as spatially conditioned image generation. For most applications, we can train the model end-to-end with paired data to obtain photorealistic generation quality. However, to add an additional task, one often needs to retrain the model from scratch using paired data across all modalities to retain good generation performance. In this paper, we tackle this issue and propose a novel strategy to scale a generative model across new tasks with minimal compute. During our experiments, we discovered that the variance maps of intermediate feature maps of diffusion models capture the intensity of conditioning. Utilizing this prior information, we propose MaxFusion, an efficient strategy to scale up text-to-image generation models to accommodate new modality conditions. Specifically, we combine aligned features of multiple models, hence bringing a compositional effect. Our fusion strategy can be integrated into off-the-shelf models to enhance their generative prowess. Proposes MaxFusion, a training-free method to scale text-to-image diffusion models for multi-modal generation by leveraging variance maps of intermediate features to fuse information from single-task models. Addresses the limitations of retraining diffusion models for multi-modal generation, which is data-intensive and prone to catastrophic forgetting. Analyzes variance maps of diffusion models to estimate conditioning intensity and uses this information to fuse features from different models based on correlation and variance. Enables zero-shot multi-modal generation by combining information from models trained on separate tasks. Outperforms single-modal and multi-modal baselines (SPADE, PITI, T2I-Adapter, ControlNet) in qualitative and quantitative evaluations. Demonstrates scalability beyond two modalities and effectiveness with both spatial and style conditioning. Inherits limitations of Stable Diffusion (e.g., generating hands and faces) and may exhibit discrepancies with semantic maps. A trade-off between conditioning strength and sampling fidelity arises as the number of modalities increases. multimodal generation, diffusion models, text-to-image synthesis, zero-shot learning, feature fusion
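The fusion described above can be illustrated with per-pixel feature variance as a proxy for conditioning strength: aligned features from two single-condition models are averaged where they agree and otherwise the feature with the larger variance is kept. The shapes, correlation threshold, and exact merge rule below are assumptions rather than the paper's implementation.

```python
# Variance-guided fusion of aligned intermediate diffusion features from two models.
import torch
import torch.nn.functional as F

def fuse_features(feat_a, feat_b, corr_threshold=0.8):
    """feat_a, feat_b: (C, H, W) aligned intermediate features from two
    single-condition models at the same layer and timestep."""
    var_a = feat_a.var(dim=0)                              # per-pixel variance across channels
    var_b = feat_b.var(dim=0)
    corr = F.cosine_similarity(feat_a, feat_b, dim=0)      # per-pixel agreement
    avg = 0.5 * (feat_a + feat_b)
    pick_a = (var_a >= var_b).unsqueeze(0)
    strongest = torch.where(pick_a, feat_a, feat_b)        # keep the stronger conditioning
    return torch.where((corr > corr_threshold).unsqueeze(0), avg, strongest)

fused = fuse_features(torch.randn(320, 64, 64), torch.randn(320, 64, 64))
```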
2404.09976 Report Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M. Patel Recently, diffusion transformers have gained wide attention with their excellent performance in text-to-image and text-to-video models, emphasizing the need for transformers as backbone for diffusion models. Transformer-based models have shown better generalization capability compared to CNN-based models for general vision tasks. However, much less has been explored in the existing literature regarding the capabilities of transformer-based diffusion backbones and expanding their generative prowess to other datasets. This paper focuses on enabling a single pre-trained diffusion transformer model to scale across multiple datasets swiftly, allowing for the completion of diverse generative tasks using just one model. To this end, we propose DiffScaler, an efficient scaling strategy for diffusion models where we train a minimal amount of parameters to adapt to different tasks. In particular, we learn task-specific transformations at each layer by incorporating the ability to utilize the learned subspaces of the pre-trained model, as well as the ability to learn additional task-specific subspaces, which may be absent in the pre-training dataset. As these parameters are independent, a single diffusion model with these task-specific parameters can be used to perform multiple tasks simultaneously. Moreover, we find that transformer-based diffusion models significantly outperform CNN-based diffusion models while performing fine-tuning over smaller datasets. We perform experiments on four unconditional image generation datasets. We show that using our proposed method, a single pre-trained model can scale up to perform these conditional and unconditional tasks, respectively, with minimal parameter tuning while performing as close as fine-tuning an entire diffusion model for that particular task. This paper proposes DiffScaler, a novel scaling strategy for diffusion models that enables a single pre-trained model to be adapted to various image generation tasks and datasets with minimal parameter tuning. Current diffusion models are typically trained separately for each dataset or task, demanding significant computational resources and potentially leading to catastrophic forgetting when fine-tuned. DiffScaler addresses this by allowing a single model to handle diverse tasks effectively. DiffScaler introduces a lightweight module called 'Affiner' to each trainable layer of the diffusion model. Affiner learns task-specific transformations by scaling and shifting weights and biases of existing subspaces while also having the capability to learn additional task-specific subspaces. By training only these Affiner parameters, DiffScaler enables efficient adaptation to new tasks and datasets. DiffScaler achieves high-quality image generation across diverse datasets (FFHQ, Oxford Flowers, CUB-200, Caltech-101) and for various conditional generation tasks (using Canny edges, HED, depth, and segmentation maps). The method demonstrates superior performance compared to existing efficient fine-tuning techniques like DiffFit and LORA, particularly in high-resolution image generation. Experiments show that transformer-based diffusion backbones adapt better than CNN-based models to smaller datasets during parameter-efficient fine-tuning. The paper primarily focuses on image generation tasks, leaving exploration for other applications as future work. Potential misuse of the technology for generating harmful content needs careful consideration. diffusion models, transformers, parameter efficient finetuning, conditional image generation, unconditional image generation
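A per-task adapter in the spirit of the 'Affiner' described above can be sketched as a frozen linear layer whose output is rescaled and shifted, plus a small zero-initialized low-rank branch that supplies task-specific subspaces. The dimensions, rank, and class name are assumptions, not the paper's module.

```python
# Lightweight per-task adapter around a frozen pre-trained linear layer.
import torch
import torch.nn as nn

class Affiner(nn.Module):
    def __init__(self, frozen_linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = frozen_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                        # pre-trained weights stay frozen
        out_dim, in_dim = frozen_linear.out_features, frozen_linear.in_features
        self.scale = nn.Parameter(torch.ones(out_dim))     # reuse learned subspaces
        self.shift = nn.Parameter(torch.zeros(out_dim))
        self.down = nn.Linear(in_dim, rank, bias=False)    # extra task-specific subspace
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)                     # start as a no-op branch

    def forward(self, x):
        return self.scale * self.base(x) + self.shift + self.up(self.down(x))

layer = Affiner(nn.Linear(1152, 1152))
y = layer(torch.randn(2, 1152))   # only scale, shift, down, up are trainable per task
```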
2404.09884 Report Map-Relative Pose Regression for Visual Re-Localization Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann Pose regression networks predict the camera pose of a query image relative to a known environment. Within this family of methods, absolute pose regression (APR) has recently shown promising accuracy in the range of a few centimeters in position error. APR networks encode the scene geometry implicitly in their weights. To achieve high accuracy, they require vast amounts of training data that, realistically, can only be created using novel view synthesis in a days-long process. This process has to be repeated for each new scene again and again. We present a new approach to pose regression, map-relative pose regression (marepo), that satisfies the data hunger of the pose regression network in a scene-agnostic fashion. We condition the pose regressor on a scene-specific map representation such that its pose predictions are relative to the scene map. This allows us to train the pose regressor across hundreds of scenes to learn the generic relation between a scene-specific map representation and the camera pose. Our map-relative pose regressor can be applied to new map representations immediately or after mere minutes of fine-tuning for the highest accuracy. Our approach outperforms previous pose regression methods by far on two public datasets, indoor and outdoor. Code is available: https://nianticlabs.github.io/marepo This paper introduces marepo, a novel absolute pose regression (APR) approach for visual relocalization that achieves state-of-the-art accuracy by leveraging a scene-agnostic map-relative pose regressor conditioned on a scene-specific metric map representation. Existing APR methods often suffer from low accuracy due to limited training data and struggle to generalize to unseen scenes. marepo addresses these limitations by training a generic pose regressor on a large dataset of scene coordinates, allowing for fast adaptation to new scenes with high accuracy. marepo consists of two main components: (1) a scene-specific geometry prediction network that predicts 3D scene coordinates for an input image and (2) a scene-agnostic map-relative pose regressor that takes the predicted coordinates and estimates the camera pose. The pose regressor is trained on a large dataset of scene coordinates and can generalize to new scenes after a short fine-tuning step. marepo outperforms previous APR methods on the indoor 7-Scenes dataset and the outdoor Wayspots dataset, achieving accuracy comparable to structure-based methods. The method exhibits fast mapping times (minutes) compared to traditional APR approaches (hours or days). The proposed architecture, featuring a transformer-based regressor with dynamic positional encoding, effectively leverages 3D geometric information for accurate and robust pose estimation. The reliance on a separate scene-specific coordinate regression network introduces an additional training step, albeit a quick one. While the scene-agnostic nature of the pose regressor shows strong generalization, its performance may vary depending on the quality of the input scene coordinates. visual relocalization, pose regression, scene coordinate regression, transformers, deep learning
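The two-stage split above separates a scene-specific coordinate regressor from a scene-agnostic pose regressor. The interface-level sketch below uses tiny stand-in networks just to show the data flow (image -> per-pixel scene coordinates -> translation plus quaternion); it is not the paper's architecture.

```python
# Interface sketch: scene-specific coordinate head feeding a scene-agnostic pose head.
import torch
import torch.nn as nn
import torch.nn.functional as F

coord_head = nn.Conv2d(3, 3, 1)                   # stand-in: image -> per-pixel scene coords
pose_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(3, 128), nn.ReLU(),
                          nn.Linear(128, 7))      # stand-in: coords -> (tx, ty, tz, quaternion)

image = torch.rand(1, 3, 480, 640)
scene_coords = coord_head(image)                  # scene-specific, adapted per map
pose = pose_head(scene_coords)                    # scene-agnostic, trained across many scenes
translation, quaternion = pose[:, :3], F.normalize(pose[:, 3:], dim=-1)
```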
2404.09833 Report Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video Hongchi Xia, Zhi-Hao Lin, Wei-Chiu Ma, Shenlong Wang Creating high-quality and interactive virtual environments, such as games and simulators, often involves complex and costly manual modeling processes. In this paper, we present Video2Game, a novel approach that automatically converts videos of real-world scenes into realistic and interactive game environments. At the heart of our system are three core components:(i) a neural radiance fields (NeRF) module that effectively captures the geometry and visual appearance of the scene; (ii) a mesh module that distills the knowledge from NeRF for faster rendering; and (iii) a physics module that models the interactions and physical dynamics among the objects. By following the carefully designed pipeline, one can construct an interactable and actionable digital replica of the real world. We benchmark our system on both indoor and large-scale outdoor scenes. We show that we can not only produce highly-realistic renderings in real-time, but also build interactive games on top. Video2Game: a novel approach that automatically transforms real-world videos into interactive and realistic game environments. Creating realistic and interactive virtual environments is crucial for immersive experiences but traditionally involves complex and costly manual modeling. The system uses three core components: 1) a neural radiance fields (NeRF) module to capture scene geometry and appearance; 2) a mesh module for efficient rendering; 3) a physics module to model object interactions. The approach produces high-fidelity renderings in real-time, enabling interactive experiences. The system supports object-level interaction through scene decomposition and rigid-body physics. The generated environments are compatible with game engines and run smoothly in web browsers. The system currently doesn't model physics-informed relighting, such as simulating object's metallic properties. Creating unbounded, relightable scenes from single videos remains an open challenge for future work. neural rendering, nerf, video game development, physics simulation, interactive environments
2404.09732 Report Photo-Realistic Image Restoration in the Wild with Controlled Vision-Language Models Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön Though diffusion models have been successfully applied to various image restoration (IR) tasks, their performance is sensitive to the choice of training datasets. Typically, diffusion models trained in specific datasets fail to recover images that have out-of-distribution degradations. To address this problem, this work leverages a capable vision-language model and a synthetic degradation pipeline to learn image restoration in the wild (wild IR). More specifically, all low-quality images are simulated with a synthetic degradation pipeline that contains multiple common degradations such as blur, resize, noise, and JPEG compression. Then we introduce robust training for a degradation-aware CLIP model to extract enriched image content features to assist high-quality image restoration. Our base diffusion model is the image restoration SDE (IR-SDE). Built upon it, we further present a posterior sampling strategy for fast noise-free image generation. We evaluate our model on both synthetic and real-world degradation datasets. Moreover, experiments on the unified image restoration task illustrate that the proposed posterior sampling improves image generation quality for various degradations. This paper introduces a new method for photo-realistic image restoration in the wild using a degradation-aware CLIP model and a synthetic degradation pipeline. Existing diffusion models for image restoration are often trained on specific datasets and struggle to generalize to real-world images with complex and unknown degradations. This work aims to improve the robustness and generalization ability of these models. The authors propose a new synthetic image degradation pipeline with diverse degradations and a random shuffle strategy. They also introduce a robust degradation-aware CLIP (DACLIP) model that minimizes the embedding distance between low-quality and high-quality image pairs. Additionally, they present an optimal posterior sampling approach for the IR-SDE model to enhance image generation. The proposed method achieves state-of-the-art performance on both synthetic and real-world image restoration benchmarks. The introduced degradation pipeline effectively simulates complex real-world degradations, improving model generalization. The optimal posterior sampling strategy significantly enhances the performance of unified image restoration by improving the efficiency of the reverse diffusion process. The model's performance heavily relies on the quality of the synthetic degradation pipeline and its ability to represent real-world degradations. Further research is needed to explore the use of larger and more powerful vision-language models for improved guidance in image restoration. image restoration, diffusion models, vision-language models, clip, synthetic data
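The degradation-aware training above can be sketched as pulling the embedding of a synthetically degraded image toward its clean counterpart's embedding from a frozen encoder. The encoder and the trainable head below are stand-ins (not the CLIP tower or the paper's controller), and the cosine objective is an assumption about the alignment loss.

```python
# Sketch: align low-quality-image embeddings with high-quality-image embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

frozen_encoder = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                               nn.Linear(16, 512))        # stand-in for a frozen image encoder
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

controller = nn.Linear(512, 512)                          # trainable degradation-aware head

hq = torch.rand(4, 3, 224, 224)                           # clean images
lq = torch.rand(4, 3, 224, 224)                           # outputs of the degradation pipeline
with torch.no_grad():
    target = F.normalize(frozen_encoder(hq), dim=-1)
pred = F.normalize(controller(frozen_encoder(lq)), dim=-1)
content_loss = (1 - (pred * target).sum(dim=-1)).mean()   # cosine-distance alignment
```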
2404.09632 Report Bridging Vision and Language Spaces with Assignment Prediction Jungin Park, Jiyoung Lee, Kwanghoon Sohn This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assigning procedure as an optimal transport problem. We predict the assignment of one modality from the representation of another modality data, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved with visual data since the LLMs interpret and reason linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over the previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible. This paper introduces VLAP, a novel approach that bridges pretrained vision models and frozen large language models (LLMs) for visual understanding, using a single linear layer to map visual embeddings into the LLMs' word embedding space. Bridging the gap between independently pretrained vision and language models is crucial for efficient and general-purpose visual and language understanding without the high cost of training large multimodal models from scratch. VLAP utilizes an optimal transport-based assignment prediction objective. It assigns both visual and text representations to a set of word embeddings within pretrained LLMs, enforcing consistent assignments for paired multimodal data. This grounds the LLMs' word embedding space in visual data, allowing the LLMs to interpret visual inputs. VLAP achieves substantial improvements over previous linear transformation-based approaches on image captioning, significantly outperforming methods like LiMBeR. VLAP demonstrates strong performance on visual question answering, surpassing previous methods in both zero-shot and few-shot settings. VLAP excels in cross-modal retrieval tasks, achieving competitive results on image-and-text-to-text retrieval and outperforming prior works on text-to-image retrieval. While computationally efficient, VLAP still lags behind modular-based methods (e.g., Flamingo, BLIP-2) in terms of performance, potentially due to the limited capacity of a single linear layer and smaller training datasets. Future work could explore scaling VLAP with modular-based models and larger multimodal datasets to further enhance performance. vision-language models, large language models, optimal transport, zero-shot learning, cross-modal retrieval
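A minimal sketch of the assignment-prediction idea summarized above, assuming a SwAV-style Sinkhorn normalization and a swapped cross-entropy objective over a frozen word-embedding bank; the exact optimal-transport formulation, projection layers, and hyperparameters in VLAP may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    # scores: (batch, vocab) similarities between projected features and word embeddings.
    q = torch.exp(scores / eps).t()          # (vocab, batch)
    q /= q.sum()
    V, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= V   # normalize rows
        q /= q.sum(dim=0, keepdim=True); q /= B   # normalize columns
    return (q * B).t()                        # (batch, vocab); each row is a soft assignment

def assignment_prediction_loss(vis_feat, txt_feat, word_emb, temp=0.1):
    # vis_feat, txt_feat: (B, D) linearly projected features; word_emb: (V, D) frozen LLM word embeddings.
    vis, txt, emb = (F.normalize(x, dim=-1) for x in (vis_feat, txt_feat, word_emb))
    vis_scores, txt_scores = vis @ emb.t(), txt @ emb.t()
    q_vis, q_txt = sinkhorn(vis_scores), sinkhorn(txt_scores)     # assignment targets (no gradient)
    # Swapped prediction: each modality predicts the other's assignment, enforcing consistency.
    loss = -(q_txt * F.log_softmax(vis_scores / temp, dim=-1)).sum(dim=-1).mean()
    loss += -(q_vis * F.log_softmax(txt_scores / temp, dim=-1)).sum(dim=-1).mean()
    return loss / 2
```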
2404.09619 Report UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released. The paper introduces Unified Multi-modal Image Aesthetic Assessment (UNIAA), a framework designed to enhance and evaluate the visual aesthetic capabilities of Multi-modal Large Language Models (MLLMs). Existing Image Aesthetic Assessment (IAA) methods are limited to single datasets or tasks, hindering their universality. UNIAA aims to align with human aesthetic processes and integrate diverse aesthetic data for holistic image evaluation. The authors propose a novel IAA Datasets Conversion Paradigm (IDCP) to transform existing datasets into MLLM-compatible formats. They introduce UNIAA-LLaVA, an MLLM fine-tuned on converted aesthetic data, and UNIAA-Bench, a benchmark to evaluate aesthetic perception, description, and assessment abilities of MLLMs. UNIAA-LLaVA achieves superior performance compared to other MLLMs on UNIAA-Bench across aesthetic perception, description, and assessment tasks. IDCP effectively converts existing aesthetic datasets, leading to significant improvement in MLLMs' aesthetic capabilities. Despite progress, MLLMs still lag behind human experts in visual aesthetics, highlighting the need for further research. The converted IDCP dataset primarily comprises natural images, limiting the model's generalization to other image types like artistic works or AI-generated content. Evaluating aesthetic description remains subjective, and while a 5-round GPT-assisted protocol is used, potential hallucinations from GPT might affect evaluation accuracy. image aesthetics assessment, multi-modal large language model, instruction tuning, benchmarking, visual aesthetics
2404.09591 Report 3D Gaussian Splatting as Markov Chain Monte Carlo Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Weiwei Sun, Jeff Tseng, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi While 3D Gaussian Splatting has recently become popular for neural rendering, current methods rely on carefully engineered cloning and splitting strategies for placing Gaussians, which do not always generalize and may lead to poor-quality renderings. In addition, for real-world scenes, they rely on a good initial point cloud to perform well. In this work, we rethink 3D Gaussians as random samples drawn from an underlying probability distribution describing the physical representation of the scene -- in other words, Markov Chain Monte Carlo (MCMC) samples. Under this view, we show that the 3D Gaussian updates are strikingly similar to a Stochastic Gradient Langevin Dynamics (SGLD) update. As with MCMC, samples are nothing but past visit locations; adding new Gaussians under our framework can simply be realized without heuristics as placing Gaussians at existing Gaussian locations. To encourage using fewer Gaussians for efficiency, we introduce an L1-regularizer on the Gaussians. On various standard evaluation scenes, we show that our method provides improved rendering quality, easy control over the number of Gaussians, and robustness to initialization. Reformulates 3D Gaussian Splatting (3DGS) as a Markov Chain Monte Carlo (MCMC) sampling process using Stochastic Gradient Langevin Dynamics (SGLD), removing the reliance on heuristics for Gaussian placement. Current 3DGS methods rely on engineered heuristics for Gaussian placement, leading to suboptimal results and requiring careful tuning. This work aims to develop a more principled and robust approach. The authors reinterpret Gaussians as samples from a distribution representing the 3D scene. They then reformulate the 3DGS update rule as an SGLD update, enabling a more natural and theoretically grounded exploration of the scene. Achieves improved rendering quality compared to conventional 3DGS, especially with random Gaussian initialization. Demonstrates robustness to initialization, eliminating the need for a good initial point cloud. Provides easy control over the number of Gaussians used through L1 regularization on opacity and scale. The method's performance with a very limited number of Gaussians is not explored. Future work could investigate extensions to dynamic scenes. 3d gaussian splatting, neural rendering, markov chain monte carlo, stochastic gradient langevin dynamics, novel view synthesis
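A minimal sketch of an SGLD-style Gaussian update as summarized above: a gradient step plus injected Gaussian noise, with an L1 penalty on opacity and scale to encourage fewer Gaussians. The step size, noise scale, and where the noise is applied are illustrative assumptions rather than the paper's exact scheme.

```python
import torch

def sgld_style_step(params, render_loss, lr=1e-3, noise_scale=1e-3, l1_weight=1e-4):
    # params: dict of Gaussian attributes with requires_grad=True, e.g.
    # {"means": (N, 3), "opacity": (N, 1), "scales": (N, 3), ...}.
    # render_loss: the usual image reconstruction loss computed from these params.
    total = render_loss \
        + l1_weight * params["opacity"].abs().sum() \
        + l1_weight * params["scales"].abs().sum()        # L1 regularizer: prefer fewer/smaller Gaussians
    grads = torch.autograd.grad(total, list(params.values()))
    with torch.no_grad():
        for p, g in zip(params.values(), grads):
            p -= lr * g                                   # standard gradient-descent term
        # SGLD-style exploration noise on the positions; the paper modulates this noise
        # (e.g. by opacity and learning rate), a constant scale is used here for brevity.
        params["means"] += noise_scale * torch.randn_like(params["means"])
```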
2404.09570 Report The revenge of BiSeNet: Efficient Multi-Task Image Segmentation Gabriele Rosi, Claudia Cuttano, Niccolò Cavagnero, Giuseppe Averta, Fabio Cermelli Recent advancements in image segmentation have focused on enhancing the efficiency of the models to meet the demands of real-time applications, especially on edge devices. However, existing research has primarily concentrated on single-task settings, especially on semantic segmentation, leading to redundant efforts and specialized architectures for different tasks. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation, capable of handling various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which leverages the efficiency of two-stream semantic segmentation architectures and extends them into a mask classification framework. Our approach maintains the efficient spatial and context paths to capture detailed and semantic information, respectively, while leveraging an efficient transformer-based segmentation head that computes the binary masks and class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on popular datasets, Cityscapes and ADE20K, demonstrating impressive inference speeds while maintaining competitive accuracy compared to state-of-the-art architectures. Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks, bridging the gap between model efficiency and task adaptability. The paper proposes BiSeNetFormer, a novel, efficient architecture for multi-task image segmentation that leverages two-stream semantic segmentation and mask classification. Existing image segmentation models are either task-specific (limiting their application) or computationally intensive (hindering real-time performance). This work bridges this gap. BiSeNetFormer uses a spatial path for detailed information and a context path for semantic information. It then employs a transformer decoder and segmentation head to generate binary masks and class probabilities for each segment. BiSeNetFormer achieves impressive inference speeds (up to 100 FPS) while maintaining competitive accuracy compared to state-of-the-art models. It demonstrates strong performance on both semantic and panoptic segmentation tasks on Cityscapes and ADE20K datasets. The architecture exhibits remarkable adaptability across various hardware, including edge devices like Jetson ORIN. While BiSeNetFormer excels in speed, it shows a slight performance drop in panoptic segmentation on ADE20K, warranting further investigation and optimization. Future work will focus on refining BiSeNetFormer and exploring its application on additional tasks and datasets. image segmentation, multi-task learning, efficient architectures, real-time segmentation, mask classification
2404.09512 Report Magic Clothing: Controllable Garment-Driven Image Synthesis Weifeng Chen, Tao Gu, Yuhao Xu, Chengcai Chen We propose Magic Clothing, a latent diffusion model (LDM)-based network architecture for an unexplored garment-driven image synthesis task. Aiming at generating customized characters wearing the target garments with diverse text prompts, the image controllability is the most critical issue, i.e., to preserve the garment details and maintain faithfulness to the text prompts. To this end, we introduce a garment extractor to capture the detailed garment features, and employ self-attention fusion to incorporate them into the pretrained LDMs, ensuring that the garment details remain unchanged on the target character. Then, we leverage the joint classifier-free guidance to balance the control of garment features and text prompts over the generated results. Meanwhile, the proposed garment extractor is a plug-in module applicable to various finetuned LDMs, and it can be combined with other extensions like ControlNet and IP-Adapter to enhance the diversity and controllability of the generated characters. Furthermore, we design Matched-Points-LPIPS (MP-LPIPS), a robust metric for evaluating the consistency of the target image to the source garment. Extensive experiments demonstrate that our Magic Clothing achieves state-of-the-art results under various conditional controls for garment-driven image synthesis. Our source code is available at https://github.com/ShineChen1024/MagicClothing. Presents Magic Clothing, an LDM-based architecture for garment-driven image synthesis, enabling character generation with specific garments and text prompts. Addresses the unexplored task of garment-driven image synthesis, crucial for e-commerce and metaverse, with a focus on preserving garment details and text prompt fidelity. Introduces a garment extractor to capture detailed features, fused into pretrained LDMs via self-attention. Employs joint classifier-free guidance to balance garment and text control. Proposes MP-LPIPS metric for robust evaluation. Outperforms state-of-the-art subject-driven methods in preserving garment details and text prompt adherence. Demonstrates high controllability by seamlessly integrating with various finetuned LDMs and extensions like ControlNet and IP-Adapter. Proposes a robust metric MP-LPIPS for evaluating garment consistency while mitigating the influence of pose and background. Image quality relies on the base diffusion model, improvement possible with stronger pretrained models. Limited training data may hinder performance on complex garments, necessitating more comprehensive datasets. image synthesis, latent diffusion models, garment-driven, controllable generation, virtual try-on
2404.09504 Report Learning Tracking Representations from Single Point Annotations Qiangqiang Wu, Antoni B. Chan Existing deep trackers are typically trained with large-scale video frames with annotated bounding boxes. However, these bounding boxes are expensive and time-consuming to annotate, in particular for large-scale datasets. In this paper, we propose to learn tracking representations from single point annotations (i.e., 4.5x faster to annotate than the traditional bounding box) in a weakly supervised manner. Specifically, we propose a soft contrastive learning (SoCL) framework that incorporates a target objectness prior into end-to-end contrastive learning. Our SoCL consists of adaptive positive and negative sample generation, which is memory-efficient and effective for learning tracking representations. We apply the learned representation of SoCL to visual tracking and show that our method can 1) achieve better performance than the fully supervised baseline trained with box annotations under the same annotation time cost; 2) achieve comparable performance to the fully supervised baseline by using the same number of training frames and meanwhile reducing annotation time cost by 78% and total fees by 85%; 3) be robust to annotation noise. This paper proposes SoCL, a soft contrastive learning framework, to learn tracking representations from single point annotations instead of expensive bounding boxes. Bounding box annotations are expensive and time-consuming. Point annotations are significantly faster and cheaper to obtain, enabling efficient training of deep trackers. SoCL leverages a target objectness prior (TOP) map to generate soft templates and negative samples for contrastive learning. It uses global and local soft templates to represent the target, and generates hard negative samples from the background for discrimination. SoCL-Siam, using SoCL representations, achieves comparable performance to fully supervised baseline (Box-Siam) trained with bounding boxes on GOT-10k, while reducing annotation time by 78%. Under the same annotation time budget, SoCL-Siam consistently outperforms Box-Siam on various benchmarks. SoCL-TransT, trained with a hybrid annotation scheme using SoCL, achieves state-of-the-art performance with significantly lower annotation cost compared to other trackers. The impact of using a projection head in SoCL for different tracker architectures (Siamese vs. CF) is not fully understood. Future work could explore more sophisticated methods to generate pseudo bounding boxes from point annotations for improved scale estimation. visual tracking, weakly supervised learning, contrastive learning, point annotation, siamese tracker
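A minimal sketch of a soft contrastive term built from a target objectness prior (TOP) map: objectness-weighted pooling produces soft templates from two views of the same target, which are pulled together against background negatives. The TOP-map construction, the global/local template split, and the negative sampling strategy are simplified assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def socl_loss(feat_a, top_a, feat_b, top_b, neg_feats, temp=0.07):
    # feat_a, feat_b: (C, H, W) features of two frames (or views) containing the same target.
    # top_a, top_b:   (H, W) target objectness prior (TOP) maps derived from the point annotation.
    # neg_feats:      (N, C) background features used as (hard) negatives.
    def soft_template(feat, top):
        w = top.flatten()
        w = w / w.sum().clamp_min(1e-6)
        return (feat.flatten(1) * w).sum(dim=1)           # objectness-weighted pooling -> (C,)

    q = F.normalize(soft_template(feat_a, top_a), dim=0)  # query template
    k = F.normalize(soft_template(feat_b, top_b), dim=0)  # positive template from the other view
    neg = F.normalize(neg_feats, dim=1)
    logits = torch.cat([(q * k).sum().view(1), neg @ q]) / temp   # positive first, then negatives
    target = torch.zeros(1, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)   # InfoNCE-style objective
```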
2404.09502 Report SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, Chao Ma Vision-based perception for autonomous driving requires an explicit modeling of a 3D space, where 2D latent representations are mapped and subsequent 3D operators are applied. However, operating on dense latent spaces introduces a cubic time and space complexity, which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using projections like Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections result in information loss, especially for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. Firstly, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature pyramid and sparse interpolation enhance scales with information from others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction on FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIOU, which in part can be attributed to the sparse representation's ability to avoid hallucinations on empty voxels. This paper proposes SparseOcc, an efficient occupancy network for autonomous driving that leverages a lossless sparse latent representation, inspired by sparse point cloud processing, to reduce computational cost without sacrificing accuracy. Operating on dense 3D latent spaces for vision-based perception in autonomous driving is computationally expensive, and existing compression methods like BEV and TPV result in information loss. SparseOcc addresses this by using a sparse representation, enabling efficiency and potentially higher accuracy. SparseOcc utilizes a 3D sparse diffuser with spatially decomposed convolutional kernels for efficient latent completion. It incorporates a sparse feature pyramid with interpolation for multi-scale feature enhancement and employs a sparse transformer head for occupancy prediction, focusing on occupied voxels. SparseOcc achieves a 74.9% reduction in FLOPs compared to dense baselines on nuScenes-Occupancy. It outperforms state-of-the-art methods on nuScenes-Occupancy, achieving 21.8% IoU and 14.1% mIoU. The sparse representation naturally avoids hallucinations on empty voxels, potentially contributing to improved accuracy. The improvement in IoU with higher image resolution is limited due to potential hallucinations on empty voxels caused by dense features. Further investigation is needed to address the hallucination issue and explore the application of SparseOcc in dynamic scenarios. autonomous driving, occupancy prediction, sparse representation, 3d vision, deep learning
2404.09476 Report FreqMamba: Viewing Mamba from a Frequency Perspective for Image Deraining Zou Zhen, Yu Hu, Zhao Feng Images corrupted by rain streaks often lose vital frequency information for perception, and image deraining aims to solve this issue, which relies on global and local degradation modeling. Recent studies have witnessed the effectiveness and efficiency of Mamba for perceiving global and local information based on its exploiting of local correlation among patches; however, few attempts have been made to extend it with frequency analysis for image deraining, limiting its ability to perceive global degradation that is relevant to frequency modeling (e.g. Fourier transform). In this paper, we propose FreqMamba, an effective and efficient paradigm that leverages the complementarity between Mamba and frequency analysis for image deraining. The core of our method lies in extending Mamba with frequency analysis from two perspectives: extending it with frequency bands for exploiting frequency correlation, and connecting it with Fourier transform for global degradation modeling. Specifically, FreqMamba introduces complementary triple interaction structures including spatial Mamba, frequency band Mamba, and Fourier global modeling. Frequency band Mamba decomposes the image into sub-bands of different frequencies to allow 2D scanning from the frequency dimension. Furthermore, leveraging Mamba's unique data-dependent properties, we use rainy images at different scales to provide degradation priors to the network, thereby facilitating efficient training. Extensive experiments show that our method outperforms state-of-the-art methods both visually and quantitatively. Presents FreqMamba, a novel image deraining network that integrates spatial domain sequence modeling with frequency domain global modeling through a unique Frequency-SSM block and a multi-scale degradation prior attention mechanism. Image deraining is crucial for improving visual quality and the performance of computer vision tasks, and existing methods often struggle to effectively handle both global and local degradation caused by rain. FreqMamba employs a three-branch architecture: spatial Mamba for local detail extraction, frequency band Mamba for bridging spatial and frequency domains, and Fourier modeling for global degradation correction. It also uses Mamba's data-dependent property to generate attention maps from multi-scale input for guiding degradation-aware training. FreqMamba achieves state-of-the-art performance on benchmark datasets, outperforming existing methods in terms of both PSNR and SSIM. The method effectively removes rain streaks while preserving scene details and fidelity, as demonstrated by visual comparisons. FreqMamba demonstrates versatility and strong generalization ability by extending to other image restoration tasks like low-light image enhancement and real-world image dehazing. The current model primarily focuses on single-image deraining and could be extended to video deraining in future work. Exploring alternative frequency analysis techniques beyond Fourier and wavelet transforms might further enhance performance. image deraining, frequency analysis, state space model, deep learning, computer vision
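A minimal sketch of splitting an image into frequency sub-bands that a frequency-band branch could then scan. This illustrative version uses radial masks in the Fourier domain; the paper's actual decomposition and band boundaries may differ (the summary also mentions wavelet-style analysis).

```python
import torch

def frequency_bands(img, cutoffs=(0.15, 0.4)):
    # img: (B, C, H, W). Returns [low, mid, high] frequency sub-band images obtained with
    # radial masks in the Fourier domain. Cutoffs (as fractions of the maximum frequency)
    # are illustrative choices.
    B, C, H, W = img.shape
    fy = torch.fft.fftfreq(H, device=img.device).view(H, 1)
    fx = torch.fft.fftfreq(W, device=img.device).view(1, W)
    radius = torch.sqrt(fx ** 2 + fy ** 2) / (0.5 * 2 ** 0.5)   # normalized to roughly [0, 1]
    spec = torch.fft.fft2(img)
    bands, lo = [], 0.0
    for cut in (*cutoffs, 1.01):
        mask = ((radius >= lo) & (radius < cut)).to(img.dtype)
        bands.append(torch.fft.ifft2(spec * mask).real)         # sub-band image for this ring
        lo = cut
    return bands   # each element has shape (B, C, H, W)
```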
2404.09469 Report Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation? Dmitry Ignatov, Andrey Ignatov, Radu Timofte We present ANYU, a new virtually augmented version of the NYU depth v2 dataset, designed for monocular depth estimation. In contrast to the well-known approach where full 3D scenes of a virtual world are utilized to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects into the original NYU depth v2 images. We specifically did not match each generated virtual object with an appropriate texture and a suitable location within the real-world image. Instead, an assignment of texture, location, lighting, and other rendering parameters was randomized to maximize a diversity of the training data, and to show that it is randomness that can improve the generalizing ability of a dataset. By conducting extensive experiments with our virtually modified dataset and validating on the original NYU depth v2 and iBims-1 benchmarks, we show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures, especially for the current state-of-the-art VPD model. To the best of our knowledge, this is the first work that augments a real-world dataset with randomly generated virtual 3D objects for monocular depth estimation. We make our ANYU dataset publicly available in two training configurations with 10% and 100% additional synthetically enriched RGB-D pairs of training images, respectively, for efficient training and empirical exploration of virtual augmentation at https://github.com/ABrain-One/ANYU This paper introduces ANYU, a virtually augmented version of the NYU Depth V2 dataset designed for monocular depth estimation. ANYU is created by incorporating randomly generated virtual 3D objects into real-world images, enhancing training data diversity without relying solely on full virtual scenes. The standard NYU Depth V2 dataset, despite its popularity, suffers from depth map inaccuracies and limited training data diversity. ANYU aims to address these limitations by introducing virtual objects, leading to more accurate depth values and improved model generalization. ANYU leverages a game engine to generate virtual 3D objects. These objects are randomly assigned textures, placed randomly within real NYU Depth V2 images, and rendered with varying lighting and shadow parameters, maximizing data diversity. Augmenting the NYU Depth V2 dataset with ANYU consistently improves depth prediction metrics for different model architectures, including the state-of-the-art VPD model. The benefits of ANYU are particularly pronounced when training data is limited, highlighting the importance of diversity in smaller datasets. Models trained on ANYU exhibit improved generalization, as demonstrated by cross-dataset validation on the iBims-1 benchmark. The rendering quality of virtual objects might not fully match real-world objects, potentially limiting performance gains at very high augmentation levels. Future work could explore more sophisticated methods for integrating virtual objects, such as aligning them with the scene context or using more realistic rendering techniques. monocular depth estimation, data augmentation, virtual reality, dataset, nyu depth v2
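A minimal sketch of the "randomize everything" idea for one virtual object in the entry above: texture, placement, and lighting parameters are drawn at random rather than matched to the scene. Field names and ranges are illustrative; the actual object generation and rendering happen in a game engine.

```python
import random

TEXTURES = ["wood", "marble", "fabric", "metal", "checker"]   # placeholder texture ids

def sample_virtual_object(image_w: int, image_h: int) -> dict:
    # Randomized assignment of texture, location, scale, and lighting for one virtual
    # 3D object, mirroring the "randomness over matching" idea of ANYU.
    return {
        "texture": random.choice(TEXTURES),
        "center_px": (random.uniform(0, image_w), random.uniform(0, image_h)),
        "depth_m": random.uniform(0.5, 8.0),          # indoor-like depth range (assumption)
        "scale": random.uniform(0.1, 1.0),
        "yaw_deg": random.uniform(0, 360),
        "light_intensity": random.uniform(0.2, 2.0),
        "cast_shadow": random.random() < 0.5,
    }
```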
2404.09465 Report PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI Yandan Yang, Baoxiong Jia, Peiyuan Zhi, Siyuan Huang With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: http://physcene.github.io. Introduces PhyScene, a guided diffusion model for creating interactive 3D scenes with realistic layouts, articulated objects, and strong adherence to physical constraints. Addresses the growing need in Embodied AI for large-scale, physically plausible, and interactive 3D scenes that go beyond visual realism, enabling agents to learn diverse skills within simulated environments. Leverages a conditional diffusion model guided by three novel functions ensuring: 1) collision avoidance between objects, 2) adherence to room layouts, and 3) object reachability for embodied agents. Achieves state-of-the-art results on traditional scene synthesis metrics (FID, KID, etc.) while significantly improving physical plausibility. Significantly outperforms existing methods in generating interactive scenes with reduced object collisions and improved object reachability. Demonstrates the ability to effectively integrate articulated objects into scenes, further enhancing interactivity. Currently restricted to a limited number of room types due to data constraints. Lacks the inclusion of small objects, posing challenges for tasks involving fine-grained manipulation. scene synthesis, embodied ai, diffusion models, physical plausibility, interactive environments
2404.09458 Report CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting Xiangrui Liu, Xinju Wu, Pingping Zhang, Shiqi Wang, Zhu Li, Sam Kwong Gaussian splatting, renowned for its exceptional rendering quality and efficiency, has emerged as a prominent technique in 3D scene representation. However, the substantial data volume of Gaussian splatting impedes its practical utility in real-world applications. Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. To ensure the compactness of Gaussian primitives, we devise a hybrid primitive structure that captures predictive relationships between each other. Then, we exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms. Moreover, we develop a rate-constrained optimization scheme to eliminate redundancies within such hybrid primitives, steering our CompGS towards an optimal trade-off between bitrate consumption and representation efficacy. Experimental results show that the proposed CompGS significantly outperforms existing methods, achieving superior compactness in 3D scene representation without compromising model accuracy and rendering quality. Our code will be released on GitHub for further research. This paper introduces CompGS, a novel 3D scene representation method using compressed Gaussian splatting, achieving efficient representation with significantly reduced data size. Gaussian splatting suffers from large data volumes, hindering its practicality. Existing compression methods overlook inherent redundancies in Gaussian primitives, leading to suboptimal compression efficiency. CompGS leverages a hybrid primitive structure with anchor primitives to predict coupled primitives' attributes, enabling compact residual representations. It also employs rate-constrained optimization, minimizing rendering distortion and bitrate costs for optimal compactness. CompGS achieves up to 110x compression ratio on popular datasets without sacrificing rendering quality. The hybrid primitive structure significantly reduces bitstream size by exploiting inter-primitive redundancies. Rate-constrained optimization further enhances compactness by learning efficient primitive representations. Training time is slightly longer than some existing methods due to joint optimization of primitives and neural networks. The impact of varying the number of coupled primitives per anchor on different scenes requires further investigation. 3d scene representation, gaussian splatting, compression, rate-distortion optimization, hybrid primitive structure
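A minimal sketch of the hybrid primitive idea summarized above: a small set of anchor primitives carries full attributes, while each coupled primitive is stored only as a compact residual relative to a prediction from its anchor. The real method predicts coupled attributes with a small network and entropy-codes quantized residuals; both are simplified away here, and the shapes are illustrative.

```python
import torch

def reconstruct_coupled(anchor_attrs, residuals, anchor_index):
    # anchor_attrs:  (A, D) full attributes of anchor primitives.
    # residuals:     (N, D) compact residuals stored for the coupled primitives.
    # anchor_index:  (N,)   index of the anchor each coupled primitive is predicted from.
    # Here the "prediction" from the anchor is simply the anchor attribute itself.
    return anchor_attrs[anchor_index] + residuals

# Illustrative usage: 1k anchors predicting 50k coupled primitives whose packed
# attribute vector (position, rotation, scale, opacity, color) has dimension 59.
anchors = torch.randn(1_000, 59)
res = 0.01 * torch.randn(50_000, 59)              # residuals are small, hence cheap to encode
idx = torch.randint(0, 1_000, (50_000,))
coupled = reconstruct_coupled(anchors, res, idx)   # (50_000, 59) reconstructed attributes
```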
2404.09447 Report kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr Rapid advancements in continual segmentation have yet to bridge the gap of scaling to large continually expanding vocabularies under compute-constrained scenarios. We discover that traditional continual training leads to catastrophic forgetting under compute constraints, unable to outperform zero-shot segmentation methods. We introduce a novel strategy for semantic and panoptic segmentation with zero forgetting, capable of adapting to continually growing vocabularies without the need for retraining or large memory costs. Our training-free approach, kNN-CLIP, leverages a database of instance embeddings to enable open-vocabulary segmentation approaches to continually expand their vocabulary on any given domain with a single-pass through data, while only storing embeddings minimizing both compute and memory costs. This method achieves state-of-the-art mIoU performance across large-vocabulary semantic and panoptic segmentation datasets. We hope kNN-CLIP represents a step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods. This paper introduces kNN-CLIP, a training-free method for continual vocabulary expansion in semantic and panoptic segmentation. It utilizes a retrieval database of instance embeddings, enabling the adaptation to new concepts without retraining or large memory costs. Existing continual segmentation methods struggle to scale to large vocabularies and often suffer from catastrophic forgetting under compute constraints, limiting their practical applicability. kNN-CLIP leverages a database of instance embeddings generated using a pre-trained DINOv2 model. At inference, query mask embeddings are matched against the database, and retrieved information augments the base model's predictions, enhancing performance on novel concepts. kNN-CLIP achieves state-of-the-art mIoU performance across large-vocabulary semantic segmentation datasets (A-847, PC-459, A-150). The method demonstrates significant improvements in panoptic segmentation on ADE20K and COCO Panoptic datasets. kNN-CLIP effectively addresses catastrophic forgetting, outperforming traditional continual learning approaches while maintaining efficiency. The reliance on brute-force kNN search can lead to slower inference times for large databases. Future work can explore approximate nearest neighbor search methods to balance speed and accuracy. continual learning, open-vocabulary segmentation, semantic segmentation, panoptic segmentation, image retrieval
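A minimal sketch of the retrieval step described above: instance embeddings and their labels are stored once, each query mask embedding retrieves its k nearest neighbors by cosine similarity, and the resulting class votes are blended with the base open-vocabulary scores. The blending rule, the value of k, and the brute-force search are assumptions.

```python
import torch
import torch.nn.functional as F

class EmbeddingDB:
    def __init__(self, embeddings: torch.Tensor, labels: torch.Tensor):
        # embeddings: (N, D) instance embeddings collected in a single pass over the data
        # (e.g. from a frozen DINOv2); labels: (N,) integer class ids.
        self.emb = F.normalize(embeddings, dim=1)
        self.labels = labels.long()

    def knn_scores(self, query: torch.Tensor, num_classes: int, k: int = 16) -> torch.Tensor:
        # query: (Q, D) mask embeddings; returns (Q, num_classes) similarity-weighted votes.
        q = F.normalize(query, dim=1)
        sim, idx = (q @ self.emb.t()).topk(k, dim=1)          # brute-force cosine kNN
        votes = torch.zeros(query.size(0), num_classes, device=query.device)
        votes.scatter_add_(1, self.labels[idx], sim.clamp_min(0))
        return F.softmax(votes, dim=1)

def augment_predictions(base_probs, knn_probs, alpha=0.5):
    # Blend the base open-vocabulary class scores with the retrieval votes; alpha is a free knob.
    return (1 - alpha) * base_probs + alpha * knn_probs
```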
2404.09426 Report ViFu: Multiple 360° Objects Reconstruction with Clean Background via Visible Part Fusion Tianhan Xu, Takuya Ikeda, Koichi Nishiwaki In this paper, we propose a method to segment and recover a static, clean background and multiple 360° objects from observations of scenes at different timestamps. Recent works have used neural radiance fields to model 3D scenes and improved the quality of novel view synthesis, while few studies have focused on modeling the invisible or occluded parts of the training images. These under-reconstruction parts constrain both scene editing and rendering view selection, thereby limiting their utility for synthetic data generation for downstream tasks. Our basic idea is that, by observing the same set of objects in various arrangements, parts that are invisible in one scene may become visible in others. By fusing the visible parts from each scene, occlusion-free rendering of both background and foreground objects can be achieved. We decompose the multi-scene fusion task into two main components: (1) objects/background segmentation and alignment, where we leverage point cloud-based methods tailored to our novel problem formulation; (2) radiance fields fusion, where we introduce a visibility field to quantify the visible information of radiance fields, and propose visibility-aware rendering for the fusion of a series of scenes, ultimately obtaining clean background and 360° object rendering. Comprehensive experiments were conducted on synthetic and real datasets, and the results demonstrate the effectiveness of our method. This paper presents ViFu, a method to recover clean backgrounds and 360° foreground objects from multi-timestamp scene observations, addressing the issue of unseen part reconstruction in NeRF. This is important for generating high-quality synthetic data for robot learning tasks, such as pose estimation and object detection, as it allows for diverse object placement and occlusion-free rendering. The method uses point cloud registration for object/background alignment, introduces a novel "visibility field" to quantify visibility in radiance fields, and proposes "visibility-aware rendering" for fusing visible parts from different scenes. ViFu can automatically segment backgrounds and recover clean, 360° renderings of multiple objects. The proposed visibility field effectively quantifies visibility in 3D scenes. Experiments on synthetic and real datasets demonstrate the effectiveness of the method. The method doesn't explicitly consider lighting conditions, which may affect rendering quality under extreme lighting. It relies on accurate scene segmentation and pose alignment, which may be challenging for closely placed objects or simple shapes. neural radiance fields, 3d reconstruction, scene segmentation, visibility field, synthetic data generation
2404.09412 Report DeferredGS: Decoupled and Editable Gaussian Splatting with Deferred Shading Tong Wu, Jia-Mu Sun, Yu-Kun Lai, Yuewen Ma, Leif Kobbelt, Lin Gao Reconstructing and editing 3D objects and scenes both play crucial roles in computer graphics and computer vision. Neural radiance fields (NeRFs) can achieve realistic reconstruction and editing results but suffer from inefficiency in rendering. Gaussian splatting significantly accelerates rendering by rasterizing Gaussian ellipsoids. However, Gaussian splatting utilizes a single Spherical Harmonic (SH) function to model both texture and lighting, limiting independent editing capabilities of these components. Recently, attempts have been made to decouple texture and lighting with the Gaussian splatting representation but may fail to produce plausible geometry and decomposition results on reflective scenes. Additionally, the forward shading technique they employ introduces noticeable blending artifacts during relighting, as the geometry attributes of Gaussians are optimized under the original illumination and may not be suitable for novel lighting conditions. To address these issues, we introduce DeferredGS, a method for decoupling and editing the Gaussian splatting representation using deferred shading. To achieve successful decoupling, we model the illumination with a learnable environment map and define additional attributes such as texture parameters and normal direction on Gaussians, where the normal is distilled from a jointly trained signed distance function. More importantly, we apply deferred shading, resulting in more realistic relighting effects compared to previous methods. Both qualitative and quantitative experiments demonstrate the superior performance of DeferredGS in novel view synthesis and editing tasks. DeferredGS is a novel method that introduces a decoupled and editable Gaussian Splatting representation using deferred shading, enabling independent editing of geometry, texture, and lighting. Existing Gaussian Splatting methods struggle with independent texture and lighting editing and suffer from blending artifacts during relighting. DeferredGS addresses these limitations, enhancing editing capabilities and relighting quality. The method uses a normal distillation module to enhance geometry reconstruction by leveraging an SDF network. It employs deferred shading for realistic relighting effects, rasterizing geometry and texture buffers before shading calculation at the pixel level. DeferredGS shows superior novel view synthesis quality compared to previous methods, particularly on challenging scenes with reflections. It enables more faithful decomposition of geometry, texture, and lighting, evident in the high-quality normal maps and diffuse albedo estimations. DeferredGS demonstrates improved relighting quality by mitigating blending artifacts common in previous Gaussian Splatting methods that use forward shading. The method exhibits limitations in scenes with strong shadows, where shadows might be baked into the texture. Texture editing can introduce noise due to the global nature of Gaussian Splatting representation. gaussian splatting, inverse rendering, deferred shading, 3d scene reconstruction, scene editing
2404.09401 Report Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models Peifei Zhu, Tsubasa Takahashi, Hirokatsu Kataoka Diffusion Models (DMs) have shown remarkable capabilities in various image-generation tasks. However, there are growing concerns that DMs could be used to imitate unauthorized creations and thus raise copyright issues. To address this issue, we propose a novel framework that embeds personal watermarks in the generation of adversarial examples. Such examples can force DMs to generate images with visible watermarks and prevent DMs from imitating unauthorized images. We construct a generator based on conditional adversarial networks and design three losses (adversarial loss, GAN loss, and perturbation loss) to generate adversarial examples that have subtle perturbation but can effectively attack DMs to prevent copyright violations. Training a generator for a personal watermark by our method only requires 5-10 samples within 2-3 minutes, and once the generator is trained, it can generate adversarial examples with that watermark significantly fast (0.2s per image). We conduct extensive experiments in various conditional image-generation scenarios. Compared to existing methods that generate images with chaotic textures, our method adds visible watermarks on the generated images, which is a more straightforward way to indicate copyright violations. We also observe that our adversarial examples exhibit good transferability across unknown generative models. Therefore, this work provides a simple yet powerful way to protect copyright from DM-based imitation. This paper introduces a novel method for embedding personal watermarks into adversarial examples to prevent copyright infringement by diffusion models (DMs). The widespread use of DMs raises concerns about copyright violations as they can be used to imitate unauthorized creations, potentially leading to illegal revenue generation. The authors propose a conditional GAN architecture with a generator, discriminator, and a target DM. They design three losses: a GAN loss for image quality, a perturbation loss to control perturbation visibility, and an adversarial loss to target the latent space of LDMs for improved attack transferability. The method effectively embeds visible watermarks in images generated by DMs, hindering unauthorized imitation and providing a clear indication of copyright violation. The generation process is significantly faster (0.2s per image) than existing iterative optimization methods. The generated adversarial examples exhibit good transferability across various DMs and image generation scenarios, including textual inversion and DreamBooth. There is a trade-off between watermark visibility and adversarial example quality. Further investigation is needed on the effectiveness against more advanced defenses. copyright protection, diffusion models, adversarial examples, watermarking, generative models
2404.09326 Report Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers Diana-Nicoleta Grigore, Mariana-Iuliana Georgescu, Jon Alvarez Justo, Tor Johansen, Andreea Iuliana Ionescu, Radu Tudor Ionescu Few-shot knowledge distillation recently emerged as a viable approach to harness the knowledge of large-scale pre-trained models, using limited data and computational resources. In this paper, we propose a novel few-shot feature distillation approach for vision transformers. Our approach is based on two key steps. Leveraging the fact that vision transformers have a consistent depth-wise structure, we first copy the weights from intermittent layers of existing pre-trained vision transformers (teachers) into shallower architectures (students), where the intermittence factor controls the complexity of the student transformer with respect to its teacher. Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario, aiming to recover the information processing carried out by the skipped teacher layers. We present comprehensive experiments with supervised and self-supervised transformers as teachers, on five data sets from various domains, including natural, medical and satellite images. The empirical results confirm the superiority of our approach over competitive baselines. Moreover, the ablation results demonstrate the usefulness of each component of the proposed pipeline. This paper introduces WeCoLoRA, a novel few-shot unsupervised feature distillation method for vision transformers, which combines intermittent weight copying from a teacher model with an enhanced low-rank adaptation (LoRA) approach. Training large-scale vision transformers demands extensive computational resources and data. WeCoLoRA addresses this by enabling efficient learning from limited data, making it valuable for resource-constrained environments and domains with data scarcity. WeCoLoRA operates in two steps: 1) intermittently copying weights from a pre-trained teacher transformer to a smaller student, 2) using an enhanced LoRA, applied to all components of the transformer block, to distill knowledge from the teacher to the student in a few-shot setting. WeCoLoRA outperforms state-of-the-art few-shot knowledge distillation methods, including DeiT and DMAE, achieving significantly higher accuracy on benchmarks like ImageNet. The method demonstrates robustness across different compression ratios, teacher models (both supervised and self-supervised), and varying sizes of training data. Visualization of the learned feature space reveals that WeCoLoRA produces more discriminative and robust embeddings compared to baseline approaches. The current design of WeCoLoRA, specifically the weight copying mechanism, limits its applicability to architectures with consistent configurations across layers, like transformers and ResNets. Future work will focus on generalizing the weight copying mechanism using adaptor blocks to extend the method’s compatibility with a wider range of model architectures. knowledge distillation, low rank adaptation, vision transformers, few-shot learning, unsupervised learning
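A minimal sketch of the two steps described above, assuming a timm-style ViT whose blocks live in an nn.ModuleList: every stride-th teacher block is copied into a shallower student, and every linear layer inside the copied blocks is wrapped with a LoRA adapter so that only the low-rank matrices are trained during distillation. The paper's "enhanced" LoRA details are not reproduced here.

```python
import copy
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax).
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

def add_lora(block: nn.Module, r: int = 8):
    # Wrap every nn.Linear inside the block (attention and MLP projections) with LoRA.
    targets = [(n, m) for n, m in block.named_modules() if isinstance(m, nn.Linear)]
    for name, module in targets:
        parent = block.get_submodule(name.rsplit(".", 1)[0]) if "." in name else block
        setattr(parent, name.rsplit(".", 1)[-1], LoRALinear(module, r=r))

def build_student(teacher_blocks: nn.ModuleList, stride: int = 2, r: int = 8) -> nn.ModuleList:
    # Step 1 (weight copy): keep every `stride`-th teacher block in a shallower student.
    student = nn.ModuleList(copy.deepcopy(teacher_blocks[i]) for i in range(0, len(teacher_blocks), stride))
    for p in student.parameters():
        p.requires_grad = False           # copied weights stay frozen
    # Step 2 (LoRA): only the low-rank A/B matrices are optimized with the distillation loss.
    for block in student:
        add_lora(block, r=r)
    return student
```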
2404.09227 Report DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling Xuening Yuan, Hongyu Yang, Yueming Zhao, Di Huang Recent progress in text-to-3D creation has been propelled by integrating the potent prior of Diffusion Models from text-to-image generation into the 3D domain. Nevertheless, generating 3D scenes characterized by multiple instances and intricate arrangements remains challenging. In this study, we present DreamScape, a method for creating highly consistent 3D scenes solely from textual descriptions, leveraging the strong 3D representation capabilities of Gaussian Splatting and the complex arrangement abilities of large language models (LLMs). Our approach involves a 3D Gaussian Guide (3DG^2) for scene representation, consisting of semantic primitives (objects) and their spatial transformations and relationships derived directly from text prompts using LLMs. This compositional representation allows for local-to-global optimization of the entire scene. A progressive scale control is tailored during local object generation, ensuring that objects of different sizes and densities adapt to the scene, which addresses the training instability issue arising from simple blending in the subsequent global optimization stage. To mitigate potential biases of LLM priors, we model collision relationships between objects at the global level, enhancing physical correctness and overall realism. Additionally, to generate pervasive objects like rain and snow distributed extensively across the scene, we introduce a sparse initialization and densification strategy. Experiments demonstrate that DreamScape offers high usability and controllability, enabling the generation of high-fidelity 3D scenes from only text prompts and achieving state-of-the-art performance compared to other methods. This paper presents DreamScape, a method that creates highly consistent 3D scenes directly from text by combining the 3D representation power of Gaussian Splatting with the scene arrangement abilities of LLMs. Generating 3D scenes with multiple instances and intricate arrangements remains challenging for current text-to-3D methods, despite strong diffusion-based priors. A 3D Gaussian Guide (3DG^2) derived from the text prompt via LLMs encodes semantic primitives (objects) together with their spatial transformations and relationships, enabling local-to-global optimization; progressive scale control stabilizes local object generation, collision relationships between objects are modeled at the global level for physical correctness, and a sparse initialization and densification strategy handles pervasive objects such as rain and snow. DreamScape enables the generation of high-fidelity 3D scenes from text prompts alone. It offers high usability and controllability over the composed scene. It achieves state-of-the-art performance compared to other methods. Scene layout quality depends on LLM priors, whose potential biases the global collision modeling is designed to mitigate. text-to-3d, gaussian splatting, 3d scene generation, large language models, diffusion models
2404.09216 Report DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, Dan Xu Existing open-vocabulary object detectors typically require a predefined set of categories from users, significantly confining their application scenarios. In this paper, we introduce DetCLIPv3, a high-performing detector that excels not only at open-vocabulary object detection but also at generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1. Versatile model architecture: we derive a robust open-set detection framework which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline leveraging a visual large language model to refine captions for large-scale image-text pairs, providing rich, multi-granular object labels to enhance the training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these effective designs, DetCLIPv3 demonstrates superior open-vocabulary detection performance, e.g., our Swin-T backbone model achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in the dense captioning task on the VG dataset, showcasing its strong generative capability. This paper introduces DetCLIPv3, an open-vocabulary object detector that can also generate hierarchical labels for detected objects. Existing open-vocabulary object detectors rely on predefined categories, limiting their real-world applicability. DetCLIPv3 overcomes this by generating object labels even without category input, allowing for richer interpretation of visual content. DetCLIPv3 leverages a versatile architecture with an open-vocabulary detector and an object captioner. It utilizes an auto-annotation pipeline with VLLMs to create a large-scale dataset (GranuCap50M) with rich object labels. A multi-stage training strategy (low-resolution pretraining and high-resolution fine-tuning) ensures efficient learning from massive image-text pairs. Achieves 47.0 zero-shot fixed AP on LVIS minival, outperforming prior methods like GLIPv2 and DetCLIPv2. Achieves state-of-the-art 19.7 AP in dense captioning on VG, showcasing its strong generative capability. Shows superior domain generalization, with Swin-L model achieving 48.8 AP on COCO-O, surpassing its COCO performance. Evaluation of generative capability is limited by existing benchmarks. Current model lacks instruction-controlled detection. open-vocabulary object detection, generative detection, hierarchical object labels, auto-annotation pipeline, vision-language models
2404.09204 Report TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited for document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, a MLLM that is specifically designed for document-oriented tasks, while preserving the general capabilities of MLLMs. TextHawk is aimed to explore efficient fine-grained perception by designing four dedicated components. Firstly, a ReSampling and ReArrangement (ReSA) module is proposed to reduce the redundancy in the document texts and lower the computational cost of the MLLM. We explore encoding the positions of each local feature by presenting Scalable Positional Embeddings (SPEs), which can preserve the scalability of various image sizes. A Query Proposal Network (QPN) is then adopted to initialize the queries dynamically among different sub-images. To further enhance the fine-grained visual perceptual ability of the MLLM, we design a Multi-Level Cross-Attention (MLCA) mechanism that captures the hierarchical structure and semantic relations of document images. Furthermore, we create a new instruction-tuning dataset for document-oriented tasks by enriching the multimodal document data with Gemini Pro. We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms the state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities. This paper introduces TextHawk, a novel Multimodal Large Language Model (MLLM) specifically designed to address the challenges of document-oriented tasks while retaining strong general vision-language capabilities. Document images, with their high resolution and information density, pose significant challenges for MLLMs, necessitating improved fine-grained visual perception and efficient information compression. TextHawk incorporates several key components: a ReSampling and ReArrangement (ReSA) module for information compression, Scalable Positional Embeddings (SPEs) for sub-image representation, a Query Proposal Network (QPN) for dynamic query generation, a Multi-Level Cross-Attention (MLCA) mechanism for enhanced perception, and a new instruction-tuning dataset enriched with Gemini Pro. TextHawk outperforms state-of-the-art methods on both document-oriented and general MLLM benchmarks. The model excels in fine-grained tasks like document understanding and referring expression comprehension. TextHawk achieves a good balance between general vision-language tasks and specialized document-oriented tasks. The visual encoder in TextHawk is frozen during training, potentially limiting its adaptability to new visual data. Future work will focus on training the vision encoder to further enhance perception capabilities. multimodal large language models, document understanding, visual question answering, fine-grained visual perception, information compression
2404.09172 Report LoopAnimate: Loopable Salient Object Animation Fanyi Wang, Peng Liu, Haotian Hu, Dan Meng, Jingwen Su, Jinjin Xu, Yanhao Zhang, Xiaoming Ren, Zhiwang Zhang Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains like animated wallpapers require seamless looping, where the first and last frames of the video match seamlessly. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require the entire video to be input during training to encode temporal and positional information at once. However, due to limitations in GPU memory, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy with progressively increasing frame numbers and reducing fine-tuning modules. Additionally, we introduce the Temporal Enhanced Motion Module (TEMM) to extend the capacity for encoding temporal and positional information up to 36 frames. The proposed LoopAnimate extends, for the first time, the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results. LoopAnimate is a novel image-to-video generation method that generates loopable videos with a length of 35 frames, improving object fidelity and extending generation length. Existing video generation models have limitations in object fidelity and generation length, and lack the ability to create seamlessly looping videos which are needed in applications like dynamic wallpapers. The paper proposes a multi-level image representation and textual semantic decoupling framework, a three-stage training strategy progressively increasing the number of frames, and an Asymmetric Loop Sampling Strategy for loopable video generation. LoopAnimate outperforms state-of-the-art methods in object fidelity and motion quality, particularly for human portraits. The three-stage training strategy successfully extends the generation length to 35 frames while preserving video quality. A specially designed condition initialization method and asymmetric loop sampling strategy enable generation of loopable videos. The model relies on accurate salient object detection for optimal performance. Further research can explore extending the generation length beyond the current 35-frame limit. diffusion models, image-to-video generation, loopable video, long video generation, object fidelity
2404.09111 Report Exploring Generative AI for Sim2Real in Driving Data Synthesis Haonan Zhao, Yiting Wang, Thomas Bashford-Rogers, Valentina Donzella, Kurt Debattista Datasets are essential for training and testing vehicle perception algorithms. However, the collection and annotation of real-world images are time-consuming and expensive. Driving simulators offer a solution by automatically generating various driving scenarios with corresponding annotations, but the simulation-to-reality (Sim2Real) domain gap remains a challenge. While most generative Artificial Intelligence (AI) approaches follow the de facto Generative Adversarial Nets (GANs)-based methods, the recently emerging diffusion probabilistic models have not been fully explored in mitigating Sim2Real challenges for driving data synthesis. To explore their performance, this paper applies three different generative AI methods to leverage semantic label maps from a driving simulator as a bridge for the creation of realistic datasets. A comparative analysis of these methods is presented from the perspective of image quality and perception. New synthetic datasets, which include driving images and auto-generated high-quality annotations, are produced with low costs and high scene variability. The experimental results show that although GAN-based methods are adept at generating high-quality images when provided with manually annotated labels, ControlNet produces synthetic datasets with fewer artefacts and more structural fidelity when using simulator-generated labels. This suggests that the diffusion-based approach may provide improved stability and an alternative method for addressing Sim2Real challenges. This paper explores three generative AI methods (two GAN-based, one diffusion-based) to generate realistic driving datasets from simulator semantic label maps, aiming to bridge the simulation-to-reality gap. Collecting and annotating real-world driving data is expensive and limited in scenario diversity. Simulators can generate diverse scenarios but often lack realism, hindering their use in training robust perception algorithms. The paper leverages semantic label maps from the CARLA simulator and Cityscapes dataset. It trains Pix2pixHD, OASIS (GAN-based), and ControlNet (diffusion-based) models to translate these maps into realistic images. GAN-based methods excel in image quality when trained on manually annotated Cityscapes labels but struggle with simulator labels. ControlNet, while stylistically different from Cityscapes, generates images with fewer artefacts and better structural fidelity, especially with simulator labels. Findings suggest ControlNet's diffusion process may offer better stability and robustness in handling variations in label accuracy. The study primarily focuses on semantic segmentation, limiting the assessment of other perception tasks. Future work could explore modifying ControlNet to improve the realism and diversity of generated images while preserving structural accuracy. generative ai, sim2real, driving data synthesis, diffusion models, semantic segmentation
2404.09105 Report EGGS: Edge Guided Gaussian Splatting for Radiance Fields Yuanhao Gong Gaussian splatting methods are becoming popular. However, their loss function only contains the $\ell_1$ norm and the structural similarity between the rendered and input images, without considering the edges in these images. It is well-known that the edges in an image provide important information. Therefore, in this paper, we propose an Edge Guided Gaussian Splatting (EGGS) method that leverages the edges in the input images. More specifically, we give the edge region a higher weight than the flat region. With such edge guidance, the resulting Gaussian particles focus more on the edges instead of the flat regions. Moreover, such edge guidance does not increase the computational cost during the training and rendering stages. The experiments confirm that such a simple edge-weighted loss function indeed improves results by about $1\sim2$ dB on several different datasets. By simply plugging in the edge guidance, the proposed method can improve all Gaussian splatting methods in different scenarios, such as human head modeling, building 3D reconstruction, etc. Introduces Edge Guided Gaussian Splatting (EGGS), improving radiance field accuracy in 3D Gaussian splatting methods by weighting edges in the loss function. Edges are visually important, and existing Gaussian splatting methods treat all pixels equally in their loss functions, leading to suboptimal results. EGGS incorporates an edge-weighting function (e.g., image gradient) into the loss function, giving higher importance to edge pixels during optimization. EGGS achieves 1-2 dB PSNR improvement over standard 3DGS on various datasets. Edge guidance encourages Gaussian particles to align with edges, improving scene geometry representation. The method is computationally efficient, adding no overhead to training or rendering. PSNR improvement may vary depending on scene complexity, image resolution, and other factors. Future work includes exploring more sophisticated edge detection methods and applying EGGS to other 3DGS variants. gaussian splatting, radiance fields, 3d reconstruction, edge detection, computer vision
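The edge-weighted objective summarized in this entry is simple enough to sketch. The following is a minimal PyTorch sketch, not the authors' code: it builds a per-pixel weight map from a Sobel gradient magnitude of the ground-truth image and applies it to the L1 term. The Sobel filters and the `edge_gain` hyperparameter are assumptions, since the entry only states that edge regions receive higher weight than flat regions.

```python
# Minimal sketch of an edge-weighted rendering loss in the spirit of EGGS.
# Assumptions: Sobel filters for the edge map and a single gain hyperparameter.
import torch
import torch.nn.functional as F

def sobel_magnitude(img: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W) in [0, 1]. Returns per-pixel gradient magnitude (B, 1, H, W)."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def edge_weighted_l1(rendered: torch.Tensor, target: torch.Tensor, edge_gain: float = 4.0) -> torch.Tensor:
    """Weight the L1 reconstruction error so edge pixels count more than flat pixels."""
    grad = sobel_magnitude(target)
    weight = 1.0 + edge_gain * grad / (grad.amax(dim=(2, 3), keepdim=True) + 1e-8)
    return (weight * (rendered - target).abs()).mean()

if __name__ == "__main__":
    target = torch.rand(1, 3, 64, 64)
    rendered = target + 0.05 * torch.randn_like(target)
    print(edge_weighted_l1(rendered, target).item())
```

Because the weight map depends only on the ground-truth image, it can be precomputed once per training view, which is consistent with the claim that the guidance adds no overhead during training or rendering.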
2404.08921 Report PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos Qi Zhao, M. Salman Asif, Zhan Ma The primary focus of Neural Representation for Videos (NeRV) is to effectively model its spatiotemporal consistency. However, current NeRV systems often face a significant issue of spatial inconsistency, leading to decreased perceptual quality. To address this issue, we introduce the Pyramidal Neural Representation for Videos (PNeRV), which is built on a multi-scale information connection and comprises a lightweight rescaling operator, Kronecker Fully-connected layer (KFc), and a Benign Selective Memory (BSM) mechanism. The KFc, inspired by the tensor decomposition of the vanilla Fully-connected layer, facilitates low-cost rescaling and global correlation modeling. BSM merges high-level features with granular ones adaptively. Furthermore, we provide an analysis based on the Universal Approximation Theory of the NeRV system and validate the effectiveness of the proposed PNeRV. We conducted comprehensive experiments to demonstrate that PNeRV surpasses the performance of contemporary NeRV models, achieving the best results in video regression on UVG and DAVIS under various metrics (PSNR, SSIM, LPIPS, and FVD). Compared to vanilla NeRV, PNeRV achieves a +4.49 dB gain in PSNR and a 231% increase in FVD on UVG, along with a +3.28 dB PSNR and 634% FVD increase on DAVIS. This paper introduces PNeRV (Pyramidal Neural Representation for Videos) to address the spatial inconsistency issue in current NeRV systems, aiming for enhanced spatiotemporal consistency in video representation. Existing NeRV systems suffer from poor perceptual quality due to spatial inconsistency, stemming from a lack of global receptive field and multi-scale information communication. This limits their ability to model complex videos effectively. PNeRV leverages a multi-scale information connection using a lightweight Kronecker Fully-connected (KFc) layer for low-cost upsampling and global correlation modeling. It also employs a Benign Selective Memory (BSM) mechanism for adaptive merging of high-level and granular features. The paper also provides a Universal Approximation Theory analysis for NeRV. PNeRV outperforms state-of-the-art NeRV models in video regression tasks on UVG and DAVIS datasets, achieving superior performance in PSNR, SSIM, LPIPS, and FVD metrics. PNeRV demonstrates significant improvement in spatial consistency, reducing noise and artifacts in reconstructed videos, especially in scenes with complex spatiotemporal features. Ablation studies confirm the effectiveness of KFc and BSM in enhancing perceptual quality and demonstrate the superiority of the pyramidal structure for multi-scale feature learning. The hierarchical structure in PNeRV increases computational complexity, demanding further optimization for practical applications. Future work will explore the theoretical analysis and enhancement of PNeRV's generalization abilities, particularly in video interpolation tasks. neural video representation, implicit neural representation, video coding, perceptual quality, multi-scale feature learning
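A Kronecker-factorized fully-connected layer of the kind named here (KFc) can be sketched as below. This is an illustrative reconstruction that assumes the dense weight is factored as a single Kronecker product W ≈ A ⊗ B; the paper's actual layer, its rescaling behaviour, and its initialization may differ.

```python
# Sketch of a Kronecker-factorized fully-connected layer ("KFc"-style).
# Assumption: the dense weight W (out = m1*m2, in = n1*n2) is factored as A ⊗ B,
# giving O(m1*n1 + m2*n2) parameters instead of O(m1*m2*n1*n2).
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    def __init__(self, n1, n2, m1, m2):
        super().__init__()
        self.n1, self.n2, self.m1, self.m2 = n1, n2, m1, m2
        self.A = nn.Parameter(torch.randn(m1, n1) / n1 ** 0.5)
        self.B = nn.Parameter(torch.randn(m2, n2) / n2 ** 0.5)
        self.bias = nn.Parameter(torch.zeros(m1 * m2))

    def forward(self, x):  # x: (batch, n1 * n2)
        X = x.view(-1, self.n1, self.n2)
        # (A ⊗ B) x  ==  row-major flattening of A @ X @ B^T
        Y = self.A @ X @ self.B.transpose(0, 1)
        return Y.reshape(-1, self.m1 * self.m2) + self.bias

if __name__ == "__main__":
    layer = KroneckerLinear(8, 8, 16, 16)
    x = torch.randn(4, 64)
    dense = torch.kron(layer.A, layer.B)  # equivalent dense weight, for checking only
    assert torch.allclose(layer(x), x @ dense.T + layer.bias, atol=1e-4)
    print(layer(x).shape)  # torch.Size([4, 256])
```

The factored form never materializes the dense matrix, which is what makes such a layer attractive as a low-cost rescaling/global-mixing operator.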
2404.08819 Report The Illusion of State in State-Space Models William Merrill, Jackson Petty, Ashish Sabharwal State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and state tracking (Merrill and Sabharwal, 2023), which SSMs are explicitly designed to address via their close architectural similarity to recurrent neural networks (RNNs). But do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no. Our analysis reveals that the expressive power of SSMs is limited very similarly to transformers: SSMs cannot express computation outside the complexity class $\mathsf{TC}^0$. In particular, this means they cannot solve simple state-tracking problems like permutation composition. It follows that SSMs are provably unable to accurately track chess moves with certain notation, evaluate code, or track entities in a long narrative. To supplement our formal analysis, we report experiments showing that Mamba-style SSMs indeed struggle with state tracking. Thus, despite its recurrent formulation, the "state" in an SSM is an illusion: SSMs have similar expressiveness limitations to non-recurrent models like transformers, which may fundamentally limit their ability to solve real-world state-tracking problems. This paper demonstrates that state-space models (SSMs), like transformers, are limited in their expressive power for state tracking and cannot solve problems outside the complexity class TC^0. SSMs have been proposed as alternatives to transformers, with potential advantages in handling stateful and sequential problems. This work investigates whether these advantages hold true theoretically and practically. The authors employ circuit complexity analysis to prove that linear and Mamba-style SSMs fall within the TC^0 complexity class, limiting their ability to express complex state tracking. They also conduct experiments on permutation composition tasks to empirically evaluate the state-tracking capabilities of SSMs compared to transformers and RNNs. Theoretically, linear and Mamba-style SSMs are limited to TC^0 complexity, similar to transformers, preventing them from solving problems like permutation composition. Empirically, SSMs and transformers fail to learn permutation composition with a fixed depth, unlike RNNs. SSMs, while still limited, empirically perform better than transformers on approximate state tracking for less complex tasks. The analysis focuses on specific SSM architectures (linear and Mamba-style) and might not cover all variants. Future work could explore alternative SSM designs that balance parallelizability and state-tracking expressiveness. state-space models, transformers, state tracking, circuit complexity, expressive power
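The state-tracking task behind the experiments summarized here, composition of permutations (word problems over S5), is easy to reproduce as a toy dataset. The sketch below is an assumed setup in that spirit, not the authors' exact benchmark or tokenization.

```python
# Toy generator for the S5 permutation-composition (word problem) task used to probe
# state tracking: given a sequence of permutations, predict their running composition.
# This is an assumed reconstruction of the task, not the paper's code.
import itertools
import random
import torch

PERMS = list(itertools.permutations(range(5)))          # the 120 elements of S5
PERM_TO_ID = {p: i for i, p in enumerate(PERMS)}

def compose(p, q):
    """Apply q first, then p (function composition p ∘ q)."""
    return tuple(p[q[i]] for i in range(5))

def make_example(seq_len: int):
    """Inputs: sequence of permutation ids. Targets: id of the running composition."""
    state = tuple(range(5))                              # identity permutation
    inputs, targets = [], []
    for _ in range(seq_len):
        p = random.choice(PERMS)
        state = compose(p, state)
        inputs.append(PERM_TO_ID[p])
        targets.append(PERM_TO_ID[state])
    return torch.tensor(inputs), torch.tensor(targets)

if __name__ == "__main__":
    x, y = make_example(16)
    print(x.tolist())
    print(y.tolist())  # a true recurrent model just carries `state`; fixed-depth models must approximate it
```

A vanilla RNN solves this with a single recurrent cell, which is exactly the separation the paper's theory predicts: fixed-depth, parallelizable architectures (transformers, linear/Mamba-style SSMs) cannot express the running composition exactly for arbitrary sequence lengths.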
2404.08814 Report E3: Ensemble of Expert Embedders for Adapting Synthetic Image Detectors to New Generators Using Limited Data Aref Azizpour, Tai D. Nguyen, Manil Shrestha, Kaidi Xu, Edward Kim, Matthew C. Stamm As generative AI progresses rapidly, new synthetic image generators continue to emerge at a swift pace. Traditional detection methods face two main challenges in adapting to these generators: the forensic traces of synthetic images from new techniques can vastly differ from those learned during training, and access to data for these new generators is often limited. To address these issues, we introduce the Ensemble of Expert Embedders (E3), a novel continual learning framework for updating synthetic image detectors. E3 enables the accurate detection of images from newly emerged generators using minimal training data. Our approach does this by first employing transfer learning to develop a suite of expert embedders, each specializing in the forensic traces of a specific generator. Then, all embeddings are jointly analyzed by an Expert Knowledge Fusion Network to produce accurate and reliable detection decisions. Our experiments demonstrate that E3 outperforms existing continual learning methods, including those developed specifically for synthetic image detection. The paper introduces Ensemble of Expert Embedders (E3), a novel continual learning framework for updating synthetic image detectors to accurately detect images from newly emerged generators using minimal training data. Traditional detection methods struggle to adapt to new synthetic image generators due to the vastly different forensic traces and limited access to data from these generators. This necessitates continual updating of detectors, which poses challenges like catastrophic forgetting and data inefficiency. E3 employs transfer learning to develop a suite of expert embedders, each specializing in forensic traces of a specific generator. Embeddings from all experts are jointly analyzed by an Expert Knowledge Fusion Network to produce accurate detection decisions. E3 significantly outperforms existing continual learning methods, including those designed for synthetic image detection. E3 exhibits strong and stable performance across various new generators, including those with limited training data. The framework demonstrates generality by achieving superior results across multiple detector architectures. The ensemble approach increases network size, although the increase in parameters is manageable and outweighed by the improved detection accuracy. Future work could explore compressing the model or reducing the number of experts to address the increased network size. synthetic image detection, continual learning, generative adversarial networks (gans), transfer learning, ensemble learning
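The overall architecture described in this entry, a set of per-generator expert embedders whose embeddings are jointly analyzed by a fusion network, can be sketched as follows. The backbone choice, embedding size, and fusion MLP shape are placeholders, not the paper's configuration.

```python
# Sketch of an ensemble-of-expert-embedders detector: each expert is a backbone adapted
# to one generator's forensic traces; a small fusion network classifies the concatenated
# embeddings as real vs. synthetic. Backbone and sizes are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

def make_expert(embed_dim: int = 128) -> nn.Module:
    backbone = models.resnet18(weights=None)
    backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
    return backbone

class EnsembleDetector(nn.Module):
    def __init__(self, num_experts: int, embed_dim: int = 128):
        super().__init__()
        self.experts = nn.ModuleList(make_expert(embed_dim) for _ in range(num_experts))
        self.fusion = nn.Sequential(
            nn.Linear(num_experts * embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),                      # real vs. synthetic
        )

    def forward(self, x):
        feats = [expert(x) for expert in self.experts]      # one embedding per expert
        return self.fusion(torch.cat(feats, dim=1))

if __name__ == "__main__":
    model = EnsembleDetector(num_experts=3)
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 2])
```

Adding a new generator then amounts to fine-tuning one more expert on the limited new data and retraining only the small fusion head, which is the continual-learning benefit the entry highlights.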
2404.08639 Report COCONut: Modernizing COCO Segmentation Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, Liang-Chieh Chen In recent decades, the vision community has witnessed remarkable progress in visual recognition, partially owing to advancements in dataset benchmarks. Notably, the established COCO benchmark has propelled the development of modern detection and segmentation systems. However, the COCO segmentation benchmark has seen comparatively slow improvement over the last decade. Originally equipped with coarse polygon annotations for thing instances, it gradually incorporated coarse superpixel annotations for stuff regions, which were subsequently heuristically amalgamated to yield panoptic segmentation annotations. These annotations, executed by different groups of raters, have resulted not only in coarse segmentation masks but also in inconsistencies between segmentation types. In this study, we undertake a comprehensive reevaluation of the COCO segmentation annotations. By enhancing the annotation quality and expanding the dataset to encompass 383K images with more than 5.18M panoptic masks, we introduce COCONut, the COCO Next Universal segmenTation dataset. COCONut harmonizes segmentation annotations across semantic, instance, and panoptic segmentation with meticulously crafted high-quality masks, and establishes a robust benchmark for all segmentation tasks. To our knowledge, COCONut stands as the inaugural large-scale universal segmentation dataset, verified by human raters. We anticipate that the release of COCONut will significantly contribute to the community's ability to assess the progress of novel neural networks. The paper introduces COCONut, a large-scale universal segmentation dataset designed to modernize and improve upon the COCO segmentation annotations. The original COCO segmentation annotations suffer from limitations such as coarse masks, inconsistencies between segmentation types, and a relatively small dataset size, hindering the evaluation and training of modern segmentation models. The authors developed an assisted-manual annotation pipeline and a data engine to efficiently create high-quality segmentation masks. The pipeline leverages neural networks for generating proposals and allows human raters to edit and refine them. The data engine iteratively expands the dataset while maintaining high annotation quality. COCONut provides human-verified annotations for 383K images and 5.18M masks, surpassing the size and quality of existing datasets. The assisted-manual pipeline significantly accelerates the annotation process while ensuring high-quality masks. Experiments demonstrate that models trained on COCONut outperform those trained on COCO, highlighting the importance of large-scale, high-quality annotations. The dataset is currently limited to 133 semantic classes, potentially limiting its applicability to open-vocabulary segmentation tasks. Future work could explore incorporating more diverse image sources and further expanding the dataset size. segmentation, dataset, coco, annotation, deep learning
2404.08636 Report Probing the 3D Awareness of Visual Foundation Models Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d. This paper investigates the 3D awareness of visual foundation models, examining how well they capture 3D scene structure and exhibit consistency across different viewpoints. Understanding the 3D awareness of visual foundation models is crucial for assessing their capabilities and limitations in representing the 3D world, particularly as they are increasingly used in 3D-related tasks. The authors probe frozen representations of various large-scale pretrained models using task-specific probes and zero-shot inference for depth estimation, surface normal estimation, and 3D correspondence on both scene-level (NYUv2, ScanNet) and object-level (NAVI) datasets. Discriminative self-supervised models like DINOv2 exhibit the strongest 3D awareness, demonstrating impressive performance in encoding depth and surface normals. Models show good correspondence estimation for small viewpoint changes but struggle with large viewpoint variations, indicating a lack of true 3D consistency. Vision-language models like CLIP perform poorly in capturing 3D information despite their strong semantic generalization abilities. The study relies on publicly available checkpoints trained on diverse datasets with varying scales and recipes, limiting controlled comparisons. The analysis focuses on specific aspects of 3D awareness and probing methods, potentially overlooking other facets of 3D understanding and evaluation techniques. 3d vision, visual foundation models, self-supervised learning, representation learning, multiview consistency
2404.08603 Report Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation Yanhao Zheng, Kai Liu Open-vocabulary object detection (OVOD) aims at localizing and recognizing visual objects from novel classes unseen at training time. However, empirical studies reveal that advanced detectors generally assign lower scores to those novel instances, which are inadvertently suppressed during inference by commonly adopted greedy strategies like Non-Maximum Suppression (NMS), leading to sub-optimal detection performance for novel classes. This paper systematically investigates this problem with the commonly-adopted two-stage OVOD paradigm. Specifically, in the region-proposal stage, proposals that contain novel instances showcase lower objectness scores, since they are treated as background proposals during the training phase. Meanwhile, in the object-classification stage, novel objects share lower region-text similarities (i.e., classification scores) due to the biased visual-language alignment by seen training samples. To alleviate this problem, this paper introduces two advanced measures to adjust confidence scores and conserve erroneously dismissed objects: (1) a class-agnostic localization quality estimate via overlap degree of region/object proposals, and (2) a text-guided visual similarity estimate with proxy prototypes for novel classes. Integrated with adjusting techniques specifically designed for the region-proposal and object-classification stages, this paper derives the aggregated confidence estimate for the open-vocabulary object detection paradigm (AggDet). Our AggDet is a generic and training-free post-processing scheme, which consistently bolsters open-vocabulary detectors across model scales and architecture designs. For instance, AggDet achieves 3.3% and 1.5% gains on OV-COCO and OV-LVIS benchmarks respectively, without any training cost. This paper introduces AggDet, a training-free post-processing method for open-vocabulary object detection (OVOD), which aggregates confidence estimates from both region-proposal and object-classification stages to boost the detection performance on novel classes. Current OVOD detectors exhibit a significant performance gap between novel and base classes, due to the underestimated confidence scores for novel instances in both region-proposal and object-classification stages. AggDet leverages (1) a class-agnostic localization quality estimate via the overlap degree of region proposals, and (2) a text-guided visual similarity estimate with proxy prototypes for novel classes, to adjust the confidence scores during inference. AggDet consistently enhances various OVOD detectors across model scales and architectures, without any training cost. The method achieves up to 3.3% and 1.5% gains on OV-COCO and OV-LVIS benchmarks, respectively. AggDet introduces minimal computational overhead, with less than 1 ms latency during inference. The hyper-parameters in AggDet need to be slightly tuned for different datasets. Future work could explore incorporating the aggregation techniques into the training paradigm for further performance improvement. open-vocabulary object detection, confidence aggregation, region proposal, object classification, zero-shot learning
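The post-processing idea summarized here, rescaling detection confidences with a class-agnostic localization-quality estimate derived from proposal overlap, can be roughly sketched as below. The specific aggregation formula, the top-k averaging, and the exponent are assumptions for illustration; the entry only states that overlap degree and a text-guided similarity estimate are combined with the raw scores before suppression.

```python
# Rough sketch of overlap-based confidence adjustment for open-vocabulary detection.
# Assumption: a box's localization quality is proxied by its mean top-k IoU with the
# other proposals, blended with 1 so isolated boxes are only mildly down-weighted,
# and scores are rescaled by quality ** gamma before NMS.
import torch
from torchvision.ops import box_iou, nms

def adjust_scores(boxes: torch.Tensor, scores: torch.Tensor,
                  k: int = 5, gamma: float = 0.5) -> torch.Tensor:
    """boxes: (N, 4) in xyxy format, scores: (N,). Returns adjusted scores (N,)."""
    iou = box_iou(boxes, boxes)
    iou.fill_diagonal_(0.0)                               # ignore self-overlap
    topk = iou.topk(min(k, max(iou.shape[1] - 1, 1)), dim=1).values
    quality = 0.5 * (1.0 + topk.mean(dim=1))              # class-agnostic quality proxy in [0.5, 1]
    return scores * quality ** gamma

if __name__ == "__main__":
    boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
    scores = torch.tensor([0.30, 0.28, 0.25])
    adjusted = adjust_scores(boxes, scores, k=1)
    keep = nms(boxes, adjusted, iou_threshold=0.5)
    print(adjusted, keep)
```

The point of the sketch is the placement, not the formula: the adjustment runs entirely at inference time, before NMS, so no detector weights are touched.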
2404.08590 Report Improving Referring Image Segmentation using Vision-Aware Text Features Hai Nguyen-Truong, E-Ro Nguyen, Tuan-Anh Vu, Minh-Triet Tran, Binh-Son Hua, Sai-Kit Yeung Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present a novel framework VATEX to improve referring image segmentation by enhancing object and context understanding with Vision-Aware Text Features. Our method involves using CLIP to derive a CLIP Prior that integrates an object-centric visual heatmap with the text description, which can be used as the initial query in a DETR-based architecture for the segmentation task. Furthermore, by observing that there are multiple ways to describe an instance in an image, we enforce feature similarity between text variations referring to the same visual input by two components: a novel Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint to further ensure the coherent and consistent interpretation of language expressions with the context understanding obtained from the image. Our method achieves a significant performance improvement on three benchmark datasets RefCOCO, RefCOCO+ and G-Ref. Code is available at: https://nero1342.github.io/VATEX_RIS. This paper proposes VATEX, a novel framework that leverages vision-aware text features to enhance the performance of referring image segmentation. Existing methods often struggle with complex or ambiguous language expressions, leading to inaccurate segmentation results. VATEX addresses this by enhancing object and context understanding through a deeper integration of visual and textual information. The proposed method utilizes a CLIP Prior for object localization, a Contextual Multimodal Decoder (CMD) for hierarchical visual-textual feature fusion, and a Meaning Consistency Constraint (MCC) to enforce consistent representation of different expressions referring to the same object. VATEX achieves state-of-the-art performance on three referring image segmentation benchmarks: RefCOCO, RefCOCO+, and G-Ref. The method also demonstrates strong performance on referring video object segmentation benchmarks, Ref-YouTube-VOS and Ref-DAVIS17. Ablation studies and qualitative analysis validate the contribution of each proposed component (CLIP Prior, CMD, MCC) to the overall performance improvement. The method currently does not explicitly model relationships between objects or actions, limiting its accuracy in scenarios requiring such understanding. Future work will focus on incorporating object interaction and action alignment into the framework for improved segmentation in more complex scenarios. referring image segmentation, vision-aware text features, clip localization, multimodal understanding, meaning consistency constraint
2404.08580 Report Lossy Image Compression with Foundation Diffusion Models Lucas Relic, Roberto Azevedo, Markus Gross, Christopher Schroers Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive decoders robust to quantization errors in the conditioning signals, yet achieving competitive results in this manner requires costly training of the diffusion model and long inference times due to the iterative generative process. In this work we formulate the removal of quantization error as a denoising task, using diffusion to recover lost information in the transmitted image latent. Our approach allows us to perform less than 10% of the full diffusion generative process and requires no architectural changes to the diffusion model, enabling the use of foundation models as a strong prior without additional fine-tuning of the backbone. Our proposed codec outperforms previous methods in quantitative realism metrics, and we verify that our reconstructions are qualitatively preferred by end users, even when other methods use twice the bitrate. This paper proposes a novel lossy image compression codec leveraging foundation latent diffusion models for realistic image reconstruction, particularly at low bitrates. Existing image compression methods often produce unrealistic or distorted images at very low bitrates. This work addresses this by using diffusion models to synthesize lost details and enhance perceptual quality. The method combines a variational autoencoder from a pre-trained latent diffusion model, adaptive quantization, a learned timestep prediction module for optimal denoising, and an entropy model. It processes a quantized latent representation and uses the diffusion model for denoising, allowing for a significant reduction in the number of diffusion steps compared to previous works. The proposed codec achieves state-of-the-art results in image realism as measured by FID, outperforming previous generative compression methods. It maintains competitive performance in traditional distortion metrics like LPIPS and MS-SSIM, especially compared to other diffusion-based codecs. A subjective user study confirms that the reconstructions are visually preferred over other state-of-the-art methods, even at lower bitrates. The method might inaccurately reconstruct certain image details due to limitations of the foundation diffusion model's VAE. Potential misgeneration of content at very low bitrates raises ethical concerns. image compression, latent diffusion, generative models, low bitrate, image realism
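The central idea in this entry, treating quantization error in the transmitted latent as noise to be removed by only the tail end of the diffusion process, can be illustrated with a toy calculation: measure the quantization noise level, find the diffusion timestep whose noise level matches it, and run just the remaining steps. The cosine schedule, uniform quantizer, and matching rule below are illustrative assumptions, not the paper's learned timestep-prediction module or codec components.

```python
# Toy illustration of "quantization error as diffusion noise": estimate the noise level
# introduced by quantizing a latent, then pick the diffusion timestep whose noise-to-signal
# ratio matches it, so only the last few reverse steps need to run.
import torch

def cosine_alpha_bar(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    t = torch.linspace(0, T, T + 1) / T
    f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
    return (f / f[0]).clamp(1e-5, 1.0)                  # cumulative alpha_bar_t

def matching_timestep(latent: torch.Tensor, step: float, alpha_bar: torch.Tensor) -> int:
    z_q = torch.round(latent / step) * step             # uniform quantization
    sigma_q = (z_q - latent).std()                      # measured quantization noise
    # x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps  =>  noise/signal = sqrt((1 - ab_t) / ab_t)
    schedule_sigma = ((1 - alpha_bar) / alpha_bar).sqrt()
    return int((schedule_sigma - sigma_q).abs().argmin())

if __name__ == "__main__":
    alpha_bar = cosine_alpha_bar()
    latent = torch.randn(1, 4, 32, 32)
    for step in (0.05, 0.25, 1.0):                      # coarser quantization -> later start timestep
        t = matching_timestep(latent, step, alpha_bar)
        print(f"quantization step {step}: start denoising from t = {t} of 1000")
    # From t, a pretrained latent-diffusion denoiser would run only those t steps
    # (a small fraction of the full reverse process) before decoding with the VAE.
```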
2404.08540 Report On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation Agneet Chatterjee, Tejas Gokhale, Chitta Baral, Yezhou Yang Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introduce methods to benchmark its effectiveness across various settings. We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors and evaluate their downstream impact on depth estimation. Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions and counter-intuitively fare worse with low-level descriptions. Despite leveraging additional data, these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift. Finally, to provide a foundation for future research, we identify points of failure and offer insights to better understand these shortcomings. With an increasing number of methods using language for depth estimation, our findings highlight the opportunities and pitfalls that require careful consideration for effective deployment in real-world settings. This paper investigates the impact of natural language guidance on monocular depth estimation, particularly its generalization and robustness. Understanding the role of language priors is crucial for effectively deploying depth estimation in real-world applications like autonomous driving and robotics. The authors systematically evaluate language-guided depth estimation by: (1) generating sentences describing spatial relationships between objects, image captions, and activity descriptions, (2) conducting supervised and zero-shot experiments with varying language inputs, (3) analyzing robustness under adversarial conditions like object masking and distribution shifts. Existing language-guided methods exhibit a strong scene-level bias, performing optimally with scene-level descriptions but deteriorating with low-level spatial relationships. Language-guided models are less robust to distribution shifts and adversarial attacks compared to vision-only methods. Performance improvement is observed with an increase in the number of low-level spatial sentences, suggesting a need for sufficient scene-level representation. The study primarily focuses on the VPD model and may not fully represent all language-guided depth estimators. Future work should explore alternative methods for incorporating language priors and improving robustness to domain shifts. depth estimation, language guidance, robustness, distribution shift, adversarial attacks
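The "low-level" object-centric sentences used to probe the depth models in this entry can be generated from simple geometric relations between annotated objects. The templates and the use of box centers plus per-object depth below are assumptions for illustration, not the authors' exact generation procedure.

```python
# Illustrative generator for low-level spatial sentences from object annotations
# (normalized 2D box center x plus a depth value per object). Templates are assumptions.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Obj:
    name: str
    cx: float     # normalized box center x in [0, 1]
    depth: float  # metric depth of the object (meters)

def spatial_sentences(objects):
    sentences = []
    for a, b in combinations(objects, 2):
        side = "to the left of" if a.cx < b.cx else "to the right of"
        dist = "closer to the camera than" if a.depth < b.depth else "farther from the camera than"
        sentences.append(f"The {a.name} is {side} and {dist} the {b.name}.")
    return sentences

if __name__ == "__main__":
    scene = [Obj("chair", 0.2, 1.8), Obj("table", 0.6, 2.5), Obj("lamp", 0.8, 4.0)]
    for s in spatial_sentences(scene):
        print(s)
```

Feeding such sentences as the language prior, in place of scene-level captions, is the manipulation whose counter-intuitive effect (worse depth) the entry reports.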
2404.08506 Report LaSagnA: Language-based Segmentation Assistant for Complex Queries Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, Lin Ma Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, there are two constraints that restrict the further application of these vLLMs: the incapability of handling multiple targets per query and the failure to identify the absence of query objects in the image. In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries. Consequently, we define the general sequence format for complex queries. Then we incorporate a semantic segmentation task in the current pipeline to fulfill the requirements of training data. Furthermore, we present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format. The effectiveness of our model in processing complex queries is validated by the comparable results with conventional methods on both close-set and open-set semantic segmentation datasets. Additionally, we outperform a series of vLLMs in reasoning and referring segmentation, showcasing our model's remarkable capabilities. We release the code at https://github.com/congvvc/LaSagnA. This paper presents LaSagnA, a Large Language Model for vision (vLLM) that handles complex queries involving multiple arbitrary targets, which may or may not exist in an image, by introducing a new input sequence format and incorporating semantic segmentation tasks into training. Existing vLLM-based segmentation assistants struggle with complex queries because their training primarily revolves around single-target scenarios where the queried object is always present in the image. This limits their applicability in real-world settings where multiple or even non-existent targets might be queried. The authors define a new sequence format that incorporates multiple classes and negative classes. They integrate semantic segmentation tasks into the training process, and to address challenges in training with this new format, they propose three strategies: sequence augmentation (adding negative classes to the response), random classes list (using a dynamic list of categories in the query), and target order consistency (aligning category order in response with the query). LaSagnA achieves comparable results to state-of-the-art segmentation specialists on both closed-set and open-set semantic segmentation benchmarks. The model outperforms previous vLLMs on referring segmentation tasks, demonstrating its enhanced ability to locate and segment objects based on complex language descriptions. LaSagnA exhibits promising zero-shot performance on the generalized referring segmentation benchmark (gRefCOCO), highlighting its capacity to handle unseen scenarios with multiple and non-existent targets. While LaSagnA excels in high-level understanding, its accuracy in capturing low-level visual details and handling small or crowded objects still lags behind specialized segmentation models. Further research is needed to develop lighter and more efficient vLLMs and mask decoders to enhance computational efficiency. large language models for vision (vllms), semantic segmentation, referring segmentation, complex query handling, open-set segmentation
2404.08449 Report OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering Jingrui Ye, Zongkai Zhang, Yujiao Jiang, Qingmin Liao, Wenming Yang, Zongqing Lu Rendering dynamic 3D humans from monocular videos is crucial for various applications such as virtual reality and digital entertainment. Most methods assume the person is in an unobstructed scene, while various objects may cause the occlusion of body parts in real-life scenarios. Previous methods utilize NeRF for surface rendering to recover the occluded areas, but they require more than one day to train and several seconds to render, failing to meet the requirements of real-time interactive applications. To address these issues, we propose OccGaussian based on 3D Gaussian Splatting, which can be trained within 6 minutes and produces high-quality human renderings up to 160 FPS with occluded input. OccGaussian initializes 3D Gaussian distributions in the canonical space and performs occlusion feature queries at occluded regions, where the aggregated pixel-aligned feature is extracted to compensate for the missing information. Then we use a Gaussian Feature MLP to further process the feature, together with occlusion-aware loss functions, to better perceive the occluded area. Extensive experiments on both simulated and real-world occlusions demonstrate that our method achieves comparable or even superior performance compared to the state-of-the-art method, while improving training and inference speeds by 250x and 800x, respectively. Our code will be available for research purposes. OccGaussian, a novel method for rendering humans in monocular videos with occlusions using 3D Gaussian Splatting, achieving fast training and real-time rendering. Previous methods for rendering humans under occlusion are too slow in training and inference, limiting their real-world applications. OccGaussian leverages aggregated pixel-aligned features from visible points to recover occluded regions. It employs a K-nearest feature query and MLPs to model occluded points' colors and opacities. Additionally, it incorporates occlusion and consistency losses for enhanced rendering in occluded areas. OccGaussian achieves comparable or better rendering quality than the state-of-the-art method OccNeRF. It significantly reduces training time to 6-13 minutes, approximately 250 times faster than OccNeRF. It enables real-time rendering at up to 169 FPS, 800 times faster than OccNeRF. OccGaussian may struggle to fully recover regions occluded for extended periods due to weak supervision. Reliance on accurate human poses and camera parameters can limit its performance on in-the-wild videos. human rendering, occlusion handling, 3d gaussian splatting, monocular video, real-time rendering
2404.08312 Report GPN: Generative Point-based NeRF Haipeng Wang Scanning real-life scenes with modern registration devices typically gives incomplete point cloud representations, primarily due to the limitations of partial scanning, 3D occlusions, and dynamic light conditions. Recent works on processing incomplete point clouds have largely focused on point cloud completion. However, these approaches do not ensure consistency between the completed point cloud and the captured images regarding color and geometry. We propose using Generative Point-based NeRF (GPN) to reconstruct and repair a partial cloud by fully utilizing the scanning images and the corresponding reconstructed cloud. The repaired point cloud can achieve multi-view consistency with the captured images at high spatial resolution. For fine-tuning on a single scene, we optimize the global latent condition by incorporating an Auto-Decoder architecture while retaining multi-view consistency. As a result, the generated point clouds are smooth, plausible, and geometrically consistent with the partial scanning images. Extensive experiments on ShapeNet demonstrate that our work achieves performance competitive with other state-of-the-art point cloud-based neural scene rendering and editing methods. This paper proposes GPN, a lightweight, generalizable point-based NeRF framework that reconstructs and repairs partial point clouds using scanning images and reconstructed clouds, ensuring multi-view consistency. Existing point cloud completion methods often lack consistency between the completed point cloud and captured images in terms of color and geometry. GPN addresses this limitation by leveraging both scanning images and point clouds. GPN uses a hypernetwork paradigm-based VAE architecture for generalization training and an auto-decoder-based fine-tuning strategy for per-scene optimization. It proposes two frameworks: "Generation Framework" for complete clouds and "Completion Framework" for repairing incomplete clouds. GPN achieves competitive performance on ShapeNet for point cloud rendering and editing. The generated point clouds are smooth, plausible, and geometrically consistent with the input images. GPN enables point cloud completion while maintaining multi-view consistency with the captured images. The current implementation of GPN requires further exploration to improve speed and accuracy using techniques like Gaussian splatting. Future work can explore incorporating diffusion models for more diverse generation capabilities. point cloud, nerf, generative model, point cloud completion, multi-view consistency
2404.08273 Report Struggle with Adversarial Defense? Try Diffusion Yujie Li, Yanbin Wang, Haitao Xu, Bin Liu, Jianguo Sun, Zhenhao Guo, Wenrui Ma Adversarial attacks induce misclassification by introducing subtle perturbations. Recently, diffusion models are applied to the image classifiers to improve adversarial robustness through adversarial training or by purifying adversarial noise. However, diffusion-based adversarial training often encounters convergence challenges and high computational expenses. Additionally, diffusion-based purification inevitably causes data shift and is deemed susceptible to stronger adaptive attacks. To tackle these issues, we propose the Truth Maximization Diffusion Classifier (TMDC), a generative Bayesian classifier that builds upon pre-trained diffusion models and the Bayesian theorem. Unlike data-driven classifiers, TMDC, guided by Bayesian principles, utilizes the conditional likelihood from diffusion models to determine the class probabilities of input images, thereby insulating against the influences of data shift and the limitations of adversarial training. Moreover, to enhance TMDC's resilience against more potent adversarial attacks, we propose an optimization strategy for diffusion classifiers. This strategy involves post-training the diffusion model on perturbed datasets with ground-truth labels as conditions, guiding the diffusion model to learn the data distribution and maximizing the likelihood under the ground-truth labels. The proposed method achieves state-of-the-art performance on the CIFAR10 dataset against heavy white-box attacks and strong adaptive attacks. Specifically, TMDC achieves robust accuracies of 82.81% against $l_{\infty}$ norm-bounded perturbations and 86.05% against $l_{2}$ norm-bounded perturbations, respectively, with $\epsilon=0.05$. This paper proposes the Truth Maximization Diffusion Classifier (TMDC), a generative Bayesian classifier built on pre-trained diffusion models, to enhance adversarial robustness against image classification attacks. Existing defense strategies like adversarial training and image denoising are either computationally expensive, face convergence issues, or are susceptible to adaptive attacks. This highlights the need for a more robust and efficient defense mechanism. The authors leverage pre-trained diffusion models and Bayesian theorem to compute class probabilities, minimizing the influence of data shift and limitations of adversarial training. They further propose a Truth Maximization optimization strategy, training the diffusion model on perturbed datasets with ground-truth labels to maximize the likelihood under true labels. Diffusion Classifier demonstrates superior robustness against white-box attacks compared to traditional neural networks even without training. Truth Maximization optimization significantly improves the adversarial robustness of the Diffusion Classifier, outperforming conventional adversarial training methods. TMDC achieves state-of-the-art accuracy on CIFAR-10 against strong white-box and combined adaptive attacks (Auto Attack), reaching 82.81% and 86.05% accuracy for l-infinity and l2 norms, respectively, with epsilon=0.05. TMDC still requires training on adversarial samples, posing computational challenges. Future work can explore decoupling training by optimizing the sampling strategy during inference to enhance both robustness and efficiency. diffusion models, adversarial robustness, generative classifier, adversarial attacks, truth maximization
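The Bayesian use of a conditional diffusion model as a classifier described in this entry can be sketched as follows: approximate each class-conditional likelihood by the model's denoising error under that label, then take a softmax over the negative errors. The `eps_model` below is a stand-in for a pretrained class-conditional diffusion model, and the number of Monte Carlo samples is arbitrary; this is the generic diffusion-classifier recipe, not the paper's full truth-maximization training.

```python
# Sketch of a diffusion-based Bayesian classifier: score each candidate class by how well
# a class-conditional denoiser predicts the injected noise, then softmax the negative errors.
# `eps_model(x_t, t, y)` is a placeholder for a pretrained conditional diffusion model.
import torch

@torch.no_grad()
def diffusion_classify(eps_model, x0, num_classes, alpha_bar, n_samples=32):
    """x0: (1, C, H, W). Returns class probabilities of shape (num_classes,)."""
    errors = torch.zeros(num_classes)
    for _ in range(n_samples):
        t = torch.randint(0, len(alpha_bar), (1,))
        eps = torch.randn_like(x0)
        x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
        for y in range(num_classes):
            pred = eps_model(x_t, t, torch.tensor([y]))
            errors[y] += ((pred - eps) ** 2).mean()
    return torch.softmax(-errors / n_samples, dim=0)   # lower denoising error -> higher class probability

if __name__ == "__main__":
    # Dummy stand-in denoiser so the sketch runs end to end.
    dummy = lambda x_t, t, y: x_t * 0.1 + y.float().view(-1, 1, 1, 1) * 0.01
    alpha_bar = torch.linspace(0.999, 0.01, 1000)
    probs = diffusion_classify(dummy, torch.randn(1, 3, 32, 32), num_classes=10, alpha_bar=alpha_bar)
    print(probs)
```

The entry's optimization strategy then post-trains the conditional diffusion model on perturbed data so that the ground-truth label keeps the lowest denoising error under attack.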
2404.08252 Report MonoPatchNeRF: Improving Neural Radiance Fields with Patch-based Monocular Guidance Yuqun Wu, Jae Yong Lee, Chuhang Zou, Shenlong Wang, Derek Hoiem The latest regularized Neural Radiance Field (NeRF) approaches produce poor geometry and view extrapolation for multiview stereo (MVS) benchmarks such as ETH3D. In this paper, we aim to create 3D models that provide accurate geometry and view synthesis, partially closing the large geometric performance gap between NeRF and traditional MVS methods. We propose a patch-based approach that effectively leverages monocular surface normal and relative depth predictions. The patch-based ray sampling also enables the appearance regularization of normalized cross-correlation (NCC) and structural similarity (SSIM) between randomly sampled virtual and training views. We further show that "density restrictions" based on sparse structure-from-motion points can help greatly improve geometric accuracy with a slight drop in novel view synthesis metrics. Our experiments show 4x the performance of RegNeRF and 8x that of FreeNeRF on average F1@2cm for ETH3D MVS benchmark, suggesting a fruitful research direction to improve the geometric accuracy of NeRF-based models, and sheds light on a potential future approach to enable NeRF-based optimization to eventually outperform traditional MVS. Proposes MonoPatchNeRF, a patch-based regularized NeRF model that leverages monocular depth and normal predictions and virtual view appearance consistency priors for accurate 3D models from sparse views. NeRF struggles with accurate geometry and view extrapolation, especially in sparse view scenarios, while MVS methods, though better geometrically, often yield noisy and incomplete models with limited rendering capabilities. Employs patch-based ray sampling to effectively integrate monocular cues, utilizes NCC and SSIM losses for virtual view appearance consistency, and introduces density restrictions based on aligned sparse SfM points to refine geometry. Achieves 4x better geometric accuracy than RegNeRF and 8x better than FreeNeRF on ETH3D. Outperforms other NeRF-based methods in novel view synthesis, ranking best in SSIM and LPIPS. Demonstrates improved handling of challenging large-scale scenes, surpassing MonoSDF in TnT's advanced scenes. Geometric accuracy still falls short of MVS systems, even with MVS supervision. The method is computationally slower than traditional MVS approaches. neural radiance fields, multi-view stereo, 3d reconstruction, monocular depth estimation, sparse view synthesis
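The patch-level appearance term mentioned in this entry, normalized cross-correlation (NCC) between corresponding patches from a virtual view and a training view, is straightforward to write down. The patch size and epsilon below are assumptions, and this sketch omits the SSIM term, the monocular-cue losses, and the sampling of virtual views.

```python
# Sketch of a patch-wise normalized cross-correlation (NCC) loss between rendered
# virtual-view patches and corresponding training-view patches.
import torch

def ncc_loss(patch_a: torch.Tensor, patch_b: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """patch_a, patch_b: (B, P) flattened grayscale patches. Returns 1 - mean NCC."""
    a = patch_a - patch_a.mean(dim=1, keepdim=True)
    b = patch_b - patch_b.mean(dim=1, keepdim=True)
    ncc = (a * b).sum(dim=1) / (a.norm(dim=1) * b.norm(dim=1) + eps)
    return (1.0 - ncc).mean()

if __name__ == "__main__":
    rendered = torch.rand(32, 11 * 11)                        # 11x11 patches from a virtual view
    reference = rendered * 1.5 + 0.1                          # affine photometric change -> NCC stays near 1
    print(ncc_loss(rendered, reference).item())               # close to 0
    print(ncc_loss(rendered, torch.rand(32, 11 * 11)).item()) # much larger
```

Because NCC is invariant to affine brightness changes, it tolerates exposure differences between the virtual and training views, which is one reason patch-based ray sampling (rather than independent rays) is needed to apply it.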
2404.08197 Report Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies Zichao Li, Cihang Xie, Ekin Dogus Cubuk This paper investigates the performance of the Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regards to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality. We also examine how model performance varies with different dataset sizes, suggesting that smaller ViT models are better suited for smaller datasets, while larger models perform better on larger datasets with fixed compute. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resource. Our analysis reveals that CLIP+Data Augmentation can achieve comparable performance to CLIP using only half of the training data. This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications. This paper provides a comprehensive study on scaling down Contrastive Language-Image Pre-training (CLIP) for limited computational budgets, focusing on data, architecture, and training strategies. The goal is to make CLIP models more accessible and affordable for practical use in various applications by providing insights into efficient training and deployment under resource constraints. The authors conduct experiments on the WebLI dataset, comparing different data sizes and qualities, various vision encoder architectures (ViT, CNN), and training strategies (SLIP, FLIP, CLIP, CLIP+Data Augmentation) while evaluating zero-shot, linear probing, and retrieval performances. High-quality data is crucial, as a smaller subset with higher quality can outperform a larger, lower-quality dataset. The choice of vision encoder architecture depends on the dataset size and compute budget; CNNs can be advantageous for smaller datasets, while larger ViTs benefit from larger datasets. Data augmentation techniques, particularly Stacked RandAugment, significantly improve CLIP performance with minimal computational overhead. The study primarily focuses on English language image-text pairs from the WebLI dataset, potentially limiting generalizability to other languages or domains. Future work could explore other efficient architectures and self-supervised learning methods for further computational cost reduction. clip, contrastive learning, vision transformer, data augmentation, resource constraints
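The "CLIP + Data Augmentation" strategy compared in this entry amounts to applying stronger image augmentation inside the standard contrastive objective. A minimal training-step sketch follows; torchvision's RandAugment stands in for the Stacked RandAugment variant discussed in the entry, and the tiny encoders and temperature are placeholders, not the paper's setup.

```python
# Minimal sketch of a CLIP-style contrastive step with strong image augmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandAugment(num_ops=2, magnitude=9),   # stand-in for Stacked RandAugment
    transforms.ToTensor(),
])

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))  # stand-in image tower
    text_encoder = nn.Embedding(1000, 256)                                      # stand-in text tower
    pil = Image.fromarray((torch.rand(256, 256, 3) * 255).byte().numpy())       # fake "web image"
    images = torch.stack([augment(pil) for _ in range(8)])                      # augmented batch
    token_ids = torch.randint(0, 1000, (8,))                                    # fake caption tokens
    loss = clip_contrastive_loss(image_encoder(images), text_encoder(token_ids))
    print(loss.item())
```

The augmentation only changes the image pipeline, which is consistent with the entry's point that it adds minimal compute while letting CLIP reach similar quality with roughly half the data.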
2404.08187 Report Adapting CNNs for Fisheye Cameras without Retraining Ryan Griffiths, Donald G. Dansereau The majority of image processing approaches assume images are in or can be rectified to a perspective projection. However, in many applications it is beneficial to use non conventional cameras, such as fisheye cameras, that have a larger field of view (FOV). The issue arises that these large-FOV images can't be rectified to a perspective projection without significant cropping of the original image. To address this issue we propose Rectified Convolutions (RectConv); a new approach for adapting pre-trained convolutional networks to operate with new non-perspective images, without any retraining. Replacing the convolutional layers of the network with RectConv layers allows the network to see both rectified patches and the entire FOV. We demonstrate RectConv adapting multiple pre-trained networks to perform segmentation and detection on fisheye imagery from two publicly available datasets. Our approach requires no additional data or training, and operates directly on the native image as captured from the camera. We believe this work is a step toward adapting the vast resources available for perspective images to operate across a broad range of camera geometries. This paper proposes Rectified Convolutions (RectConv), a method for adapting pre-trained convolutional networks to operate with new non-perspective images without retraining. Adapting neural networks to new camera technologies typically requires gathering large datasets, even when the operating environment is the same. This work allows for the use of pre-trained networks on novel camera geometries without retraining or significant preprocessing. RectConv modifies convolutional layers to adapt kernel shape to local image geometry using camera calibration parameters. This allows for the processing of distorted images without the need for rectification. RectConv outperforms naive application of pre-trained networks and image rectification methods on fisheye imagery. The method effectively adapts segmentation and detection networks trained on conventional imagery to work with fisheye images from the Woodscape and PIROPO datasets. Converting only the backbone of the network to RectConv yields the most significant performance improvement. Bounding box conversion for object detection in RectConv networks requires further improvement. Future work includes demonstrating RectConv on additional tasks and camera geometries, as well as expanding network conversion to handle more layer types (e.g., deconvolution). fisheye, convolutions, large-fov, cameras, deep learning
2404.08181 Report Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation Sina Hajimiri, Ismail Ben Ayed, Jose Dolz Despite the significant progress in deep learning for dense visual recognition problems, such as semantic segmentation, traditional methods are constrained by fixed class sets. Meanwhile, vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks, owing to their robust generalizability. Recently, a body of work has investigated utilizing these models in open-vocabulary semantic segmentation (OVSS). However, existing approaches often rely on impractical supervised pre-training or access to additional pre-trained networks. In this work, we propose a strong baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP), representing a straightforward adaptation of CLIP tailored for this scenario. Our method enforces localization of patches in the self-attention of CLIP's vision transformer which, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature. By incorporating design choices favouring segmentation, our approach significantly improves performance without requiring additional data, auxiliary pre-trained networks, or extensive hyperparameter tuning, making it highly practical for real-world applications. Experiments are performed on 8 popular semantic segmentation benchmarks, yielding state-of-the-art performance on most scenarios. Our code is publicly available at https://github.com/sinahmr/NACLIP . This paper introduces NACLIP, a training-free open-vocabulary semantic segmentation method that enhances CLIP's localization capability for pixel-wise prediction by enforcing spatial consistency in attention maps within the visual encoder. Existing open-vocabulary semantic segmentation methods rely on impractical supervised pre-training or auxiliary pre-trained networks, limiting their real-world applicability. This work addresses the need for a more practical training-free approach. NACLIP removes the CLS token, modifies the self-attention module to incorporate spatial consistency using a Gaussian kernel, employs a key-based similarity measure, and simplifies the final encoder block architecture for better dense prediction. NACLIP achieves state-of-the-art performance on 7 out of 8 popular OVSS benchmarks without requiring additional data or fine-tuning. It demonstrates robustness to different CLIP visual backbones. Qualitative results highlight NACLIP's improved object boundary detection and contextual understanding compared to other methods. The study acknowledges the potential relevance of the CLS token for dense prediction and suggests further investigation. Future work could explore incorporating additional cues or refining the model for improved performance on specific datasets. semantic segmentation, open-vocabulary, training-free, clip, vision transformer
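The localization trick described in this entry, biasing each patch's attention toward its spatial neighbours with a Gaussian kernel and measuring similarity between keys rather than queries and keys, can be sketched as a drop-in modification of a standard attention block. The single-head formulation, grid size, and sigma below are simplified assumptions, not CLIP's actual layer.

```python
# Sketch of neighbour-aware attention for dense prediction: key-key similarity plus a
# Gaussian spatial bias so each patch attends mostly to nearby patches.
import torch

def gaussian_spatial_bias(h: int, w: int, sigma: float = 2.0) -> torch.Tensor:
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()   # (N, 2) patch grid positions
    d2 = torch.cdist(coords, coords) ** 2                               # squared patch distances
    return -d2 / (2 * sigma ** 2)                                       # (N, N) additive bias

def neighbour_aware_attention(x, w_k, w_v, h, w, sigma=2.0):
    """x: (B, N, D) patch tokens with N == h * w."""
    k = x @ w_k
    v = x @ w_v
    logits = k @ k.transpose(1, 2) / k.shape[-1] ** 0.5      # key-key similarity, no query path
    logits = logits + gaussian_spatial_bias(h, w, sigma).to(x.device)
    return torch.softmax(logits, dim=-1) @ v

if __name__ == "__main__":
    B, h, w, D = 2, 14, 14, 64
    x = torch.randn(B, h * w, D)
    w_k, w_v = torch.randn(D, D), torch.randn(D, D)
    out = neighbour_aware_attention(x, w_k, w_v, h, w)
    print(out.shape)  # torch.Size([2, 196, 64])
```

Since the bias depends only on grid positions, it can be precomputed once per image resolution, keeping the method training-free and cheap at inference.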
2404.08111 Report S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing Guangzhi Wang, Tianyi Chen, Kamran Ghasedi, HsiangTao Wu, Tianyu Ding, Chris Nuesmeyer, Ilya Zharkov, Mohan Kankanhalli, Luming Liang Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues related to the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. Firstly, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Secondly, we propose a semantic disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Thirdly, we present a structured sparse optimization schema that identifies and deactivates malicious neurons to further disentangle impacts from untarget attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results affirm that our approach significantly enhances identity preservation, editing fidelity, as well as temporal consistency. This paper presents S3Editor, a novel Sparse Semantic-disentangled Self-training framework for improving existing face video editing approaches. Current face video editing methods struggle to balance high-quality results with identity preservation, editing faithfulness, and temporal consistency due to limitations in training data, architecture, and optimization strategies. S3Editor utilizes a self-training paradigm with pseudo-edited data, a semantic disentangled architecture for diverse edits, and a structured sparse learning schema to deactivate irrelevant neurons and minimize over-editing. S3Editor significantly enhances identity preservation and editing faithfulness compared to existing methods. The framework improves temporal consistency across video frames, even without explicit temporal constraints. The semantic disentanglement and sparse learning strategies allow for localized edits, minimizing unwanted changes to unrelated facial features. The current implementation requires a predefined set of attributes for clustering, potentially limiting its generalization to entirely novel edits. Future work could explore alternative neuron grouping strategies beyond landmark-based partitioning for sparse learning. face video editing, self-training, semantic disentanglement, sparse learning, temporal consistency
2404.08031 Report Latent Guard: a Safety Framework for Text-to-image Generation Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines. Code and data will be shared at https://github.com/rt219/LatentGuard. Introduces Latent Guard, a framework for improving safety measures in text-to-image generation by detecting blacklisted concepts in the latent space of input text embeddings. Existing safety measures like text blacklists are easily circumvented, while harmful content classifiers require large datasets and lack flexibility. Uses contrastive learning to train an Embedding Mapping Layer on top of pretrained text encoders. This layer maps embeddings of blacklisted concepts and prompts containing them closer together in a latent space. Outperforms baselines like Text Blacklists, CLIPScore, BERTScore, and LLM-based classifiers in detecting unsafe prompts. Demonstrates robustness against adversarial attacks targeting the text encoder. Generalizes well to unseen datasets and concepts, allowing for flexible blacklist modifications at test time. Performance heavily relies on the comprehensiveness of the blacklisted concepts. LLM-generated training data may not fully represent real-world input distributions. text-to-image generation, safety, contrastive learning, latent space, adversarial attacks
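As a rough illustration of the test-time check described above, here is a sketch under stated assumptions: the mapping layer is a small learned MLP, embedding dimensions and the threshold are placeholders, and real use would feed embeddings from the T2I model's frozen text encoder rather than random tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingMappingLayer(nn.Module):
    # Small learned head on top of frozen text-encoder embeddings.
    def __init__(self, dim=768, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, latent_dim), nn.ReLU(), nn.Linear(latent_dim, latent_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def is_unsafe(prompt_emb, blacklist_embs, mapper, threshold=0.5):
    # Flag the prompt if any blacklisted concept lies too close in the learned latent space.
    sims = mapper(prompt_emb) @ mapper(blacklist_embs).T   # cosine similarities
    return bool((sims > threshold).any()), sims.max().item()

mapper = EmbeddingMappingLayer()
prompt_emb = torch.randn(1, 768)       # embedding of the input prompt (placeholder)
blacklist_embs = torch.randn(8, 768)   # embeddings of 8 blacklisted concepts (placeholder)
print(is_unsafe(prompt_emb, blacklist_embs, mapper))
```

Because the blacklist is just a set of concept embeddings, entries can be added or removed at test time without retraining, which matches the flexibility reported above.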
2404.08030 Report Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative Models Mazda Moayeri, Samyadeep Basu, Sriram Balasubramanian, Priyatham Kattakinda, Atoosa Chengini, Robert Brauneis, Soheil Feizi Recent text-to-image generative models such as Stable Diffusion are extremely adept at mimicking and generating copyrighted content, raising concerns amongst artists that their unique styles may be improperly copied. Understanding how generative models copy "artistic style" is more complex than duplicating a single image, as style is composed of a set of elements (or signature) that frequently co-occurs across a body of work, where each individual work may vary significantly. In our paper, we first reformulate the problem of "artistic copyright infringement" to a classification problem over image sets, instead of probing image-wise similarities. We then introduce ArtSavant, a practical (i.e., efficient and easy to understand) tool to (i) determine the unique style of an artist by comparing it to a reference dataset of works from 372 artists curated from WikiArt, and (ii) recognize if the identified style reappears in generated images. We leverage two complementary methods to perform artistic style classification over image sets, including TagMatch, which is a novel inherently interpretable and attributable method, making it more suitable for broader use by non-technical stakeholders (artists, lawyers, judges, etc.). Leveraging ArtSavant, we then perform a large-scale empirical study to provide quantitative insight on the prevalence of artistic style copying across 3 popular text-to-image generative models. Namely, amongst a dataset of prolific artists (including many famous ones), only 20% of them appear to have their styles at risk of being copied via simple prompting of today's popular text-to-image generative models. This paper introduces ArtSavant, a tool designed to detect and articulate potential artistic style copying by text-to-image generative models. The rise of AI models capable of mimicking artistic styles raises copyright concerns for artists. This work addresses the need for a practical and interpretable tool to identify and analyze potential style infringements. The authors curate a dataset of artworks from 372 prolific artists and develop two complementary methods: DeepMatch (a black-box neural network classifier) and TagMatch (an interpretable tag-based classifier using CLIP and a novel tag composition method). They apply these methods to generated images from popular text-to-image models, analyzing match rates and confidences. DeepMatch achieves 89.3% accuracy on real art, indicating the existence of unique artistic styles for most artists. Analysis of generated images reveals that only about 20% of the artists studied are at high risk of style copying by current generative models using simple prompting. TagMatch provides interpretable and attributable evidence of style copying by identifying shared tag signatures between generated images and reference artists. The study's scope is limited to 372 artists, which may not fully represent the vast diversity of artistic styles. The atomic tagging method, while precise, relies on CLIP and may not capture all nuances of artistic style. artistic style copying, copyright infringement, text-to-image generation, deep learning, interpretability
2404.07993 Report Connecting NeRFs, Images, and Text Francesco Ballerini, Pierluigi Zama Ramirez, Roberto Mirabella, Samuele Salti, Luigi Di Stefano Neural Radiance Fields (NeRFs) have emerged as a standard framework for representing 3D scenes and objects, introducing a novel data type for information exchange and storage. Concurrently, significant progress has been made in multimodal representation learning for text and image data. This paper explores a novel research direction that aims to connect the NeRF modality with other modalities, similar to established methodologies for images and text. To this end, we propose a simple framework that exploits pre-trained models for NeRF representations alongside multimodal models for text and image processing. Our framework learns a bidirectional mapping between NeRF embeddings and those obtained from corresponding images and text. This mapping unlocks several novel and useful applications, including NeRF zero-shot classification and NeRF retrieval from images or text. This paper proposes a novel framework to connect Neural Radiance Fields (NeRFs) with other modalities like images and text, enabling applications like zero-shot NeRF classification and NeRF retrieval. As NeRFs become a standard for 3D scene representation, connecting them with existing modalities (like text and images) unlocks new possibilities for information exchange, storage, and multimodal applications. The framework leverages pre-trained models like CLIP and NF2Vec to learn bidirectional mapping between NeRF embeddings and embeddings from corresponding images and text using two simple MLPs. The framework enables zero-shot NeRF classification with accuracy comparable to methods relying on rendered images, but without rendering a single pixel. It allows retrieval of NeRFs from both image and text queries, achieving competitive performance compared to baselines. An adaptation technique using ControlNet is proposed to improve NeRF retrieval from real-world images. The current work is limited to synthetic objects due to reliance on NF2Vec trained on ShapeNet. NeRF generation is constrained by the capabilities of the NF2Vec decoder. neural radiance fields, nerf, multimodal learning, vision-language models, zero-shot classification
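A minimal sketch of the bidirectional-mapping idea, assuming NF2Vec-style NeRF embeddings of width 1024 and a 512-dimensional CLIP space; the widths and MLP shapes are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_in, d_out, hidden=1024):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

nerf2clip = mlp(1024, 512)   # NeRF embedding -> CLIP space (classification, retrieval from NeRFs)
clip2nerf = mlp(512, 1024)   # CLIP embedding -> NeRF space (retrieval from images or text)

def zero_shot_classify(nerf_emb, class_text_embs):
    # Classify a NeRF without rendering a pixel: map it to CLIP space and
    # pick the closest class text embedding by cosine similarity.
    z = F.normalize(nerf2clip(nerf_emb), dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    return (z @ t.T).argmax(dim=-1)

nerf_embs = torch.randn(4, 1024)          # a batch of NF2Vec-style embeddings (placeholder)
class_text_embs = torch.randn(10, 512)    # CLIP text embeddings of 10 class prompts (placeholder)
print(zero_shot_classify(nerf_embs, class_text_embs))   # predicted class indices
```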
2404.07991 Report GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G. Schwing, Shenlong Wang We introduce GoMAvatar, a novel approach for real-time, memory-efficient, high-quality animatable human modeling. GoMAvatar takes as input a single monocular video to create a digital avatar capable of re-articulation in new poses and real-time rendering from novel viewpoints, while seamlessly integrating with rasterization-based graphics pipelines. Central to our method is the Gaussians-on-Mesh representation, a hybrid 3D model combining rendering quality and speed of Gaussian splatting with geometry modeling and compatibility of deformable meshes. We assess GoMAvatar on ZJU-MoCap data and various YouTube videos. GoMAvatar matches or surpasses current monocular human modeling algorithms in rendering quality and significantly outperforms them in computational efficiency (43 FPS) while being memory-efficient (3.63 MB per subject). Introduces GoMAvatar, a novel framework for real-time, memory-efficient, high-quality animatable human modeling from a single monocular video. High-fidelity, animatable digital avatars are crucial for various applications, but conventional methods are slow, expensive, and cumbersome. Affordable methods using only monocular RGB videos are highly desirable. Presents the Gaussians-on-Mesh (GoM) representation, combining rendering quality and speed of Gaussian splatting with the geometry modeling and compatibility of deformable meshes. Leverages Gaussian splats for rendering and a skeleton-driven deformable mesh for articulation. Employs a differentiable shading module to handle view dependency. GoMAvatar matches or surpasses state-of-the-art monocular human modeling algorithms in rendering quality. It significantly outperforms competitors in computational efficiency, achieving a rendering speed of 43 FPS on an NVIDIA A100 GPU. GoMAvatar is memory-efficient, requiring only 3.63 MB per subject. Limited ability to hallucinate unseen regions. Challenges in handling significant topology changes, such as dynamically moving clothing parts. human modeling, animatable avatars, monocular video, gaussians-on-mesh, real-time rendering
2404.07990 Report OpenBias: Open-set Bias Detection in Text-to-Image Generative Models Moreno D'Incà, Elia Peruzzo, Massimiliano Mancini, Dejia Xu, Vidit Goel, Xingqian Xu, Zhangyang Wang, Humphrey Shi, Nicu Sebe Text-to-image generative models are becoming increasingly popular and accessible to the general public. As these models see large-scale deployments, it is necessary to deeply investigate their safety and fairness to not disseminate and perpetuate any kind of biases. However, existing works focus on detecting closed sets of biases defined a priori, limiting the studies to well-known concepts. In this paper, we tackle the challenge of open-set bias detection in text-to-image generative models presenting OpenBias, a new pipeline that identifies and quantifies the severity of biases agnostically, without access to any precompiled set. OpenBias has three stages. In the first phase, we leverage a Large Language Model (LLM) to propose biases given a set of captions. Secondly, the target generative model produces images using the same set of captions. Lastly, a Vision Question Answering model recognizes the presence and extent of the previously proposed biases. We study the behavior of Stable Diffusion 1.5, 2, and XL emphasizing new biases, never investigated before. Via quantitative experiments, we demonstrate that OpenBias agrees with current closed-set bias detection methods and human judgement. Proposes OpenBias, the first open-set bias detection pipeline for text-to-image generative models that identifies, recognizes, and quantifies biases without predefined categories. Existing bias detection methods rely on pre-defined bias categories, limiting their scope and ability to uncover novel biases, which is crucial as AI-generated content becomes increasingly prevalent. OpenBias leverages a Large Language Model (LLM) to propose potential biases and generate related questions from a dataset of captions. Then, a Vision Question Answering (VQA) model assesses the presence and severity of those biases in images generated by the target generative model. OpenBias successfully identifies both well-known and previously unexplored biases across different versions of Stable Diffusion. The pipeline demonstrates a strong agreement with FairFace, a classifier trained for fair predictions, and aligns well with human judgment in a user study. The context-aware analysis highlights the influence of caption elements on bias perpetuation, revealing varying bias intensity depending on the context. OpenBias relies on the performance of the underlying LLM and VQA models, inheriting their limitations and potential biases. The study primarily focuses on qualitative analysis of context-aware biases, leaving room for more systematic quantitative investigation in future work. bias detection, text-to-image generation, open-set recognition, large language models, vision question answering
2404.07987 Report ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, Chen Chen To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and extracted condition. A straightforward implementation would be generating images from random noises and then calculating the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet by 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions. ControlNet++ improves the controllability of text-to-image diffusion models by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls using a pre-trained discriminative reward model. Existing controllable generation methods struggle to accurately align generated images with input image conditions, hindering precise and fine-grained control. The method disrupts the consistency between training images and conditions by adding noise. Then, it uses single-step denoised images for efficient reward fine-tuning, optimizing the consistency between input and predicted conditions (e.g., segmentation masks). ControlNet++ significantly outperforms existing methods in terms of controllability across various conditional controls (e.g., segmentation masks, depth maps, edges). It achieves this without compromising image quality, as evidenced by FID scores comparable or superior to baselines. Images generated by ControlNet++ are effective for downstream tasks, demonstrated by improved performance when used to train a segmentation model. The method's current focus is primarily on controllability, with future work aiming to incorporate quality and aesthetics through human feedback. Expanding the range of controllable conditions (e.g., human pose, scribbles) is another avenue for future development. controllable generation, diffusion model, controlnet, consistency feedback, reward fine-tuning
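The efficient reward strategy can be pictured roughly as follows. This is a hedged sketch: `denoiser`, `reward_model`, the DDPM-style noising, and the toy segmentation condition are stand-ins for illustration, not the paper's actual modules or loss weighting.

```python
import torch
import torch.nn.functional as F

def single_step_x0(denoiser, x0, cond, t, alpha_bar):
    # Noise the real training image, then form a one-step estimate of the clean image,
    # avoiding backprop through a full multi-step sampling chain.
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    eps_pred = denoiser(x_t, t, cond)
    return (x_t - (1 - alpha_bar).sqrt() * eps_pred) / alpha_bar.sqrt()

def consistency_loss(denoiser, reward_model, x0, seg, t, alpha_bar):
    # Extract the condition back from the one-step estimate and compare it to the input condition.
    x0_hat = single_step_x0(denoiser, x0, seg, t, alpha_bar)
    seg_logits = reward_model(x0_hat)            # e.g. a frozen per-pixel classifier
    return F.cross_entropy(seg_logits, seg)      # cycle consistency with the segmentation condition

B, C, H, W, K = 2, 3, 32, 32, 5
x0 = torch.rand(B, C, H, W)
seg = torch.randint(0, K, (B, H, W))                    # toy segmentation condition
denoiser = lambda x, t, c: torch.zeros_like(x)          # dummy epsilon predictor
reward_model = torch.nn.Conv2d(C, K, kernel_size=1)     # dummy reward/segmentation head
print(consistency_loss(denoiser, reward_model, x0, seg, t=500, alpha_bar=torch.tensor(0.6)).item())
```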
2404.07973 Report Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain limitations: constrained by the pre-trained fixed visual encoder and failed to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating the additional DINOv2 encoder, the model learns better and diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing. Introduces Ferret-v2, an upgraded version of the Ferret model for multimodal understanding, featuring enhanced capabilities for handling referring and grounding at any resolution. Addresses the limitations of existing MLLMs in handling high-resolution images and fine-grained visual details for tasks involving regional understanding, such as referring and grounding. Employs a multi-granularity visual encoding strategy using CLIP for global context and DINOv2 for local details, along with a three-stage training paradigm (image-caption alignment, high-resolution dense alignment, intent-enhanced instruction tuning) to bridge global and local visual understanding. Achieves significant performance improvements over the original Ferret and other state-of-the-art models in tasks like Referring Object Classification (ROC) and Referring Expression Comprehension (REC). Demonstrates enhanced ability to handle higher image resolutions, leading to improved accuracy in identifying small objects and details. Exhibits competitive performance on modern MLLM benchmarks by incorporating task-specific datasets and a strategic prompting approach to bridge the gap between regional and global understanding. Potential for generating harmful or counterfactual responses, a common limitation in MLLMs. Limited exploration of different vision encoders for multi-granularity visual encoding. multimodal learning, large language models, referring and grounding, high-resolution image understanding, multi-granularity visual encoding
2404.07949 Report Taming Stable Diffusion for Text to 360° Panorama Image Generation Cheng Zhang, Qianyi Wu, Camilo Cruz Gambardella, Xiaoshui Huang, Dinh Phung, Wanli Ouyang, Jianfei Cai Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at https://chengzhag.github.io/publication/panfusion. Introduces PanFusion, a novel dual-branch diffusion model for generating high-quality, consistent 360° panoramas from text prompts. Addresses the limitations of existing text-to-panorama generation methods, which struggle with issues like loop closure, repetitive elements, and visual inconsistency. Leverages a panorama branch for global layout guidance and a perspective branch to exploit Stable Diffusion's strengths in perspective image generation. Employs an Equirectangular-Perspective Projection Attention (EPPA) mechanism to ensure consistency between the branches. Outperforms state-of-the-art methods in terms of realism and consistency in both panorama and perspective views. Effectively integrates room layout as an additional condition for customized panorama generation. Demonstrates strong generalization ability to out-of-domain prompts, including outdoor scenes. Higher computational complexity due to the dual-branch architecture. Occasional failure to generate entrances for indoor scenes. panorama generation, text-to-image synthesis, diffusion models, equirectangular projection, layout-conditioned generation
2404.07933 Report Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation Keonhee Han, Dominik Muhle, Felix Wimbauer, Daniel Cremers Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and, more recently, depth map predictions only focus on the visible parts of a scene, the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches, e.g., voxel-based methods, density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work, we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end, we propose Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed images, trained fully self-supervised only from image data. Using knowledge distillation, we use MVBTS to train a single-view scene completion network, called KDBTS, via direct supervision. It achieves state-of-the-art performance on occupancy prediction, especially in occluded regions. This paper presents a novel method for improving single-view 3D scene completion by leveraging information from multiple views. Accurate 3D scene understanding is essential for robotics and autonomous driving, and single-view methods often struggle with occlusions. The authors first train a multi-view density field reconstruction model (MVBTS) in a self-supervised manner. Then, they use knowledge distillation to train a single-view model (KDBTS) supervised by the MVBTS predictions. MVBTS effectively fuses density information from multiple views, leading to accurate scene reconstructions. KDBTS achieves state-of-the-art occupancy prediction on the KITTI-360 benchmark, outperforming previous single-view methods. Knowledge distillation from multi-view predictions provides a strong supervisory signal for single-view scene completion. The method assumes a static scene, limiting its performance in dynamic environments. Future work can focus on modeling dynamic objects and view-dependent effects to further enhance reconstruction accuracy. scene completion, density fields, knowledge distillation, multi-view learning, self-supervised learning
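The distillation step can be pictured with a short sketch. This is illustrative only: the real models predict density fields from image features, and the 3D query points follow the paper's rendering setup; the lambdas below are dummy stand-ins for shapes.

```python
import torch
import torch.nn.functional as F

def distill_single_view(student, teacher, image, context_images, points):
    # The multi-view teacher (MVBTS) fuses several posed images; the single-view
    # student (KDBTS) is supervised directly on the teacher's predicted densities.
    with torch.no_grad():
        sigma_teacher = teacher(context_images, points)
    sigma_student = student(image, points)
    return F.l1_loss(sigma_student, sigma_teacher)

points = torch.rand(1, 4096, 3)                              # sampled 3D query points
image = torch.rand(1, 3, 192, 640)                           # single input view
context_images = torch.rand(1, 4, 3, 192, 640)               # multiple posed views
teacher = lambda imgs, p: torch.rand(p.shape[:-1])           # dummy fused multi-view densities
student = lambda img, p: torch.rand(p.shape[:-1], requires_grad=True)  # dummy single-view densities
print(distill_single_view(student, teacher, image, context_images, points).item())
```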
2404.07850 Report MindBridge: A Cross-Subject Brain Decoding Framework Shizun Wang, Songhua Liu, Zhenxiong Tan, Xinchao Wang Brain decoding, a pivotal field in neuroscience, aims to reconstruct stimuli from acquired brain signals, primarily utilizing functional magnetic resonance imaging (fMRI). Currently, brain decoding is confined to a per-subject-per-model paradigm, limiting its applicability to the same individual for whom the decoding model is trained. This constraint stems from three key challenges: 1) the inherent variability in input dimensions across subjects due to differences in brain size; 2) the unique intrinsic neural patterns, influencing how different individuals perceive and process sensory information; 3) limited data availability for new subjects in real-world scenarios hampers the performance of decoding models. In this paper, we present a novel approach, MindBridge, that achieves cross-subject brain decoding by employing only one model. Our proposed framework establishes a generic paradigm capable of addressing these challenges by introducing biological-inspired aggregation function and novel cyclic fMRI reconstruction mechanism for subject-invariant representation learning. Notably, by cycle reconstruction of fMRI, MindBridge can enable novel fMRI synthesis, which also can serve as pseudo data augmentation. Within the framework, we also devise a novel reset-tuning method for adapting a pretrained model to a new subject. Experimental results demonstrate MindBridge's ability to reconstruct images for multiple subjects, which is competitive with dedicated subject-specific models. Furthermore, with limited data for a new subject, we achieve a high level of decoding accuracy, surpassing that of subject-specific models. This advancement in cross-subject brain decoding suggests promising directions for wider applications in neuroscience and indicates potential for more efficient utilization of limited fMRI data in real-world scenarios. Project page: https://littlepure2333.github.io/MindBridge This paper proposes MindBridge, a novel framework for cross-subject brain decoding using fMRI, overcoming the limitations of subject-specific models by learning subject-invariant representations. Current brain decoding requires training one model per subject, limiting its applicability. MindBridge allows a single model to decode brain signals from multiple subjects, enabling broader applications in neuroscience and efficient use of limited fMRI data. MindBridge utilizes an adaptive signal aggregation function to unify fMRI signal sizes across subjects and a cyclic fMRI reconstruction mechanism for subject-invariant representation learning. It also introduces a reset-tuning strategy for adapting to new subjects with limited data. MindBridge achieves comparable brain decoding performance to state-of-the-art subject-specific methods using only one model. It effectively adapts to new subjects with limited data, surpassing methods trained from scratch. MindBridge enables novel fMRI synthesis, transforming one subject's fMRI signal into another's while preserving semantic meaning. Evaluation is limited to the NSD dataset due to the scarcity of high-quality fMRI data. The serialization of fMRI signals as 1D vectors might disrupt the original spatial relationships. brain decoding, fmri, cross-subject learning, diffusion models, neuroscience
2404.07794 Report DGMamba: Domain Generalization via Generalized State Space Model Shaocong Long, Qianyu Zhou, Xiangtai Li, Xuequan Lu, Chenhao Ying, Yuan Luo, Lizhuang Ma, Shuicheng Yan Domain generalization (DG) aims at solving distribution shift problems in various scenes. Existing approaches are based on Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), which suffer from limited receptive fields or quadratic complexity issues. Mamba, as an emerging state space model (SSM), possesses superior linear complexity and global receptive fields. Despite this, it can hardly be applied to DG to address distribution shifts, due to the hidden state issues and inappropriate scan mechanisms. In this paper, we propose a novel framework for DG, named DGMamba, that excels in strong generalizability toward unseen domains and meanwhile has the advantages of global receptive fields and efficient linear complexity. Our DGMamba comprises two core components: Hidden State Suppressing (HSS) and Semantic-aware Patch Refining (SPR). In particular, HSS is introduced to mitigate the influence of hidden states associated with domain-specific features during output prediction. SPR strives to encourage the model to concentrate more on objects rather than context, consisting of two designs: Prior-Free Scanning (PFS), and Domain Context Interchange (DCI). Concretely, PFS aims to shuffle the non-semantic patches within images, creating more flexible and effective sequences from images, and DCI is designed to regularize Mamba with the combination of mismatched non-semantic and semantic information by fusing patches among domains. Extensive experiments on four commonly used DG benchmarks demonstrate that the proposed DGMamba achieves remarkably superior results to state-of-the-art models. The code will be made publicly available. This paper introduces DGMamba, a novel state space model-based framework for domain generalization, aiming to improve the generalizability of models like Mamba on unseen domains while preserving their global receptive fields and linear complexity advantages. Existing CNN- or ViT-based domain generalization methods suffer from limitations such as local receptive fields (CNNs) or quadratic complexities (ViTs). Mamba, as a state space model, holds promise but lacks inherent mechanisms to handle domain shifts effectively. DGMamba tackles these issues with two core components: 1) Hidden State Suppressing (HSS) mitigates the impact of domain-specific information accumulated in hidden states during propagation. 2) Semantic-aware Patch Refining (SPR), comprising Prior-Free Scanning (PFS) and Domain Context Interchange (DCI), encourages the model to focus on objects rather than domain-specific context by shuffling and interchanging non-semantic patches. DGMamba significantly outperforms state-of-the-art domain generalization methods on five benchmarks (PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet). Ablation studies confirm that HSS, PFS, and DCI all contribute to the performance improvement. DGMamba achieves superior generalization performance with fewer parameters and lower computational complexity compared to CNN- and ViT-based counterparts. Exploration of feature/domain prompts in SSM-based models for enhanced representation learning. Extension of DGMamba to high-structure tasks like domain-generalized semantic segmentation and object detection. domain generalization, state space model, mamba, hidden state suppressing, semantic-aware patch refining
2404.07771 Report An Overview of Diffusion Models: Applications, Guided Generation, Statistical Rates and Optimization Minshuo Chen, Song Mei, Jianqing Fan, Mengdi Wang Diffusion models, a powerful and universal generative AI technology, have achieved tremendous success in computer vision, audio, reinforcement learning, and computational biology. In these applications, diffusion models provide flexible high-dimensional data modeling, and act as a sampler for generating new samples under active guidance towards task-desired properties. Despite the significant empirical success, theory of diffusion models is very limited, potentially slowing down principled methodological innovations for further harnessing and improving diffusion models. In this paper, we review emerging applications of diffusion models, understanding their sample generation under various controls. Next, we overview the existing theories of diffusion models, covering their statistical properties and sampling capabilities. We adopt a progressive routine, beginning with unconditional diffusion models and connecting to conditional counterparts. Further, we review a new avenue in high-dimensional structured optimization through conditional diffusion models, where searching for solutions is reformulated as a conditional sampling problem and solved by diffusion models. Lastly, we discuss future directions about diffusion models. The purpose of this paper is to provide a well-rounded theoretical exposure for stimulating forward-looking theories and methods of diffusion models. This paper reviews the theory and applications of diffusion models, a powerful class of generative AI models, focusing on their ability to learn data distributions and generate new samples under various controls. Despite the significant empirical success of diffusion models, their theoretical understanding lags behind, potentially hindering further methodological innovations. The paper reviews existing theoretical results on diffusion models, covering score function approximation and estimation, sampling guarantees, and distribution learning. It adopts a progressive approach, starting with unconditional models and extending to conditional ones. Diffusion models can efficiently learn complex data distributions, achieving minimax optimal rates for distribution estimation. The sample complexity of score estimation in diffusion models can be significantly reduced when data lie on a low-dimensional subspace, circumventing the curse of dimensionality. Conditional diffusion models can be used for black-box optimization by formulating it as a conditional sampling problem, generating high-fidelity solutions that optimize a reward function while preserving data latent structures. Theoretical understanding of conditional diffusion models, especially regarding guidance design and adaptation to specific tasks, remains limited. Principled methodologies for tuning the strength of guidance in conditional diffusion models are still lacking. diffusion models, generative ai, score matching, sample complexity, black-box optimization
2404.07724 Report Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, Jaakko Lehtinen Guidance is a crucial technique for extracting the best performance out of image-generating diffusion models. Traditionally, a constant guidance weight has been applied throughout the sampling chain of an image. We show that guidance is clearly harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle. We thus restrict it to a specific range of noise levels, improving both the inference speed and result quality. This limited guidance interval improves the record FID in ImageNet-512 significantly, from 1.81 to 1.40. We show that it is quantitatively and qualitatively beneficial across different sampler parameters, network architectures, and datasets, including the large-scale setting of Stable Diffusion XL. We thus suggest exposing the guidance interval as a hyperparameter in all diffusion models that use guidance. This paper proposes limiting classifier-free guidance to a specific interval of noise levels during the sampling process of diffusion models, rather than applying it constantly. Constant guidance throughout the sampling chain can be detrimental, especially at high and low noise levels. This work shows that restricting guidance to a specific interval improves both image quality and inference speed. The authors modify the diffusion ODE to incorporate a piecewise constant guidance weight function, enabling guidance only within a defined interval of noise levels. They evaluate their method using ImageNet with EDM2 and DiT-XL/2 models and qualitatively analyze Stable Diffusion XL outputs. Limiting the guidance interval significantly improves FID scores on ImageNet-512, achieving a new state-of-the-art of 1.40 with EDM2-XXL. The method consistently improves results across different sampler parameters, network architectures (EDM2, DiT, Stable Diffusion XL), and datasets. Qualitative analysis reveals that the proposed technique leads to better preservation of image composition and more natural color saturation compared to standard CFG. The optimal guidance interval is currently determined through grid search or visual inspection, and future work could explore automatic estimation methods. Further investigation into the interaction between guidance and non-ideal aspects of trained denoisers is needed. diffusion models, classifier-free guidance, image generation, sampling techniques, fid
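The core change is small enough to show as a sketch. This is hedged: the sampler integration, the sigma parameterization, and the interval thresholds are illustrative, not the paper's exact EDM2 setup.

```python
import torch

def guided_eps(denoiser, x_t, sigma, cond, w=3.0, sigma_lo=0.3, sigma_hi=5.0):
    # Classifier-free guidance applied only inside a noise-level interval;
    # outside it, the plain conditional prediction is used (which also skips
    # the unconditional forward pass and saves compute).
    eps_cond = denoiser(x_t, sigma, cond)
    if not (sigma_lo <= sigma <= sigma_hi):
        return eps_cond
    eps_uncond = denoiser(x_t, sigma, None)
    return eps_uncond + w * (eps_cond - eps_uncond)

denoiser = lambda x, s, c: torch.zeros_like(x)     # dummy model for shape checking
x = torch.randn(1, 3, 64, 64)
for sigma in (20.0, 1.0, 0.05):                    # high, middle, and low noise levels
    print(sigma, guided_eps(denoiser, x, sigma, cond="a photo of a corgi").shape)
```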
2404.07600 Report Implicit and Explicit Language Guidance for Diffusion-based Visual Perception Hefeng Wang, Jiale Cao, Jin Xie, Aiping Yang, Yanwei Pang Text-to-image diffusion models have shown powerful ability in conditional image synthesis. With large-scale vision-language pre-training, diffusion models are able to generate high-quality images with rich texture and reasonable structure under different text prompts. However, it is an open problem to adapt the pre-trained diffusion model for visual perception. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP. Our IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model, without using explicit text prompts. The explicit branch utilizes the ground-truth labels of corresponding images as text prompts to condition feature extraction of the diffusion model. During training, we jointly train the diffusion model by sharing the model weights of these two branches. As a result, implicit and explicit branches can jointly guide feature learning. During inference, we only employ the implicit branch for final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks, including semantic segmentation and depth estimation. Our IEDP achieves promising performance on both tasks. For semantic segmentation, our IEDP has a single-scale mIoU score of 55.9% on the ADE20K validation set, which outperforms the baseline method VPD by 2.2%. For depth estimation, our IEDP outperforms the baseline method VPD with a relative gain of 11.0%. This paper proposes IEDP, an implicit and explicit language guidance framework leveraging pre-trained text-to-image diffusion models for visual perception tasks. Existing methods for adapting diffusion models to perception tasks either rely on unaligned text prompts or require cumbersome caption generation during inference. This work aims to address these limitations. IEDP consists of two branches: 1) Implicit branch: generates image-aligned text embeddings directly from input images using a frozen CLIP image encoder and a learnable adapter. 2) Explicit branch: utilizes ground-truth labels of training images as text prompts to condition feature extraction, jointly training the model with the implicit branch. Only the implicit branch is used during inference. IEDP achieves a mIoU score of 55.9% on ADE20K for semantic segmentation, outperforming the baseline VPD by 2.2%. For depth estimation on NYUv2, IEDP attains an RMSE of 0.226, surpassing VPD by a relative gain of 11.0%. IEDP demonstrates a favorable trade-off between performance and inference time compared to existing diffusion-based perception methods. The explicit branch currently relies on ground-truth labels during training, limiting its applicability to fully unsupervised settings. Future work could explore alternative approaches for generating implicit text embeddings, potentially incorporating object-level information. diffusion models, language guidance, visual perception, semantic segmentation, depth estimation
2404.07554 Report CAT: Contrastive Adapter Training for Personalized Image Generation Jae Wan Park, Sang Hyun Park, Jun Young Koh, Junha Lee, Min Song The emergence of various adapters, including Low-Rank Adaptation (LoRA) applied from the field of natural language processing, has allowed diffusion models to personalize image generation at a low cost. However, due to various challenges, including limited datasets and a shortage of regularization and computation resources, adapter training often results in unsatisfactory outcomes, leading to the corruption of the backbone model's prior knowledge. One well-known phenomenon is the loss of diversity in object generation, especially within the same class, which leads to generating almost identical objects with minor variations. This poses challenges in generation capabilities. To solve this issue, we present Contrastive Adapter Training (CAT), a simple yet effective strategy to enhance adapter training through the application of the CAT loss. Our approach facilitates the preservation of the base model's original knowledge when the model initiates adapters. Furthermore, we introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to preserve the base model's prior knowledge. We qualitatively and quantitatively compare CAT's improvements. Finally, we discuss the potential of CAT for multi-concept adapters and further optimization. This paper introduces Contrastive Adapter Training (CAT), a novel method for personalized image generation using diffusion models. CAT enhances adapter training, particularly LoRA, by preserving the base model's knowledge and preventing overfitting. Personalized image generation is crucial for various applications, but existing adapter training methods often lead to knowledge corruption and poor generalization. This paper addresses this by introducing a method that preserves the original model's capabilities while enabling personalized generation. CAT leverages a contrastive loss function that minimizes the difference in noise prediction between the original and adapted models without token conditioning. It encourages the adapter to specialize in personalized generation while retaining the base model's general knowledge. CAT successfully mitigates underfitting and knowledge corruption problems in consistent generation adaptations. The paper introduces Knowledge Preservation Score (KPS), a novel metric to quantitatively assess the preservation of original model knowledge after adapter training. Qualitative and quantitative evaluations demonstrate CAT's effectiveness in preserving knowledge and achieving high-fidelity identity generation compared to existing methods. The paper acknowledges the limitations of not including CLIP score-based diversity and fidelity calculation due to its instability. Future work aims to establish a reliable benchmark for consistent character generation, investigate CAT's impact on different domain knowledge, and expand CAT to support multi-concept training with per-token loss. image generation, diffusion models, adapter training, personalization, knowledge preservation
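A hedged sketch of a CAT-style objective as described above: the adapted model fits the personalization data under its trigger prompt, while a second term keeps its prediction close to the frozen base model when no trigger token is present. The module names, trigger-token handling, and weighting are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cat_style_loss(base_unet, adapted_unet, x_t, t, noise, cond_with_token, cond_plain, lam=1.0):
    # Standard diffusion loss on the personalization data, conditioned on the trigger token.
    personalization = F.mse_loss(adapted_unet(x_t, t, cond_with_token), noise)
    # Preservation term: without the trigger token, the adapted model should
    # predict the same noise as the frozen base model.
    with torch.no_grad():
        eps_base = base_unet(x_t, t, cond_plain)
    preservation = F.mse_loss(adapted_unet(x_t, t, cond_plain), eps_base)
    return personalization + lam * preservation

x_t = torch.randn(2, 4, 32, 32)
noise = torch.randn_like(x_t)
base_unet = lambda x, t, c: torch.zeros_like(x)     # dummy frozen base model
adapted_unet = lambda x, t, c: x * 0.1              # dummy adapter-augmented model
print(cat_style_loss(base_unet, adapted_unet, x_t, t=10, noise=noise,
                     cond_with_token="a sks character", cond_plain="a character").item())
```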
2404.07448 Report Transferable and Principled Efficiency for Open-Vocabulary Segmentation Jingxuan Xu, Wuyang Chen, Yao Zhao, Yunchao Wei Recent success of pre-trained foundation vision-language models makes Open-Vocabulary Segmentation (OVS) possible. Despite the promising performance, this approach introduces heavy computational overheads for two challenges: 1) large model sizes of the backbone; 2) expensive costs during the fine-tuning. These challenges hinder this OVS strategy from being widely applicable and affordable in real-world scenarios. Although traditional methods such as model compression and efficient fine-tuning can address these challenges, they often rely on heuristics. This means that their solutions cannot be easily transferred and necessitate re-training on different models, which comes at a cost. In the context of efficient OVS, we target achieving performance that is comparable to or even better than prior OVS works based on large vision-language foundation models, by utilizing smaller models that incur lower training costs. The core strategy is to make our efficiency principled and thus seamlessly transferable from one OVS framework to others without further customization. Comprehensive experiments on diverse OVS benchmarks demonstrate our superior trade-off between segmentation accuracy and computation costs over previous works. Our code is available on https://github.com/Xujxyang/OpenTrans This paper proposes OpenTrans, a transferable open-vocabulary segmentation technique using smaller models and less training costs without sacrificing performance. Current open-vocabulary segmentation (OVS) methods rely on large vision-language foundation models, leading to heavy computational overheads in model size and training costs, hindering their wider application. The authors achieve efficiency through two steps: 1) iteratively prune the CLIP image encoder without semantic supervision to obtain a transferable subnetwork applicable to various OVS frameworks and 2) selectively fine-tune layers based on heavy-tail spectrum analysis of pretrained weights to reduce training costs. Transferable subnetworks significantly reduce model size and computational costs (up to 54.4% and 47.2% respectively) while preserving or even improving OVS performance. Principled layer-selective fine-tuning further reduces training costs by up to 12%, leading to a cumulative reduction of 32.6% when combined with the subnetwork. OpenTrans achieves a strong balance between OVS accuracy and efficiency, outperforming previous methods in efficiency while maintaining competitive performance on diverse benchmarks. The method currently focuses on convolutional backbones and can be extended to larger backbones or ViT architectures. Future work could explore more fine-grained weight selection for fine-tuning and application to other open-vocabulary tasks like object detection. open-vocabulary segmentation, model efficiency, transferable subnetwork, efficient fine-tuning, heavy-tail analysis
2404.07389 Report Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models Yasi Zhang, Peiyu Yu, Ying Nian Wu Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a z-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects' attention towards their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models. This paper proposes Object-Conditioned Energy-Based Attention Map Alignment (EBAMA) to enhance semantic alignment between generated images and text prompts in text-to-image diffusion models. Existing text-to-image models often fail to capture the full semantic meaning of text prompts, leading to issues like incorrect attribute binding and object neglect. The method leverages object-centric attention loss, derived from maximizing the log-likelihood of an object-conditioned Energy-Based Model (EBM), to align attention maps and an intensity regularizer to ensure object presence. EBAMA outperforms previous methods in quantitative metrics (Full Sim., Min. Sim., T-C Sim.) across AnE, DVMP, and ABC-6K datasets. Human evaluation confirms EBAMA's superiority in text-image alignment, particularly for complex, natural-language prompts. The method effectively enhances text-controlled attribute editing capabilities compared to methods like PtP. The method's effectiveness is limited by the expressive power of the base Stable Diffusion model. EBAMA degrades to regular diffusion model generation when no objects are present in the text prompt. text-to-image synthesis, diffusion models, attention mechanisms, semantic alignment, energy-based models
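A rough sketch of an object-centric attention alignment term of this flavour. This is an illustration with assumed token indices, cosine similarity, and a simple contrastive form, not the paper's exact energy-based formulation or intensity regularizer.

```python
import torch
import torch.nn.functional as F

def attribute_binding_loss(attn_maps, obj_idx, attr_idx, other_obj_idx, tau=0.1):
    # attn_maps: (num_tokens, H, W) cross-attention maps at one denoising step.
    # Pull the attribute's map toward its object's map; push it away from other objects' maps.
    def sim(a, b):
        return F.cosine_similarity(a.flatten(), b.flatten(), dim=0) / tau
    pos = sim(attn_maps[attr_idx], attn_maps[obj_idx])
    negs = torch.stack([sim(attn_maps[attr_idx], attn_maps[j]) for j in other_obj_idx])
    return -torch.log(torch.exp(pos) / (torch.exp(pos) + torch.exp(negs).sum()))

maps = torch.rand(10, 16, 16)   # toy cross-attention maps for a 10-token prompt
print(attribute_binding_loss(maps, obj_idx=2, attr_idx=1, other_obj_idx=[5, 7]).item())
```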
2404.07206 Report GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models Zewei Zhang, Huan Liu, Jun Chen, Xiangyu Xu In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose an information-preserving motion supervision operation that maintains the original features of the starting point for precise manipulation and artifact reduction. In addition, we contribute to the benchmarking of drag editing by introducing a new dataset, Drag100, and developing dedicated quality assessment metrics, Dragging Accuracy Index and Gemini Score, utilizing Large Multimodal Models. Extensive experiments demonstrate that the proposed GoodDrag compares favorably against the state-of-the-art approaches both qualitatively and quantitatively. The project page is https://gooddrag.github.io. This paper introduces GoodDrag, a novel approach for drag editing that improves stability and image quality by alternating drag and denoising operations (AlDD) within the diffusion process and using information-preserving motion supervision. Existing drag editing methods suffer from instability, distortions, and inaccurate point control, especially in diffusion-based approaches. GoodDrag addresses these issues, enabling more precise and higher-quality edits. GoodDrag alternates drag operations with denoising steps throughout the diffusion process (AlDD) to prevent accumulation of perturbations. It also introduces information-preserving motion supervision to maintain the original features of the starting point during dragging. GoodDrag effectively reduces artifacts and improves the accuracy of point movement compared to existing methods. Quantitative evaluations using the proposed Dragging Accuracy Index (DAI) and Gemini Score (GScore) demonstrate GoodDrag's superior performance. A user study confirms GoodDrag's ability to achieve more precise and visually appealing drag editing results. GoodDrag's reliance on iterative optimization can lead to longer processing times. Future work includes exploring GoodDrag's integration with other image editing techniques and extending it to video editing. drag editing, diffusion models, image manipulation, information-preserving motion supervision, alternating drag and denoising
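The alternating schedule can be summarized as a short control-flow sketch; the `drag_step` and `denoise_step` callables are placeholders for the paper's motion-supervision update and the diffusion denoising step, and the toy scalars below only demonstrate the loop structure.

```python
def alternating_drag_denoise(latent, timesteps, drag_step, denoise_step, drags_per_step=3):
    # AlDD-style loop: a few drag optimization steps, then one denoising step,
    # so perturbations are corrected as they appear instead of accumulating.
    for t in timesteps:
        for _ in range(drags_per_step):
            latent = drag_step(latent, t)      # move handle points toward their targets
        latent = denoise_step(latent, t)       # immediately denoise the perturbed latent
    return latent

# Toy usage with scalar "latents" just to show the control flow.
out = alternating_drag_denoise(1.0, range(5), lambda z, t: z + 0.1, lambda z, t: z * 0.9)
print(round(out, 4))
```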
2404.07204 Report BRAVE: Broadening the visual encoding of vision-language models Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a more broad and contextualized visual understanding of VLMs. This paper introduces BRAVE, a method for enhancing vision-language models (VLMs) by consolidating features from multiple vision encoders with diverse biases. Existing VLMs often suffer from limitations due to the restricted capabilities of single vision encoders, such as blindness to specific image features or visual hallucinations. BRAVE utilizes a novel multi-encoder querying transformer (MEQT) to efficiently combine features from various frozen vision encoders into a compact visual representation. This representation serves as a soft visual prompt for a frozen language model, requiring minimal trainable parameters. BRAVE achieves state-of-the-art performance on various captioning (COCO, NoCaps) and VQA benchmarks (OKVQA, GQA, VizWiz-QA, MMVP, POPE). It exhibits improved robustness against out-of-distribution inputs and visual hallucinations compared to single-encoder VLMs. BRAVE maintains efficiency with fewer trainable parameters and lower pre-training data requirements than several existing methods. The current design requires forward passes from all encoders, potentially limiting inference speed. Future work could explore adaptive mechanisms for encoder selection. While BRAVE demonstrates improved sample efficiency, further research is needed to reduce its reliance on large pre-training datasets. vision-language models, multi-encoder fusion, visual prompting, image captioning, visual question answering
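A minimal sketch of the multi-encoder querying idea, assuming each frozen encoder's tokens are linearly projected to a shared width and a fixed set of learnable queries cross-attends to the concatenation; the widths, query count, and head count are illustrative assumptions rather than BRAVE's actual configuration.

```python
import torch
import torch.nn as nn

class MultiEncoderQuerying(nn.Module):
    def __init__(self, encoder_dims, width=768, num_queries=32):
        super().__init__()
        # One projection per frozen encoder, mapping its feature width to a shared width.
        self.proj = nn.ModuleList(nn.Linear(d, width) for d in encoder_dims)
        self.queries = nn.Parameter(torch.randn(num_queries, width) * 0.02)
        self.attn = nn.MultiheadAttention(width, num_heads=8, batch_first=True)

    def forward(self, features):                     # list of (B, N_i, D_i) tensors
        tokens = torch.cat([p(f) for p, f in zip(self.proj, features)], dim=1)
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        out, _ = self.attn(q, tokens, tokens)        # compact visual prompt: (B, num_queries, width)
        return out

feats = [torch.randn(2, 196, 1024), torch.randn(2, 256, 768)]   # tokens from two frozen encoders
print(MultiEncoderQuerying([1024, 768])(feats).shape)           # torch.Size([2, 32, 768])
```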
2404.07191 Report InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, Ying Shan We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To enhance the training efficiency and exploit more geometric supervisions, e.g, depths and normals, we integrate a differentiable iso-surface extraction module into our framework and directly optimize on the mesh representation. Experimental results on public datasets demonstrate that InstantMesh significantly outperforms other latest image-to-3D baselines, both qualitatively and quantitatively. We release all the code, weights, and demo of InstantMesh, with the intention that it can make substantial contributions to the community of 3D generative AI and empower both researchers and content creators. InstantMesh is a feed-forward framework for fast, high-quality 3D mesh generation from a single image. Creating 3D assets from single-view images is valuable for various applications, including virtual reality, industrial design, and entertainment. InstantMesh addresses limitations in speed and quality of previous methods. The framework uses a two-stage approach: (1) Generates multi-view images from a single input image using a fine-tuned Zero123++ diffusion model. (2) Reconstructs a 3D mesh from these images using a sparse-view large reconstruction model, integrating a differentiable iso-surface extraction module for efficiency and geometric supervision. Achieves state-of-the-art performance on image-to-3D generation, surpassing existing baselines in quantitative metrics and qualitative comparisons. Generates plausible novel views with high perceptual quality, as measured by SSIM and LPIPS metrics. Produces smoother and more reliable 3D geometry compared to methods using alternative representations like triplanes or Gaussians. Limited triplane resolution from the transformer decoder might hinder high-definition modeling. Reliance on the diffusion model's multi-view consistency can impact the final quality; improved architectures are expected to mitigate this. 3d mesh generation, image-to-3d, diffusion models, large reconstruction models, generative ai
2404.07178 Report Move Anything with Layered Scene Diffusion Jiawei Ren, Mengmeng Xu, Jui-Chieh Wu, Ziwei Liu, Tao Xiang, Antoine Toisoul Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second. Introduces SceneDiffusion, a training-free approach for controllable scene generation and image editing using pre-trained text-to-image diffusion models. Addresses the limitation of existing diffusion models in providing fine-grained spatial control due to their fixed forward noising process. Optimizes a layered scene representation during the diffusion sampling process by jointly denoising multiple scene layouts at each step. This disentangles spatial layout from visual appearance. Generates scenes where objects can be moved, resized, cloned, and their appearance can be edited independently. Enables object moving for in-the-wild images by using the sampling trajectory of a reference image as an anchor. Outperforms prior works on image quality and layout consistency metrics for both controllable scene generation and image editing tasks. Object appearance may not perfectly align with the mask in the final rendered image. High memory consumption for simultaneous denoising of multiple layouts. diffusion models, controllable scene generation, image editing, layered scene representation, spatial disentanglement
2404.07177 Report Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, J. Zico Kolter Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. In recent times, data curation has gained prominence with several works developing strategies to retain 'high-quality' subsets of 'raw' scraped data. For instance, the LAION public dataset retained only 10% of the total crawled data. However, these strategies are typically developed agnostic of the available compute for training. In this paper, we first demonstrate that making filtering decisions independent of training compute is often suboptimal: the limited high-quality data rapidly loses its utility when repeated, eventually requiring the inclusion of 'unseen' but 'lower-quality' data. To address this quality-quantity tradeoff ($\texttt{QQT}$), we introduce neural scaling laws that account for the non-homogeneous nature of web data, an angle ignored in existing literature. Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various quality subsets of web data; (ii) account for how utility diminishes for a data point at its 'nth' repetition; and (iii) formulate the mutual interaction of various data pools when combined, enabling the estimation of model performance on a combination of multiple data pools without ever jointly training on them. Our key message is that data curation $\textit{cannot}$ be agnostic of the total compute that a model will be trained for. Our scaling laws allow us to curate the best possible pool for achieving top performance on Datacomp at various compute budgets, carving out a pareto-frontier for data curation. Code is available at https://github.com/locuslab/scaling_laws_data_filtering. The paper introduces the first neural scaling laws that consider data quality and compute budget for vision-language models trained on heterogeneous web data. Existing data filtering methods for vision-language model training are agnostic of compute budget, leading to suboptimal performance at larger scales. The authors propose scaling laws that model the diminishing utility of data with repetitions and formulate the interaction of data pools of varying quality to estimate model performance on combinations of these pools. Data filtering strategies must be compute-aware, as the benefit of high-quality data diminishes with repetitions at large compute budgets. The proposed scaling laws accurately predict model performance on combinations of data pools without requiring training on these combinations. The scaling laws enable the identification of pareto-optimal data filtering strategies for different compute budgets, guiding data curation for vision-language models. The scaling laws do not account for batch size variations, which can significantly impact contrastive learning performance. The consistency of scaling parameters across different data pool sizes needs further investigation to enable extrapolation to very large-scale training. scaling laws, data filtering, vision-language models, contrastive learning, data curation
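The exact functional form of the paper's scaling laws is not reproduced in the summary above. As a purely illustrative sketch (the functional form, parameters, and numbers below are assumptions, not the paper's fit), one can model the quality-quantity tradeoff by letting a pool's utility decay geometrically with each repetition and plugging the resulting "fresh-equivalent" sample count into a power law:

```python
import numpy as np

def predicted_error(pool_size, total_samples_seen, b, tau, delta, e_min=0.0):
    """Toy repetition-aware scaling law (not the paper's exact form).

    The pool is seen for k = total_samples_seen / pool_size epochs, and its utility 'b'
    is assumed to decay geometrically by 'tau' at each repetition, giving an effective
    number of fresh-equivalent samples
        n_eff = pool_size * b * (1 - tau**k) / (1 - tau)
    and an error that follows a power law e = e_min + n_eff**(-delta).
    """
    k = total_samples_seen / pool_size
    n_eff = pool_size * b * (1 - tau ** k) / (1 - tau)
    return e_min + n_eff ** (-delta)

# Aggressively filtered small pool vs. a larger, lower-quality pool at growing compute.
for seen in [1e7, 1e8, 1e9]:
    e_small = predicted_error(pool_size=1e7, total_samples_seen=seen, b=1.0, tau=0.5, delta=0.3)
    e_large = predicted_error(pool_size=1e8, total_samples_seen=seen, b=0.7, tau=0.5, delta=0.3)
    print(f"seen={seen:.0e}  high-quality pool: {e_small:.4f}  larger mixed pool: {e_large:.4f}")
```

Under these toy parameters the small high-quality pool wins at low compute while the larger, lower-quality pool overtakes it at high compute, which is the qualitative behaviour the paper's compute-aware curation argument rests on.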
2404.07153 Report Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations Ofir Shifman, Yair Weiss Deep neural networks that achieve remarkable performance in image classification have previously been shown to be easily fooled by tiny transformations such as a one pixel translation of the input image. In order to address this problem, two approaches have been proposed in recent years. The first approach suggests using huge datasets together with data augmentation in the hope that a highly varied training set will teach the network to learn to be invariant. The second approach suggests using architectural modifications based on sampling theory to deal explicitly with image translations. In this paper, we show that these approaches still fall short in robustly handling 'natural' image translations that simulate a subtle change in camera orientation. Our findings reveal that a mere one-pixel translation can result in a significant change in the predicted image representation for approximately 40% of the test images in state-of-the-art models (e.g., open-CLIP trained on LAION-2B or DINO-v2), while models that are explicitly constructed to be robust to cyclic translations can still be fooled with 1 pixel realistic (non-cyclic) translations 11% of the time. We present Robust Inference by Crop Selection: a simple method that can be proven to achieve any desired level of consistency, although with a modest tradeoff with the model's accuracy. Importantly, we demonstrate how employing this method reduces the ability to fool state-of-the-art models with a 1 pixel translation to less than 5% while suffering from only a 1% drop in classification accuracy. Additionally, we show that our method can be easily adjusted to deal with circular shifts as well. In such a case, we achieve 100% robustness to integer shifts with state-of-the-art accuracy, and with no need for any further training. This paper reveals that modern neural networks, including those trained on massive datasets and those designed for translation invariance, are still susceptible to small, realistic image translations, and proposes a method called Robust Inference by Crop Selection (RICS) to address this issue. Robustness to small image transformations is crucial for reliable performance in real-world applications, especially as deep neural networks are increasingly used as foundational models for various tasks. The RICS method enhances robustness by deterministically selecting a sub-crop from the input image during inference, ensuring consistency in feature representation despite translations. The paper provides theoretical analysis and experimental validation of RICS. Even a single-pixel translation can significantly alter the predictions of state-of-the-art models like open-CLIP and DINO-v2. Methods designed for cyclic translation invariance remain vulnerable to realistic, non-cyclic translations. RICS significantly improves robustness to realistic translations, achieving over 95% adversarial robustness with minimal impact on accuracy. The theoretical guarantee of robustness diminishes with increasing translation size. The current method only handles integer translations, limiting its applicability to sub-pixel shifts. robustness, translation invariance, neural networks, image classification, deep learning
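The paper's exact crop-selection rule is not given in the summary above; the following sketch (assumed and simplified, not the paper's rule) only illustrates the underlying idea of deterministic, content-anchored crop selection: if the crop location is a function of image content, a small translation of the input shifts the selected crop by the same amount, so the cropped content, and hence the prediction, stays consistent.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def select_crop(image, crop=224, smooth=31):
    """Illustrative content-anchored crop selection (not the paper's exact rule):
    anchor the crop at the argmax of a smoothed intensity map, so that a small
    translation of the input moves the anchor by the same amount and the crop
    content stays (almost) identical, giving translation-consistent predictions."""
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    heat = uniform_filter(gray.astype(np.float64), size=smooth)
    ay, ax = np.unravel_index(np.argmax(heat), heat.shape)
    top = int(np.clip(ay - crop // 2, 0, gray.shape[0] - crop))
    left = int(np.clip(ax - crop // 2, 0, gray.shape[1] - crop))
    return image[top:top + crop, left:left + crop]

# The crop of a 1-pixel-shifted image matches the crop of the original
# (as long as the anchor is away from the image border).
rng = np.random.default_rng(0)
img = rng.random((300, 300, 3)) * 0.5
img[140:160, 140:160] += 0.5          # a bright region that anchors the crop near the centre
shifted = np.roll(img, shift=1, axis=1)
c0, c1 = select_crop(img), select_crop(shifted)
print(np.allclose(c0, c1))            # True
```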
2404.07106 Report 3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion Yixuan Li, Weidong Yang, Ben Fei Point cloud completion aims to generate a complete and high-fidelity point cloud from an initially incomplete and low-quality input. A prevalent strategy involves leveraging Transformer-based models to encode global features and facilitate the reconstruction process. However, the adoption of pooling operations to obtain global feature representations often results in the loss of local details within the point cloud. Moreover, the attention mechanism inherent in Transformers introduces additional computational complexity, rendering it challenging to handle long sequences effectively. To address these issues, we propose 3DMambaComplete, a point cloud completion network built on the novel Mamba framework. It comprises three modules: HyperPoint Generation encodes point cloud features using Mamba's selection mechanism and predicts a set of Hyperpoints. A specific offset is estimated, and the down-sampled points become HyperPoints. The HyperPoint Spread module disperses these HyperPoints across different spatial locations to avoid concentration. Finally, a deformation method transforms the 2D mesh representation of HyperPoints into a fine-grained 3D structure for point cloud reconstruction. Extensive experiments conducted on various established benchmarks demonstrate that 3DMambaComplete surpasses state-of-the-art point cloud completion methods, as confirmed by qualitative and quantitative analyses. This paper proposes 3DMambaComplete, a novel point cloud completion network based on a 3D Mamba architecture, that addresses the limitations of Transformer-based methods by achieving linear complexity and a global receptive field for effective completion of long sequences. Existing Transformer-based point cloud completion methods suffer from loss of local details due to pooling operations and quadratic complexity of attention mechanisms, hindering their scalability. 3DMambaComplete utilizes a HyperPoint Generation module to produce hyperpoints, employs a HyperPoint Spread module to disperse them spatially, and uses a Point Deformation module to transform points into a high-quality 3D structure. 3DMambaComplete outperforms state-of-the-art methods on the PCN dataset, achieving the lowest chamfer distance in each category. The method significantly surpasses previous techniques on the real-world KITTI dataset, especially for highly incomplete data. 3DMambaComplete demonstrates superior performance on the ShapeNet55 dataset, particularly under high masking ratios, accurately reconstructing complete shapes with fine-grained details. While 3DMambaComplete exhibits slightly higher parameters and FLOPS compared to some methods due to incorporating downsampled points, visualizations suggest their contribution to reconstruction effectiveness. Future work will focus on exploring the impact of different sampling strategies within the 3DMambaComplete framework to further enhance its efficiency. point cloud completion, structured state space model, deep learning, mamba, hyperpoint
2404.06913 Report Sparse Global Matching for Video Frame Interpolation with Large Motion Chunxu Liu, Guozhen Zhang, Rui Zhao, Limin Wang Large motion poses a critical challenge in Video Frame Interpolation (VFI) task. Existing methods are often constrained by limited receptive fields, resulting in sub-optimal performance when handling scenarios with large motion. In this paper, we introduce a new pipeline for VFI, which can effectively integrate global-level information to alleviate issues associated with large motion. Specifically, we first estimate a pair of initial intermediate flows using a high-resolution feature map for extracting local details. Then, we incorporate a sparse global matching branch to compensate for flow estimation, which consists of identifying flaws in initial flows and generating sparse flow compensation with a global receptive field. Finally, we adaptively merge the initial flow estimation with global flow compensation, yielding a more accurate intermediate flow. To evaluate the effectiveness of our method in handling large motion, we carefully curate a more challenging subset from commonly used benchmarks. Our method demonstrates the state-of-the-art performance on these VFI subsets with large motion. This paper introduces a novel sparse global matching pipeline for Video Frame Interpolation (VFI) that effectively addresses the challenge of large motion. Large motion poses significant difficulties for VFI tasks due to the limitations of local receptive fields in accurately estimating optical flow, leading to sub-optimal performance. The proposed method uses a two-step strategy: (1) estimate initial intermediate flows using a high-resolution feature map for capturing local details, and (2) incorporate a sparse global matching branch to compensate for errors in the initial flow estimations, specifically targeting regions with large motion identified through a difference map. The method achieves state-of-the-art performance on challenging VFI subsets with large motion, including X-Test-L, Xiph-L, and SNU-FILM-L. A significant improvement in PSNR is observed, reaching up to 0.66 dB enhancement by correcting errors in the initial flow estimation using the sparse global matching technique. The approach effectively combines local details with global correlations for accurate intermediate flow estimation, leading to improved visual quality in synthesized frames. The primary limitation lies in the computational cost of the global feature extractor used in the sparse global matching branch. Future work can focus on exploring lighter and more efficient alternatives for global feature extraction or distilling knowledge from pre-trained optical flow models. video frame interpolation, large motion, sparse global matching, optical flow, deep learning
2404.06903 Report DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting Shijie Zhou, Zhiwen Fan, Dejia Xu, Haoran Chang, Pradyumna Chari, Tejas Bharadwaj, Suya You, Zhangyang Wang, Achuta Kadambi The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. In order to address invisible issues inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: http://dreamscene360.github.io/ Presents DreamScene360, a method for unconstrained text-to-3D scene generation with panoramic Gaussian splatting, enabling the generation of immersive and geometrically consistent 360-degree 3D scenes from text prompts. Addresses the limitations of previous text-to-3D methods that struggle with unbounded scenes, constrained viewpoints, and geometric inconsistencies. Offers a solution for generating immersive 3D experiences from text descriptions. Employs a multi-round self-refinement module with GPT-4V for text prompt revision and panoramic image generation. Utilizes a pretrained text-to-360° panoramic image diffusion model and optimizes geometric fields with monocular depth estimation. Leverages panoramic Gaussian splatting for efficient and detailed 3D scene representation. Generates high-fidelity and diverse 3D scenes from text prompts of varying specificity. Demonstrates superior performance compared to baseline methods in terms of geometric consistency and visual quality across different viewpoints. Successfully handles both bounded indoor and unbounded outdoor scenes, enabling immersive exploration. Relies on pretrained models for panoramic image generation and depth estimation, potentially limiting generalization to unseen domains. Computational cost of optimizing Gaussian splatting representation can be high, particularly for complex scenes. text-to-3d, 3d scene generation, panoramic gaussian splatting, gpt-4v, immersive experience
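A small sketch of the lifting step mentioned in the summary, i.e. unprojecting an equirectangular panorama with monocular depth into a point cloud that can initialise the 3D Gaussian centroids. The depth alignment and Gaussian optimisation stages of the paper are omitted; the function below is an assumed, generic equirectangular unprojection.

```python
import numpy as np

def panorama_to_points(depth, rgb=None):
    """Unproject an equirectangular (360-degree) depth map to a 3D point cloud.
    Only the lifting step is sketched; the paper additionally aligns and optimises
    the monocular depth globally before using the points to initialise 3D Gaussians."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    theta = (u + 0.5) / W * 2.0 * np.pi          # azimuth in [0, 2*pi)
    phi = (v + 0.5) / H * np.pi                  # polar angle in [0, pi]
    dirs = np.stack([np.sin(phi) * np.cos(theta),
                     np.cos(phi),
                     np.sin(phi) * np.sin(theta)], axis=-1)   # unit view directions
    points = (dirs * depth[..., None]).reshape(-1, 3)
    colors = rgb.reshape(-1, 3) if rgb is not None else None
    return points, colors

depth = np.full((256, 512), 2.0)                 # toy constant-depth panorama
pts, _ = panorama_to_points(depth)
print(pts.shape, np.allclose(np.linalg.norm(pts, axis=1), 2.0))  # (131072, 3) True
```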
2404.06851 Report UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion Junsheng Zhou, Weiqi Zhang, Baorui Ma, Kanle Shi, Yu-Shen Liu, Zhizhong Han Diffusion models have shown remarkable results for image generation, editing and inpainting. Recent works explore diffusion models for 3D shape generation with neural implicit functions, i.e., signed distance function and occupancy function. However, they are limited to shapes with closed surfaces, which prevents them from generating diverse 3D real-world contents containing open surfaces. In this work, we present UDiFF, a 3D diffusion model for unsigned distance fields (UDFs) which is capable to generate textured 3D shapes with open surfaces from text conditions or unconditionally. Our key idea is to generate UDFs in spatial-frequency domain with an optimal wavelet transformation, which produces a compact representation space for UDF generation. Specifically, instead of selecting an appropriate wavelet transformation which requires expensive manual efforts and still leads to large information loss, we propose a data-driven approach to learn the optimal wavelet transformation for UDFs. We evaluate UDiFF to show our advantages by numerical and visual comparisons with the latest methods on widely used benchmarks. Page: https://weiqi-zhang.github.io/UDiFF. UDiFF, a 3D diffusion model for generating textured 3D shapes with open surfaces from text conditions or unconditionally, leveraging an optimal wavelet transformation for compact UDF representation. Existing 3D implicit diffusion models are limited to closed shapes, hindering the generation of diverse real-world content with open surfaces. This work addresses this limitation and introduces a novel approach for compact UDF representation. UDiFF employs a data-driven approach to learn an optimal wavelet filter for UDF compression and reconstruction, minimizing information loss. It uses a conditional diffusion framework with cross-attention for text-guided generation and a fine predictor for high-fidelity results. Surfaces are extracted using DCUDF and textured with Text2Tex. UDiFF outperforms state-of-the-art methods in generating open-surface shapes on DeepFashion3D. It achieves comparable performance to leading methods on closed-surface shape generation on ShapeNet. Ablation studies confirm the effectiveness of the optimal wavelet transformation and fine predictor. The adapted DCUDF meshing may not be perfectly accurate for complex open surfaces. Exploring alternative meshing strategies and higher-resolution UDF generation are potential future directions. 3d shape generation, diffusion models, unsigned distance fields, text-to-3d, open surfaces
2404.06832 Report SplatPose & Detect: Pose-Agnostic 3D Anomaly Detection Mathis Kruse, Marco Rudolph, Dominik Woiwode, Bodo Rosenhahn Detecting anomalies in images has become a well-explored problem in both academia and industry. State-of-the-art algorithms are able to detect defects in increasingly difficult settings and data modalities. However, most current methods are not suited to address 3D objects captured from differing poses. While solutions using Neural Radiance Fields (NeRFs) have been proposed, they suffer from excessive computation requirements, which hinder real-world usability. For this reason, we propose the novel 3D Gaussian splatting-based framework SplatPose which, given multi-view images of a 3D object, accurately estimates the pose of unseen views in a differentiable manner, and detects anomalies in them. We achieve state-of-the-art results in both training and inference speed, and detection performance, even when using less training data than competing methods. We thoroughly evaluate our framework using the recently proposed Pose-agnostic Anomaly Detection benchmark and its multi-pose anomaly detection (MAD) data set. This paper introduces SplatPose, a novel method for pose-agnostic anomaly detection in images using 3D Gaussian Splatting. Existing methods for anomaly detection struggle with varying object poses. While NERF-based solutions exist, they are computationally expensive. This work aims to solve both challenges. SplatPose represents objects as 3D Gaussian clouds learned from multi-view images. During inference, it estimates the pose of a query image by transforming the Gaussian cloud and compares the rendered image to the query to detect anomalies. SplatPose achieves state-of-the-art anomaly detection results on the MAD dataset, outperforming NERF-based methods. It significantly reduces training time by 55x and inference time by 13x compared to competitors. The method demonstrates superior pose estimation accuracy compared to iNeRF, contributing to its improved anomaly detection. Limitations include reliance on coarse pose estimation and the need for improvements in image feature comparison. Future work will focus on real-world data adaptation, application to human pose estimation, and integrating 3D information into 2D approaches. anomaly detection, pose estimation, 3d gaussian splatting, novel view synthesis, computer vision
2404.06814 Report Zero-shot Point Cloud Completion Via 2D Priors Tianxin Huang, Zhiwen Yan, Yuyang Zhao, Gim Hee Lee 3D point cloud completion is designed to recover complete shapes from partially observed point clouds. Conventional completion methods typically depend on extensive point cloud data for training, with their effectiveness often constrained to object categories similar to those seen during training. In contrast, we propose a zero-shot framework aimed at completing partially observed point clouds across any unseen categories. Leveraging point rendering via Gaussian Splatting, we develop techniques of Point Cloud Colorization and Zero-shot Fractal Completion that utilize 2D priors from pre-trained diffusion models to infer missing regions. Experimental results on both synthetic and real-world scanned point clouds demonstrate that our approach outperforms existing methods in completing a variety of objects without any requirement for specific training data. This paper introduces a novel zero-shot framework for 3D point cloud completion, leveraging 2D priors from pre-trained diffusion models through Gaussian Splatting. Existing completion methods are limited by training data diversity, struggling with unseen object categories. This method utilizes 2D priors to improve robustness and generalizability for unseen categories. The method involves Point Cloud Colorization, estimating a reference viewpoint and generating a colorized image. Then, Zero-shot Fractal Completion optimizes 3D Gaussians guided by 2D priors from a diffusion model, conditioned on the reference image, to complete missing regions. Outperforms state-of-the-art completion methods on synthetic data. Demonstrates superior performance on real-world scans, generalizing well to unseen categories. Successfully completes point clouds derived from LiDAR sensors, showcasing its versatility. The optimization process for each point cloud can be time-consuming. Large gaps in edge regions of the reference view may lead to completion defects. point cloud completion, gaussian splatting, diffusion model, zero-shot learning, 3d vision
2404.06780 Report Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior Fan Lu, Kwan-Yee Lin, Yan Xu, Hongsheng Li, Guang Chen, Changjun Jiang Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation to large-scale urban scenes that cover over 1000m driving distance for the first time. We also present various scene editing demonstrations, showing the powers of steerable urban scene generation. Website: https://urbanarchitect.github.io. Introduces Urban Architect, a method for steerable 3D urban scene generation leveraging 3D layout priors and text-to-image diffusion models, enabling large-scale, high-quality, and editable scene creation. Existing text-to-3D methods struggle to handle the complexity and scale of urban scenes, lacking interpretability for ambiguous textual descriptions and suitable representations for unbounded environments. Employs a compositional 3D layout representation with semantic primitives and introduces Layout-Guided Variational Score Distillation (LG-VSD) for layout-constrained optimization and a Scalable Hash Grid (SHG) structure for unbounded scene representation. Generates large-scale urban scenes covering over 1000m driving distance with high quality and diversity. Outperforms previous 3D generative methods in FID and KID metrics on the KITTI-360 dataset. Supports diverse scene editing effects, including instance manipulation and style transfer, by leveraging the flexibility of the layout representation and diffusion models. Current optimization process lacks pixel-level scene control. Future work explores integrating semantic segmentation into the pipeline for enhanced control. text-to-3d generation, urban scene generation, 3d layout prior, score distillation sampling, scalable hash grid
2404.06773 Report Adapting LLaMA Decoder to Vision Transformer Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue, resulting in failure of the network training. We suggest to reposition the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior. The tailored model, dubbed as image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representation by elevating attention map ranks. iLLaMA rivals the performance with its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M and pre-training on ImageNet-21K further enhances the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation and CIFAR transfer learning. We hope our study can kindle fresh views to visual model design in the wave of LLMs. Pre-trained models and codes are available here. This paper investigates the adaptation of decoder-only Transformers, like LLaMA (originally for LLMs), to computer vision tasks by proposing a novel model called image LLaMA (iLLaMA). This work aims to bridge the architectural gap between encoder-only visual models and decoder-only textual models, a timely and relevant issue in the era of LLMs. The authors progressively modify a standard ViT to align with LLaMA's architecture, proposing techniques like 'post-sequence class token' to address attention collapse and a 'soft mask' strategy to facilitate optimization. iLLaMA achieves competitive ImageNet-1K accuracy, reaching 75.1% with only 5.7M parameters and 86.0% with ~310M parameters after ImageNet-21K pre-training. Causal self-attention in iLLaMA boosts computational efficiency and learns complex representations, as evidenced by higher attention map ranks. iLLaMA exhibits promising properties such as calibration, shape-texture bias, quantization compatibility, and transfer learning capabilities for semantic segmentation (ADE20K) and image classification (CIFAR). iLLaMA's application is currently explored mainly for perception tasks, leaving room for investigating its potential in more complex tasks like reasoning and generation. The impact of the masking mechanism in iLLaMA's causal attention on high-resolution dense prediction tasks requires further investigation and optimization. vision transformer, llama, decoder-only architecture, causal self-attention, image recognition
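The two tricks named in the summary, the post-sequence class token and the soft causal mask, can be illustrated with a few lines of PyTorch. The interpolation schedule below is an assumption chosen for illustration; the paper's exact soft-mask schedule and architecture are not reproduced.

```python
import torch

def soft_causal_mask(seq_len, progress):
    """Blend a fully visible mask into a causal one as training progresses.
    progress in [0, 1]: 0 -> no masking, 1 -> strict causal attention.
    Returned as an additive attention bias (0 = keep, large negative = drop)."""
    causal = torch.tril(torch.ones(seq_len, seq_len))        # 1 = allowed
    soft = (1.0 - progress) + progress * causal              # interpolate visibility
    return torch.log(soft.clamp(min=1e-9))                   # additive log-bias

def append_class_token(image_tokens, cls_token):
    """Post-sequence class token: under a causal mask, only a token placed at the END
    of the sequence can attend to every image token, so the class token is appended
    rather than prepended as in a standard ViT."""
    B = image_tokens.shape[0]
    return torch.cat([image_tokens, cls_token.expand(B, 1, -1)], dim=1)

tokens = append_class_token(torch.randn(2, 196, 384), torch.zeros(1, 1, 384))
bias = soft_causal_mask(tokens.shape[1], progress=0.3)
print(tokens.shape, bias.shape)   # torch.Size([2, 197, 384]) torch.Size([197, 197])
# With the strict causal mask (progress=1), the last row is all zeros: the class token
# at the end of the sequence attends to every image token.
print(torch.all(soft_causal_mask(197, 1.0)[-1] == 0).item())  # True
```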
2404.06727 Report Bayesian NeRF: Quantifying Uncertainty with Volume Density in Neural Radiance Fields Sibeak Lee, Kyeongsu Kang, Hyeonwoo Yu We present the Bayesian Neural Radiance Field (NeRF), which explicitly quantifies uncertainty in geometric volume structures without the need for additional networks, making it adept for challenging observations and uncontrolled images. NeRF diverges from traditional geometric methods by offering an enriched scene representation, rendering color and density in 3D space from various viewpoints. However, NeRF encounters limitations in relaxing uncertainties by using geometric structure information, leading to inaccuracies in interpretation under insufficient real-world observations. Recent research efforts aimed at addressing this issue have primarily relied on empirical methods or auxiliary networks. To fundamentally address this issue, we propose a series of formulational extensions to NeRF. By introducing generalized approximations and defining density-related uncertainty, our method seamlessly extends to manage uncertainty not only for RGB but also for depth, without the need for additional networks or empirical assumptions. In experiments we show that our method significantly enhances performance on RGB and depth images in the comprehensive dataset, demonstrating the reliability of the Bayesian NeRF approach to quantifying uncertainty based on the geometric structure. This document provides a template and guidelines for submitting papers to ECCV [Year] conference. It ensures consistent formatting, anonymity for double-blind review, and adherence to ECCV policies. The paper details formatting rules for text, headings, figures, formulas, citations, references, and more. It also provides examples for anonymization and cross-referencing. The document clarifies the double-blind review policy, emphasizing the importance of anonymization while still citing one's prior work appropriately. It specifies the strict page limit of 14 pages for the main content (excluding references) to maintain a fair and manageable review process. The guidelines aim to homogenize the submissions, aiding the reviewers in their task and ultimately benefiting both authors and readers. The document assumes the use of LaTeX, although a Word template is available. Authors using Word are solely responsible for ensuring format consistency. Specific information regarding camera-ready manuscript preparation is deferred until after the paper decisions are announced, potentially leaving authors uninformed on certain aspects. conference paper formatting, eccv, double-blind review, latex template, academic writing
2404.06542 Report Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average points in terms of mIoU and without requiring any training. This paper proposes FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation. This method utilizes diffusion models to localize generated concepts and leverages local-global similarities to match image regions with semantic classes. Existing open-vocabulary segmentation methods, while effective, rely on computationally expensive training over large image-caption datasets. This paper offers a training-free alternative using diffusion models, which are known for their ability to visually ground generated concepts. FreeDA works in two phases. First, in an offline phase, it generates textual-visual reference embeddings using diffusion models. These embeddings capture semantic instances with their textual and visual context. Second, at inference, these references are used to compute local and global similarities to segment an input image. FreeDA achieves state-of-the-art performance on five datasets for open-vocabulary semantic segmentation, surpassing previous methods by a significant margin. The approach demonstrates the effectiveness of combining local similarities based on self-supervised visual features and global similarities from a multimodal encoder (CLIP). FreeDA shows robustness even without using superpixels for mask refinement, achieving competitive results and surpassing some PAMR-refined methods. The method relies on the quality of the generated prototypes from the diffusion model; inaccurate generations could impact segmentation. Further research on effectively incorporating complex visual contexts and handling instances where objects are partially obscured could improve performance. open-vocabulary semantic segmentation, diffusion models, training-free methods, local-global similarity, zero-shot learning
2404.06451 Report SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions Xiaoyu Liu, Yuxiang Wei, Ming Liu, Xianhui Lin, Peiran Ren, Xuansong Xie, Wangmeng Zuo Human visual imagination usually begins with analogies or rough sketches. For example, given an image with a girl playing guitar before a building, one may analogously imagine what it would look like if Iron Man were playing guitar before a Pyramid in Egypt. Nonetheless, the visual condition may not be precisely aligned with the imaginary result indicated by the text prompt, and existing layout-controllable text-to-image (T2I) generation models are prone to producing degraded generated results with obvious artifacts. To address this issue, we present a novel T2I generation method dubbed SmartControl, which is designed to modify the rough visual conditions for adapting to text prompt. The key idea of our SmartControl is to relax the visual condition on the areas that are conflicted with text prompts. In specific, a Control Scale Predictor (CSP) is designed to identify the conflict regions and predict the local control scales, while a dataset with text prompts and rough visual conditions is constructed for training CSP. It is worth noting that, even with a limited number (e.g., 1,000~2,000) of training samples, our SmartControl can generalize well to unseen objects. Extensive experiments on four typical visual condition types clearly show the efficacy of our SmartControl against state-of-the-arts. Source code, pre-trained models, and datasets are available at https://github.com/liuxiaoyu1104/SmartControl. Introduces SmartControl, a text-to-image generation method that modifies rough visual conditions to better align with user text prompts, enabling photorealistic image synthesis even with unaligned input conditions. Existing layout-controllable text-to-image generation models struggle to produce high-quality images when the provided visual conditions (e.g., edges, depth maps) don't precisely match the user's textual description, leading to artifacts. SmartControl leverages a Control Scale Predictor (CSP) to identify regions where visual conditions conflict with the text prompt. It then predicts a local control scale map, allowing the model to relax constraints in conflicting areas while preserving structural guidance from the visual condition. Achieves superior image-text alignment (measured by CLIP Score) compared to state-of-the-art controllable generation methods on a dataset with rough conditions. Demonstrates robust generalization, effectively adapting to other text-to-image models like IP-Adapter without requiring retraining. Maintains high image quality and fidelity to user intent even with a limited training dataset (1,000-2,000 images) for the CSP. Evaluation of self-similarity relies on pseudo-ground truths due to the unpaired nature of the dataset. Further exploration of alternative network architectures for the control scale predictor, potentially improving efficiency and performance. text-to-image generation, controlnet, rough conditions, control scale predictor, image synthesis
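The Control Scale Predictor is described above only at a high level. The following is a hypothetical sketch of the general mechanism: a small network predicts a per-pixel scale in [0, 1] from the UNet and ControlNet features, and the control residual is multiplied by this map before being added, so conflicting regions are relaxed. The architecture and feature shapes are assumptions.

```python
import torch
import torch.nn as nn

class ControlScalePredictor(nn.Module):
    """Hypothetical sketch of a SmartControl-style Control Scale Predictor: predict a
    per-pixel scale alpha in [0, 1] from the UNet features and the ControlNet features,
    then relax the control signal wherever it conflicts with the text prompt."""

    def __init__(self, unet_ch, ctrl_ch, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(unet_ch + ctrl_ch, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
            nn.Sigmoid(),                      # alpha in [0, 1]
        )

    def forward(self, unet_feat, ctrl_feat):
        alpha = self.net(torch.cat([unet_feat, ctrl_feat], dim=1))   # (B, 1, H, W)
        return unet_feat + alpha * ctrl_feat, alpha                   # scaled residual add

unet_feat = torch.randn(1, 320, 32, 32)
ctrl_feat = torch.randn(1, 320, 32, 32)
fused, alpha = ControlScalePredictor(320, 320)(unet_feat, ctrl_feat)
print(fused.shape, alpha.min().item() >= 0.0, alpha.max().item() <= 1.0)
```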
2404.06429 Report Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion Fan Yang, Jianfeng Zhang, Yichun Shi, Bowen Chen, Chenxu Zhang, Huichao Zhang, Xiaofeng Yang, Jiashi Feng, Guosheng Lin Benefiting from the rapid development of 2D diffusion models, 3D content creation has made significant progress recently. One promising solution involves the fine-tuning of pre-trained 2D diffusion models to harness their capacity for producing multi-view images, which are then lifted into accurate 3D models via methods like fast-NeRFs or large reconstruction models. However, as inconsistency still exists and the generated resolution is limited, the generation results of such methods still lack intricate textures and complex geometries. To solve this problem, we propose Magic-Boost, a multi-view conditioned diffusion model that significantly refines coarse generative results through a brief period of SDS optimization (~15 min). Compared to previous text or single-image based diffusion models, Magic-Boost exhibits a robust capability to generate images with high consistency from pseudo synthesized multi-view images. It provides precise SDS guidance that well aligns with the identity of the input images, enriching the local detail in both geometry and texture of the initial generative results. Extensive experiments show Magic-Boost greatly enhances the coarse inputs and generates high-quality 3D assets with rich geometric and textural details. (Project Page: https://magic-research.github.io/magic-boost/) This paper introduces Magic-Boost, a multi-view conditioned diffusion model that refines coarse 3D generation results through a short (~15 minute) period of SDS optimization. Coarse 3D assets obtained by lifting multi-view images from fine-tuned 2D diffusion models still lack intricate textures and complex geometry because of residual multi-view inconsistency and limited generation resolution. Magic-Boost is conditioned on pseudo multi-view images synthesized from the coarse result; it generates highly consistent images and provides precise SDS guidance that stays aligned with the identity of the inputs while enriching local geometric and textural detail. Extensive experiments show that Magic-Boost substantially enhances coarse inputs and produces high-quality 3D assets with rich geometric and textural details. 3d generation, multi-view diffusion, score distillation sampling, refinement, image-to-3d
2404.06425 Report ZeST: Zero-Shot Material Transfer from a Single Image Ta-Ying Cheng, Prafull Sharma, Andrew Markham, Niki Trigoni, Varun Jampani We propose ZeST, a method for zero-shot material transfer to an object in the input image given a material exemplar image. ZeST leverages existing diffusion adapters to extract implicit material representation from the exemplar image. This representation is used to transfer the material using pre-trained inpainting diffusion model on the object in the input image using depth estimates as geometry cue and grayscale object shading as illumination cues. The method works on real images without any training resulting a zero-shot approach. Both qualitative and quantitative results on real and synthetic datasets demonstrate that ZeST outputs photorealistic images with transferred materials. We also show the application of ZeST to perform multiple edits and robust material assignment under different illuminations. Project Page: https://ttchengab.github.io/zest Introduces ZeST, a zero-shot method for transferring materials from a single exemplar image to objects in another image, leveraging pre-trained diffusion models. Addresses the challenging and time-consuming task of 2D material editing, eliminating the need for 3D models, explicit material properties, or training data. Combines an image encoder (IP-Adapter) to extract material representation, depth-based ControlNet for geometry guidance, and inpainting diffusion with foreground decoloring for illumination cues. Achieves high-fidelity material transfer while preserving object geometry and scene illumination. Outperforms baselines in both qualitative and quantitative comparisons, demonstrating superior material fidelity and photorealism. Enables applications like multi-object editing, 3D texturing, and lighting-aware material transfer. Occasionally exhibits partial material transfer or blends multiple materials from the exemplar. Future work includes improving material localization within the exemplar and exploring user interaction for finer control. material transfer, diffusion models, zero-shot learning, image editing, computer graphics
2404.06270 Report 3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis Zhicheng Lu, Xiang Guo, Le Hui, Tianrui Chen, Min Yang, Xiao Tang, Feng Zhu, Yuchao Dai In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis. Existing neural radiance fields (NeRF) based solutions learn the deformation in an implicit manner, which cannot incorporate 3D scene geometry. Therefore, the learned deformation is not necessarily geometrically coherent, which results in unsatisfactory dynamic view synthesis and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting provides a new representation of the 3D scene, building upon which the 3D geometry could be exploited in learning the complex 3D deformation. Specifically, the scenes are represented as a collection of 3D Gaussian, where each 3D Gaussian is optimized to move and rotate over time to model the deformation. To enforce the 3D scene geometry constraint during deformation, we explicitly extract 3D geometry features and integrate them in learning the 3D deformation. In this way, our solution achieves 3D geometry-aware deformation modeling, which enables improved dynamic view synthesis and 3D dynamic reconstruction. Extensive experimental results on both synthetic and real datasets prove the superiority of our solution, which achieves new state-of-the-art performance. The project is available at https://npucvr.github.io/GaGS/ This paper proposes a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis, exploiting 3D scene geometry to learn more coherent deformations. Existing NeRF-based solutions lack geometric coherence in deformation, leading to unsatisfactory dynamic view synthesis and reconstruction. The method leverages 3D Gaussian Splatting to represent scenes as a collection of 3D Gaussians. It extracts 3D geometry features using sparse convolution on voxelized Gaussian distributions and integrates these features into a deformation field that models Gaussian transformations over time. Continuous 6D rotation representation enhances accurate rotation estimation. The method achieves state-of-the-art performance on synthetic and real dynamic scene datasets (D-NeRF and HyperNeRF). Ablation studies confirm the effectiveness of geometry-aware feature extraction, 6D rotation representation, and density control adaptations. Visualization results demonstrate accurate 3D reconstruction and temporal interpolation capabilities. The method struggles with scenes containing points that abruptly appear or disappear. Performance is limited in handling complex motions and long video sequences. dynamic view synthesis, gaussian splatting, 3d geometry, deformation modeling, neural radiance fields
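The summary credits part of the gains to a continuous 6D rotation representation for the per-Gaussian rotations. The conversion from the 6D output to a valid rotation matrix via Gram-Schmidt is standard and is sketched below; the surrounding deformation network is not reproduced.

```python
import torch

def rotation_6d_to_matrix(d6):
    """Map a continuous 6D rotation representation to a 3x3 rotation matrix via
    Gram-Schmidt orthogonalisation (the standard conversion; only this step is shown,
    not the paper's deformation field that predicts the 6D vector)."""
    a1, a2 = d6[..., 0:3], d6[..., 3:6]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)        # (..., 3, 3)

d6 = torch.randn(4, 6)
R = rotation_6d_to_matrix(d6)
# Valid rotations: R^T R = I and det(R) = +1.
print(torch.allclose(R.transpose(-1, -2) @ R, torch.eye(3).expand(4, 3, 3), atol=1e-5),
      torch.allclose(torch.det(R), torch.ones(4), atol=1e-5))
```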
2404.06244 Report Anchor-based Robust Finetuning of Vision-Language Models Jinwei Han, Zhiwen Lin, Zhongyisun Sun, Yingguo Gao, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD) generalization. We address two types of OOD generalization, i.e., i) domain shift such as natural to sketch images, and ii) zero-shot capability to recognize the category that was not contained in the finetune data. Arguably, the diminished OOD generalization after finetuning stems from the excessively simplified finetuning target, which only provides the class information, such as ``a photo of a [CLASS]''. This is distinct from the process in that CLIP was pretrained, where there is abundant text supervision with rich semantic information. Therefore, we propose to compensate for the finetune process using auxiliary supervision with rich semantic information, which acts as anchors to preserve the OOD generalization. Specifically, two types of anchors are elaborated in our method, including i) text-compensated anchor which uses the images from the finetune set but enriches the text supervision from a pretrained captioner, ii) image-text-pair anchor which is retrieved from the dataset similar to pretraining data of CLIP according to the downstream task, associating with the original CLIP text with rich semantics. Those anchors are utilized as auxiliary semantic information to maintain the original feature space of CLIP, thereby preserving the OOD generalization capabilities. Comprehensive experiments demonstrate that our method achieves in-distribution performance akin to conventional finetuning while attaining new state-of-the-art results on domain shift and zero-shot learning benchmarks. This paper proposes Anchor-based Robust Finetuning (ARF) to preserve the out-of-distribution (OOD) generalization of vision-language models during finetuning, addressing both domain shift and zero-shot learning. Maintaining OOD generalization (domain shift and zero-shot learning) is crucial for pretrained vision-language models like CLIP, even after finetuning on downstream tasks, to ensure broad applicability. ARF utilizes two types of anchors: text-compensated anchors (image paired with a generated caption) and image-text-pair anchors (retrieved from a dataset similar to CLIP's pretraining data) to regularize the finetuning process with auxiliary contrastive supervision. ARF achieves state-of-the-art performance on domain shift benchmarks like ImageNet variants and DomainNet, surpassing conventional finetuning and other robust finetuning methods. ARF excels in zero-shot learning on diverse recognition tasks, maintaining high accuracy on unseen categories while other methods suffer significant degradation. Ablation studies confirm the effectiveness of both anchor types and the impact of caption quality on ARF's performance. The reliance on pretrained captioners and retrieval methods introduces potential limitations in terms of caption quality and retrieval effectiveness. Exploring the use of larger language models (LLMs) for generating more diverse and informative captions presents a promising direction for future work. vision-language models, robust finetuning, out-of-distribution generalization, domain shift, zero-shot learning
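A hedged sketch of how the two anchor types could enter the finetuning objective: a CLIP-style symmetric contrastive term over the text-compensated anchors and another over the retrieved image-text pairs, added to the downstream classification loss. The loss composition and weights below are assumptions, not the paper's reported recipe.

```python
import torch
import torch.nn.functional as F

def clip_contrastive(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image/text embeddings (CLIP-style)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(len(img_emb))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def arf_style_loss(cls_logits, cls_labels,
                   ft_img, caption_txt,       # text-compensated anchors (finetune images + captions)
                   ret_img, ret_txt,          # retrieved image-text-pair anchors
                   w_caption=1.0, w_retrieval=1.0):
    """Finetuning loss plus two anchor-based contrastive terms (illustrative weighting)."""
    loss_ce = F.cross_entropy(cls_logits, cls_labels)
    loss_cap = clip_contrastive(ft_img, caption_txt)
    loss_ret = clip_contrastive(ret_img, ret_txt)
    return loss_ce + w_caption * loss_cap + w_retrieval * loss_ret

# Toy shapes: batch of 8, 512-d CLIP embeddings, 100-way downstream classification.
B, D, C = 8, 512, 100
loss = arf_style_loss(torch.randn(B, C), torch.randint(0, C, (B,)),
                      torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```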
2404.06135 Report Mansformer: Efficient Transformer of Mixed Attention for Image Deblurring and Beyond Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien, Ming-Hsuan Yang Transformer has made an enormous success in natural language processing and high-level vision over the past few years. However, the complexity of self-attention is quadratic to the image size, which makes it infeasible for high-resolution vision tasks. In this paper, we propose the Mansformer, a Transformer of mixed attention that combines multiple self-attentions, gate, and multi-layer perceptrons (MLPs), to explore and employ more possibilities of self-attention. Taking efficiency into account, we design four kinds of self-attention, whose complexities are all linear. By elaborate adjustment of the tensor shapes and dimensions for the dot product, we split the typical self-attention of quadratic complexity into four operations of linear complexity. To adaptively merge these different kinds of self-attention, we take advantage of an architecture similar to Squeeze-and-Excitation Networks. Furthermore, we merge the two-stage Transformer design into one stage via the proposed gated-dconv MLP. Image deblurring is our main target, while extensive quantitative and qualitative evaluations show that this method performs favorably against state-of-the-art methods on tasks well beyond deblurring. The source codes and trained models will be made available to the public. This paper proposes Mansformer, an efficient Transformer model using mixed attention for image deblurring and other image restoration tasks. Existing Transformer models for image restoration struggle to balance computational complexity with the need for both global and local context. Mansformer addresses this with a novel mixed attention mechanism and a more efficient network architecture. The Mansformer uses a combination of four linear complexity self-attention mechanisms: local spatial, local channel, global spatial, and global channel. It also introduces a gated-dconv MLP to merge the typical two-stage Transformer design into a single stage. Mansformer achieves state-of-the-art performance on single-image motion deblurring benchmarks GoPro and HIDE. It outperforms previous best methods on image deblurring with JPEG artifacts (REDS dataset) and image deraining (Rain13K). Ablation studies demonstrate the contribution of each attention mechanism and the efficiency gain from the gated-dconv MLP. The model's performance gain is less significant on tasks with smaller image sizes, like real image denoising. Future work could explore further optimization of the simplified channel attention module for better parameter efficiency. image deblurring, vision transformer, mixed attention, image restoration, gated-dconv mlp
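Of the four linear-complexity attentions listed in the summary, the global channel variant is the easiest to illustrate: attention is taken across channels, giving a C x C attention map whose cost is linear in the number of pixels. The module below is an assumed, simplified sketch; the gating and the paper's exact layer layout are omitted.

```python
import torch
import torch.nn as nn

class GlobalChannelAttention(nn.Module):
    """Sketch of one of the four linear-complexity attentions described in the summary:
    attention is computed across channels (a C x C map), so the cost grows linearly
    with the number of pixels H*W rather than quadratically. Illustrative only."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        reshape = lambda t: t.reshape(b, self.heads, c // self.heads, h * w)
        q, k, v = map(reshape, (q, k, v))
        q = nn.functional.normalize(q, dim=-1)
        k = nn.functional.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature      # (b, heads, c/h, c/h)
        out = attn.softmax(dim=-1) @ v                            # (b, heads, c/h, h*w)
        return self.proj(out.reshape(b, c, h, w)) + x             # residual connection

x = torch.randn(1, 48, 64, 64)
print(GlobalChannelAttention(48)(x).shape)   # torch.Size([1, 48, 64, 64])
```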
2404.06119 Report DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation Junkai Yan, Yipeng Gao, Qize Yang, Xihan Wei, Xuansong Xie, Ancong Wu, Wei-Shi Zheng Text-to-3D generation, which synthesizes 3D assets according to an overall text description, has significantly progressed. However, a challenge arises when the specific appearances need customizing at designated viewpoints but referring solely to the overall description for generating 3D objects. For instance, ambiguity easily occurs when producing a T-shirt with distinct patterns on its front and back using a single overall text guidance. In this work, we propose DreamView, a text-to-image approach enabling multi-view customization while maintaining overall consistency by adaptively injecting the view-specific and overall text guidance through a collaborative text guidance injection module, which can also be lifted to 3D generation via score distillation sampling. DreamView is trained with large-scale rendered multi-view images and their corresponding view-specific texts to learn to balance the separate content manipulation in each view and the global consistency of the overall object, resulting in a dual achievement of customization and consistency. Consequently, DreamView empowers artists to design 3D objects creatively, fostering the creation of more innovative and diverse 3D assets. Code and model will be released at https://github.com/iSEE-Laboratory/DreamView. The paper introduces DreamView, a text-to-3D generation approach that allows for customized appearances from different viewpoints while maintaining overall 3D consistency. Existing text-to-3D methods struggle to customize specific viewpoints based on a single, shared text description, limiting their flexibility and creative potential. DreamView employs an adaptive text guidance injection module that balances the influence of overall and view-specific text prompts within a diffusion model. It is first trained for text-to-image generation and then lifted to 3D generation via score distillation sampling. DreamView-2D outperforms existing text-to-image models in generating images consistent with both overall and view-specific descriptions. DreamView-3D successfully generates 3D objects that adhere to detailed text prompts, showcasing unique appearances defined by each viewpoint. User study confirms that DreamView-3D is preferred for generating 3D assets that align with text descriptions and exhibit high visual quality. Generated full-body characters sometimes have blurry faces due to the use of low-resolution training images. DreamView relies on consistent text descriptions for different viewpoints, meaning it cannot generate different objects from different views. text-to-3d generation, view customization, diffusion models, score distillation sampling, text-to-image generation
2404.06109 Report Revising Densification in Gaussian Splatting Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder In this paper, we address the limitations of Adaptive Density Control (ADC) in 3D Gaussian Splatting (3DGS), a scene representation method achieving high-quality, photorealistic results for novel view synthesis. ADC has been introduced for automatic 3D point primitive management, controlling densification and pruning, however, with certain limitations in the densification logic. Our main contribution is a more principled, pixel-error driven formulation for density control in 3DGS, leveraging an auxiliary, per-pixel error function as the criterion for densification. We further introduce a mechanism to control the total number of primitives generated per scene and correct a bias in the current opacity handling strategy of ADC during cloning operations. Our approach leads to consistent quality improvements across a variety of benchmark scenes, without sacrificing the method's efficiency. This paper introduces a novel density control mechanism for 3D Gaussian Splatting (3DGS) that leverages per-pixel errors to guide the densification process, addressing limitations of the original gradient-based approach. Adaptive Density Control (ADC) is crucial in 3DGS for determining where to allocate scene representation capacity. However, the existing gradient-based ADC can lead to underfitting, particularly in high-frequency texture areas, and lacks control over the number of primitives. The authors propose an error-driven approach where per-pixel errors (e.g., from SSIM) are propagated back to individual Gaussian primitives based on their contribution to the rendered pixel. This per-primitive error guides densification decisions, prioritizing primitives with higher errors for splitting/cloning. The paper also introduces an opacity correction mechanism to address a bias in the cloning process and provides control over the total number of primitives to prevent memory issues. The error-driven ADC consistently improves perceptual quality (SSIM, LPIPS) across various benchmark datasets compared to the original 3DGS and its Mip-Splatting variant. The opacity correction mechanism and primitive growth control further contribute to performance gains and stabilize training. The proposed approach effectively addresses underfitting in high-frequency texture regions, leading to more perceptually accurate reconstructions. The method might still exhibit underfitting in scenes with complex view-dependent effects or significant appearance variations. Improving the handling of strong view-dependence, appearance variations, and limitations of linear approximation in splatting are potential avenues for future work. gaussian splatting, 3d reconstruction, novel view synthesis, adaptive density control, densification
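The core idea, propagating a per-pixel error back to the primitives that rendered each pixel and densifying the highest-scoring ones, can be sketched with a scatter-add. The weighting below (blending weight times pixel error) is an assumption used for illustration, not the paper's exact formulation.

```python
import torch

def primitive_errors(pixel_error, prim_ids, blend_w, num_primitives):
    """Distribute a per-pixel error to the Gaussians that rendered each pixel, weighted
    by their blending contribution (illustrative re-implementation of the idea).

    pixel_error: (P,)     error of each pixel (e.g. 1 - SSIM)
    prim_ids:    (P, K)   ids of the K primitives contributing to each pixel
    blend_w:     (P, K)   their alpha-blending weights (rows sum to <= 1)
    """
    contrib = pixel_error[:, None] * blend_w                     # (P, K)
    scores = torch.zeros(num_primitives)
    scores.scatter_add_(0, prim_ids.reshape(-1), contrib.reshape(-1))
    return scores

# Toy example: 5 pixels, 3 contributing primitives each, 10 primitives in total.
torch.manual_seed(0)
err = torch.rand(5)
ids = torch.randint(0, 10, (5, 3))
w = torch.softmax(torch.rand(5, 3), dim=-1)
scores = primitive_errors(err, ids, w, num_primitives=10)
to_densify = torch.topk(scores, k=3).indices                     # grow where error is largest
print(scores, to_densify)
```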
2404.06091 Report Hash3D: Training-free Acceleration for 3D Generation Xingyi Yang, Xinchao Wang The evolution of 3D generative modeling has been notably propelled by the adoption of 2D diffusion models. Despite this progress, the cumbersome optimization process per se presents a critical hurdle to efficiency. In this paper, we introduce Hash3D, a universal acceleration for 3D generation without model training. Central to Hash3D is the insight that feature-map redundancy is prevalent in images rendered from camera positions and diffusion time-steps in close proximity. By effectively hashing and reusing these feature maps across neighboring timesteps and camera angles, Hash3D substantially prevents redundant calculations, thus accelerating the diffusion model's inference in 3D generation tasks. We achieve this through an adaptive grid-based hashing. Surprisingly, this feature-sharing mechanism not only speeds up the generation but also enhances the smoothness and view consistency of the synthesized 3D objects. Our experiments, covering 5 text-to-3D and 3 image-to-3D models, demonstrate Hash3D's versatility in speeding up optimization, enhancing efficiency by 1.3 to 4 times. Additionally, Hash3D's integration with 3D Gaussian splatting largely speeds up 3D model creation, reducing text-to-3D processing to about 10 minutes and image-to-3D conversion to roughly 30 seconds. The project page is at https://adamdad.github.io/hash3D/. This paper presents Hash3D, a novel training-free method to accelerate diffusion-based 3D generation by reusing features from similar camera views and timesteps through an adaptive grid-based hashing approach. Existing 3D generative models based on 2D diffusion models suffer from a lengthy optimization process due to repetitive score function sampling at various camera poses and timesteps. Hash3D addresses this efficiency bottleneck without compromising generation quality. Hash3D introduces a memory system with an adaptive grid-based hashing function. This allows for storing and retrieving features extracted from the diffusion model across different camera poses and timesteps. When a new view is similar to a previously computed one, Hash3D reuses the features, avoiding redundant calculations. Hash3D demonstrates its versatility by significantly speeding up both text-to-3D and image-to-3D generation processes, achieving a speed improvement of 1.3x to 4x across various baselines. Beyond efficiency gains, Hash3D also slightly enhances the visual quality of generated 3D models, as evidenced by quantitative metrics and user study results. The adaptive grid-based hashing mechanism proves effective in balancing computational cost and performance, outperforming methods using constant grid sizes or direct noise hashing. The current implementation of adaptive grid sizing uses a brute-force search, which might be sub-optimal. Future work may explore more sophisticated hashing function learning approaches to further improve the efficiency and accuracy of feature retrieval. 3d generation, diffusion models, score distillation sampling, hashing, acceleration
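To make the grid-based hashing concrete, the sketch below caches diffusion features under a quantized (azimuth, elevation, timestep) key and reuses them on a hit. The grid sizes, the `FeatureHashCache` class, and the cached payload are illustrative assumptions; Hash3D additionally adapts the grid size per query, which is omitted here.

```python
# Hedged sketch of grid-based feature reuse in the spirit of Hash3D: cache diffusion
# features by a quantized (camera, timestep) key and reuse them for nearby queries.
import math

class FeatureHashCache:
    def __init__(self, d_angle=10.0, d_step=5):
        self.d_angle = d_angle      # degrees per grid cell for azimuth/elevation (assumed)
        self.d_step = d_step        # diffusion timesteps per grid cell (assumed)
        self.table = {}

    def _key(self, azimuth, elevation, t):
        return (int(azimuth // self.d_angle),
                int(elevation // self.d_angle),
                int(t // self.d_step))

    def lookup(self, azimuth, elevation, t):
        return self.table.get(self._key(azimuth, elevation, t))

    def store(self, azimuth, elevation, t, features):
        self.table[self._key(azimuth, elevation, t)] = features

# usage: skip the expensive diffusion feature pass on a cache hit
cache = FeatureHashCache()
feats = cache.lookup(azimuth=32.0, elevation=10.0, t=480)
if feats is None:
    feats = {"mid_block": "expensive_features_here"}   # placeholder for a real forward pass
    cache.store(32.0, 10.0, 480, feats)
```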
2404.06050 Report Incremental Joint Learning of Depth, Pose and Implicit Scene Representation on Monocular Camera in Large-scale Scenes Tianchen Deng, Nailin Wang, Chongdi Wang, Shenghai Yuan, Jingchuan Wang, Danwei Wang, Weidong Chen Dense scene reconstruction for photo-realistic view synthesis has various applications, such as VR/AR, autonomous vehicles. However, most existing methods have difficulties in large-scale scenes due to three core challenges: (a) inaccurate depth input. Accurate depth input is impossible to get in real-world large-scale scenes. (b) inaccurate pose estimation. Most existing approaches rely on accurate pre-estimated camera poses. (c) insufficient scene representation capability. A single global radiance field lacks the capacity to effectively scale to large-scale scenes. To this end, we propose an incremental joint learning framework, which can achieve accurate depth, pose estimation, and large-scale scene reconstruction. A vision transformer-based network is adopted as the backbone to enhance performance in scale information estimation. For pose estimation, a feature-metric bundle adjustment (FBA) method is designed for accurate and robust camera tracking in large-scale scenes. In terms of implicit scene representation, we propose an incremental scene representation method to construct the entire large-scale scene as multiple local radiance fields to enhance the scalability of 3D scene representation. Extended experiments have been conducted to demonstrate the effectiveness and accuracy of our method in depth estimation, pose estimation, and large-scale scene reconstruction. This paper presents an incremental joint learning framework for accurate depth and pose estimation, enabling large-scale scene reconstruction using a monocular camera. Existing methods struggle with large-scale scene reconstruction due to inaccurate depth and pose estimations, and limited scene representation capabilities. The framework leverages a vision transformer-based depth network, a feature-metric bundle adjustment (FBA) for pose estimation, and an incremental scene representation method that dynamically creates local radiance fields. The proposed method significantly outperforms state-of-the-art methods in novel view synthesis quality on Tanks and Temples, Static Hikes, and a proprietary dataset. FBA achieves superior pose estimation accuracy compared to existing methods, especially in large-scale scenes. The incremental scene representation method effectively handles long camera trajectories and large-scale scenes by dynamically creating local radiance fields. The method's reliance on a good initialization of the local radiance fields might limit its performance in highly dynamic environments. Future work could focus on incorporating semantic information into the framework for richer scene understanding. scene reconstruction, depth estimation, pose estimation, neural radiance fields, incremental learning
2404.05979 Report StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, Changsheng Xu Story visualization aims to generate a series of realistic and coherent images based on a storyline. Current models adopt a frame-by-frame architecture by transforming the pre-trained text-to-image model into an auto-regressive manner. Although these models have shown notable progress, there are still three flaws. 1) The unidirectional generation of auto-regressive manner restricts the usability in many scenarios. 2) The additional introduced story history encoders bring an extremely high computational cost. 3) The story visualization and continuation models are trained and inferred independently, which is not user-friendly. To these ends, we propose a bidirectional, unified, and efficient framework, namely StoryImager. The StoryImager enhances the storyboard generative ability inherited from the pre-trained text-to-image model for a bidirectional generation. Specifically, we introduce a Target Frame Masking Strategy to extend and unify different story image generation tasks. Furthermore, we propose a Frame-Story Cross Attention Module that decomposes the cross attention for local fidelity and global coherence. Moreover, we design a Contextual Feature Extractor to extract contextual information from the whole storyline. The extensive experimental results demonstrate the excellent performance of our StoryImager. The code is available at https://github.com/tobran/StoryImager. This paper proposes StoryImager, a unified and efficient framework for story visualization and completion, capable of generating coherent and high-fidelity story image sequences. Existing story visualization models suffer from limitations such as unidirectional generation, high computational cost, and separate training for different tasks. This limits their usability and efficiency. StoryImager leverages a Storyboard-based Generation approach with a Target Frame Masking Strategy to unify different story image generation tasks. It introduces a Frame-Story Cross Attention Module for local fidelity and global coherence and uses a Contextual Feature Extractor for global context information. StoryImager outperforms previous state-of-the-art models in FID and FSD on both story visualization and continuation tasks. It demonstrates significant improvements in visual consistency and story relevance based on human evaluation. StoryImager is computationally more efficient and requires less hardware resources compared to previous models. The model currently only supports a fixed number of frames in a storyboard. The inference speed is still limited by the diffusion model's sampling steps. story visualization, story completion, generative model, diffusion model, computer vision
2404.05961 Report LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data. The paper introduces LLM2Vec, a simple unsupervised approach for transforming decoder-only LLMs into strong text encoders using bidirectional attention, masked next token prediction, and unsupervised contrastive learning. LLMs are powerful text encoders, but their causal attention mechanism limits their ability to generate rich contextual representations. LLM2Vec overcomes this limitation, enabling the use of LLMs for a wider range of NLP tasks. LLM2Vec consists of three steps: 1) enabling bidirectional attention, 2) adapting the model to bidirectional attention via masked next token prediction (MNTP) training, and 3) applying unsupervised contrastive learning (SimCSE) for better sequence representations. LLM2Vec-transformed models outperform encoder-only models by a large margin on word-level tasks like chunking, NER, and POS tagging. LLM2Vec achieves state-of-the-art performance among unsupervised models on the Massive Text Embeddings Benchmark (MTEB). Combining LLM2Vec with supervised contrastive learning achieves state-of-the-art performance on MTEB among models trained only on publicly available data. The paper primarily focuses on English text corpora and benchmarks. Extending LLM2Vec to other languages is left for future work. The large size of decoder-only LLMs presents challenges for training and inference efficiency, especially for encoding large document collections. text embedding, language models, decoder-only models, bidirectional attention, contrastive learning
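Step 3 of LLM2Vec applies unsupervised SimCSE. A minimal sketch of that contrastive step is given below: the same batch is encoded twice with independent dropout masks and the two views are aligned with an InfoNCE loss over mean-pooled representations. The `encoder` stand-in, the mean-pooling choice, and the temperature value are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of the unsupervised SimCSE step used in LLM2Vec (step 3).
import torch
import torch.nn.functional as F

def mean_pool(hidden, mask):
    # hidden: (B, T, D), mask: (B, T) with 1 for real tokens
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-6)

def simcse_loss(encoder, input_ids, attention_mask, temperature=0.05):
    # two stochastic forward passes (dropout active) give two views of each sequence
    z1 = mean_pool(encoder(input_ids), attention_mask)
    z2 = mean_pool(encoder(input_ids), attention_mask)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) cosine similarities
    labels = torch.arange(z1.size(0))           # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# toy usage with a dropout-bearing stand-in encoder (a real LLM would go here)
encoder = torch.nn.Sequential(torch.nn.Embedding(1000, 64), torch.nn.Dropout(0.1))
ids = torch.randint(0, 1000, (4, 12))
mask = torch.ones(4, 12)
loss = simcse_loss(encoder, ids, mask)
```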
2404.05729 Report Finding Visual Task Vectors Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar Visual Prompting is a technique for teaching models to perform a visual task via in-context examples, without any additional training. In this work, we analyze the activations of MAE-VQGAN, a recent Visual Prompting model, and find task vectors, activations that encode task-specific information. Equipped with this insight, we demonstrate that it is possible to identify the task vectors and use them to guide the network towards performing different tasks without providing any input-output examples. To find task vectors, we compute the average intermediate activations per task and use the REINFORCE algorithm to search for the subset of task vectors. The resulting task vectors guide the model towards performing a task better than the original model without the need for input-output examples. This paper explores visual in-context learning by identifying and leveraging "task vectors", which are activations within computer vision models that encode task-specific information, to enable zero-shot task execution. This research provides insights into the mechanisms of visual in-context learning and offers a novel approach to adapt models for specific tasks without requiring input-output examples, potentially improving efficiency and flexibility. The authors propose an activation scoring mechanism to rank activations based on their task-specificity and use REINFORCE to identify the optimal subset of task vectors that, when patched into the model, guide it towards performing the desired task. The study reveals the existence of visual task vectors in the activation space of vision transformers, particularly within specific attention heads. The proposed method enables zero-shot visual task execution by patching identified task vectors, achieving comparable or even superior performance to one-shot prompting methods. The research highlights the distributed nature of task vectors across both the encoder and decoder of the network, emphasizing their complex role in visual in-context learning. The study focuses primarily on identifying task vectors, while acknowledging the potential presence of other important vector types, such as those encoding image structure, requiring further investigation. The optimization process currently relies on evaluating model performance in pixel space for most tasks, leaving room for exploring alternative evaluation metrics in the VQGAN token space for potentially improved accuracy. visual in-context learning, task vectors, zero-shot learning, visual prompting, vision transformers
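The two ingredients described above, averaging intermediate activations per task and patching the mean back in, can be sketched with forward hooks as below. The module chosen as the task-carrying site and the toy model are assumptions, and the REINFORCE search over which activations to patch is omitted.

```python
# Hedged sketch of the "task vector" idea: average a module's activations over task
# examples, then patch that mean back in to steer the model without an in-context prompt.
import torch

def collect_mean_activation(model, module, examples):
    """Average the chosen module's output over a set of task examples."""
    acts = []
    handle = module.register_forward_hook(lambda m, inp, out: acts.append(out.detach()))
    with torch.no_grad():
        for x in examples:
            model(x)
    handle.remove()
    return torch.stack(acts).mean(dim=0)   # the candidate task vector

def patch_activation(module, task_vector):
    """Replace the module's output with the stored task vector during inference."""
    return module.register_forward_hook(lambda m, inp, out: task_vector.expand_as(out))

# toy usage with a stand-in model
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 8))
target = model[1]                                    # pretend this site carries the task
examples = [torch.randn(1, 8) for _ in range(16)]
tv = collect_mean_activation(model, target, examples)
hook = patch_activation(target, tv)
out = model(torch.randn(1, 8))                       # now runs with the patched activation
hook.remove()
```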
2404.05726 Report MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at https://boheumd.github.io/MA-LMM/. This paper introduces MA-LMM, a memory-augmented large multimodal model designed for efficient and effective long-term video understanding. Existing LLM-based multimodal models struggle with long videos due to limited context length and high GPU memory consumption. MA-LMM addresses these issues by processing videos in an online manner and storing historical information in a memory bank. MA-LMM processes video frames sequentially. It employs a visual memory bank to store raw visual features and a query memory bank to capture evolving video understanding from a Q-Former. A memory bank compression technique mitigates memory demands by merging similar adjacent features. MA-LMM achieves state-of-the-art results on long-term video understanding benchmarks (LVU, Breakfast, COIN). It outperforms existing methods on video question answering (MSRVTT-QA, MSVD-QA) and video captioning (MSRVTT, MSVD, YouCook2) datasets. Ablation studies demonstrate the contribution of each memory bank and the effectiveness of the memory bank compression technique. Processing long videos can still lead to prolonged inference times. Future work includes using a video or clip-based visual encoder, pre-training on large-scale video-text datasets, and incorporating a more powerful LLM. long-term video understanding, large multimodal models, memory bank, video question answering, video captioning
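One way to picture the memory bank compression, merging the most similar adjacent features until the bank fits its budget, is the sketch below; the averaging merge rule and fixed budget are illustrative assumptions rather than MA-LMM's exact procedure.

```python
# Hedged sketch of memory-bank compression by merging similar adjacent features.
import torch
import torch.nn.functional as F

def compress_memory_bank(bank, max_len):
    """bank: (T, D) tensor of per-timestep features, returned with T <= max_len."""
    while bank.size(0) > max_len:
        a, b = bank[:-1], bank[1:]
        sims = F.cosine_similarity(a, b, dim=-1)     # similarity of each adjacent pair
        i = int(torch.argmax(sims))                  # most redundant neighbouring pair
        merged = (bank[i] + bank[i + 1]) / 2
        bank = torch.cat([bank[:i], merged.unsqueeze(0), bank[i + 2:]], dim=0)
    return bank

# toy usage: a 10-step visual memory compressed to 6 slots
bank = torch.randn(10, 32)
print(compress_memory_bank(bank, max_len=6).shape)   # torch.Size([6, 32])
```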
2404.05719 Report Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks. Ferret-UI, a new multimodal large language model (MLLM) specifically designed for mobile UI understanding with enhanced referring, grounding, and reasoning capabilities. Existing MLLMs often struggle with the unique characteristics of UI screens, such as elongated aspect ratios and small objects of interest, limiting their effectiveness in UI understanding and interaction tasks. The authors build upon the Ferret MLLM and introduce several key innovations: (1) Integration of "any resolution" to handle varying screen aspect ratios and enhance detail; (2) Training on a meticulously curated dataset encompassing elementary UI tasks (e.g., icon recognition, widget listing) and advanced tasks (e.g., detailed description, function inference); (3) Development of a comprehensive benchmark for evaluating model performance across various UI understanding tasks. Ferret-UI significantly outperforms the base Ferret model and other open-source UI MLLMs on various tasks, highlighting the importance of domain-specific training. In comparison to GPT-4V, Ferret-UI demonstrates superior performance in elementary UI tasks, especially on Android screens with numerous small widgets. Ferret-UI exhibits strong performance in advanced UI tasks, including generating detailed descriptions, engaging in grounded conversations, and inferring screen functionality. The model's reliance on UI element detection poses a limitation, as it cannot learn aspects of screens not detected, such as colors, design, or missed UI elements. Future work includes exploring interactions beyond tapping, such as scrolling, long-clicking, and text input. ui understanding, multimodal large language models (mllms), referring and grounding, mobile ui, screen understanding
2404.05705 Report Learning 3D-Aware GANs from Unposed Images with Template Feature Field Xinya Chen, Hanlei Guo, Yanrui Bin, Shangzhan Zhang, Yuanbo Yang, Yue Wang, Yujun Shen, Yiyi Liao Collecting accurate camera poses of training images has been shown to well serve the learning of 3D-aware generative adversarial networks (GANs) yet can be quite expensive in practice. This work targets learning 3D-aware GANs from unposed images, for which we propose to perform on-the-fly pose estimation of training images with a learned template feature field (TeFF). Concretely, in addition to a generative radiance field as in previous approaches, we ask the generator to also learn a field from 2D semantic features while sharing the density from the radiance field. Such a framework allows us to acquire a canonical 3D feature template leveraging the dataset mean discovered by the generative model, and further efficiently estimate the pose parameters on real data. Experimental results on various challenging datasets demonstrate the superiority of our approach over state-of-the-art alternatives from both the qualitative and the quantitative perspectives. This paper presents TeFF, a novel 3D-aware GAN that learns a 3D semantic template feature field alongside the generative model to estimate camera poses of real-world images on the fly, eliminating the need for known camera pose distribution during training. This is important because estimating camera poses for real-world images is difficult and expensive, especially for diverse object categories. The method augments a generative radiance field with a semantic feature field, using the mean feature field as a template for camera pose estimation. It discretizes azimuth and elevation angles for grid search, utilizes phase correlation to estimate scale and in-plane rotation, and samples camera poses during training based on a probability distribution function derived from matching errors. TeFF generates complete 3D geometry even for complex pose distributions, outperforming baselines on datasets like CompCars, SDIP Elephant, and LSUN Plane. The method achieves superior pose distribution estimation compared to previous approaches like 3DGP and PoF3D, as demonstrated by lower KL divergence values. Ablation studies confirm the effectiveness of using template feature fields and incorporating four degrees of freedom in camera pose estimation. TeFF currently struggles with images exhibiting significant perspective distortion and does not model object articulation. Future work could explore using multiple templates for multi-category learning and disentangling geometry information during pose estimation. 3d-aware gan, camera pose estimation, semantic feature field, unposed image synthesis, generative radiance fields
2404.05674 Report MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity-preservation and prompt faithfulness. Our work is open-source, thereby providing universal access to these advancements. This paper introduces MoMA, a novel, open-vocabulary, and tuning-free image personalization model for subject-driven image generation that excels in detail fidelity, object identity resemblance, and prompt integration. Existing methods for subject-driven image generation require extensive resources for tuning or are limited to specific domains. MoMA addresses these limitations by using a pre-trained multimodal LLM for blending text prompts with visual features, enabling alterations in both background context and object texture. MoMA uses a three-part methodology: 1) A generative multimodal decoder (adapted LLaVA-7B) extracts and modifies image features from the reference image based on the target prompt. 2) Self-attention layers extract object features from a white-background version of the reference image. 3) Contextualized image features and object image features are injected into the UNet diffusion model during image generation. MoMA demonstrates superior detail accuracy and faithfulness to the target object across varied backgrounds in recontextualization tasks. MoMA effectively alters the texture of target objects as dictated by text prompts while preserving unmentioned visual features. Quantitative comparisons show that MoMA outperforms existing tuning-free methods in subject fidelity and prompt faithfulness for both recontextualization and texture editing. MoMA may struggle to accurately reproduce details for rare subjects, especially those containing text. Potential for misuse in generating deceptive content, although training excludes person-related subjects to mitigate this risk. image generation, multimodal, personalization, llm, diffusion models
2404.05673 Report CoReS: Orchestrating the Dance of Reasoning and Segmentation Xiaoyi Bao, Siyang Sun, Shuailei Ma, Kecheng Zheng, Yuxin Guo, Guosheng Zhao, Yun Zheng, Xingang Wang The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLM) often find it difficult to accurately localize the objects described in complex reasoning contexts. We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search, where each step is a progressive refinement of thought toward the final object. Thus we introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM's outputs into this intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 7.1\% on the ReasonSeg dataset. Project: https://chain-of-reasoning-and-segmentation.github.io/. This paper presents CoReS, a novel dual-modal chain-of-thought framework for enhancing fine-grained reasoning tasks in Multi-modal Large Language Models (MLLMs). Existing MLLMs struggle to accurately segment objects described using complex reasoning, especially when differentiating visually similar objects. CoReS addresses this by mimicking the top-down visual hierarchy employed by humans during visual search. CoReS utilizes a dual-chain structure: a reasoning chain generates multi-modal, hierarchical outputs from the MLLM, and a segmentation chain leverages this information for iterative segmentation refinement. In-context inputs, composed of text-based question-answer pairs, guide the MLLM to produce outputs adhering to the desired hierarchy. CoReS outperforms state-of-the-art methods, achieving a 7.1% improvement on the ReasonSeg benchmark. Ablation studies demonstrate the effectiveness of both the dual-chain structure and the in-context guidance. CoReS exhibits greater performance gains on tasks involving complex reasoning and fine-grained segmentation, as evident from results on refCOCOg and ReasonPart datasets. The quality of the pre-constructed context library for in-context learning could be further improved. Exploring deeper logical levels in the dual-chain structure, beyond the current two levels, is a potential direction for future research. reasoning segmentation, multi-modal learning, chain-of-thought, in-context learning, fine-grained segmentation
2404.05667 Report AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation Jiannan Ge, Lingxi Xie, Hongtao Xie, Pandeng Li, Xiaopeng Zhang, Yongdong Zhang, Qi Tian A serious issue that harms the performance of zero-shot visual recognition is named objective misalignment, i.e., the learning objective prioritizes improving the recognition accuracy of seen classes rather than unseen classes, while the latter is the true target to pursue. This issue becomes more significant in zero-shot image segmentation because the stronger (i.e., pixel-level) supervision brings a larger gap between seen and unseen classes. To mitigate it, we propose a novel architecture named AlignZeg, which embodies a comprehensive improvement of the segmentation pipeline, including proposal extraction, classification, and correction, to better fit the goal of zero-shot segmentation. (1) Mutually-Refined Proposal Extraction. AlignZeg harnesses a mutual interaction between mask queries and visual features, facilitating detailed class-agnostic mask proposal extraction. (2) Generalization-Enhanced Proposal Classification. AlignZeg introduces synthetic data and incorporates multiple background prototypes to allocate a more generalizable feature space. (3) Predictive Bias Correction. During the inference stage, AlignZeg uses a class indicator to find potential unseen class proposals followed by a prediction postprocess to correct the prediction bias. Experiments demonstrate that AlignZeg markedly enhances zero-shot semantic segmentation, as shown by an average 3.8% increase in hIoU, primarily attributed to a 7.1% improvement in identifying unseen classes, and we further validate that the improvement comes from alleviating the objective misalignment issue. This supplementary material provides further technical specifications, additional ablation studies, expanded visualization outcomes, and more thorough comparisons with related methodologies for the AlignZeg approach. The supplementary materials offer a deeper understanding of the AlignZeg method, its effectiveness in Generalized Zero-Shot Semantic Segmentation, and its advancements over existing techniques. The paper elaborates on the technical intricacies of AlignZeg, presents supplementary ablation experiments focusing on parameters like λ3 and M, and provides visual representations of proposal features and segmentation results. Additionally, it delves into comparative analyses with other methodologies like ZegCLIP, SAN, DeOP, MAFT, and PMOSR. The effectiveness of the feature expansion strategy is validated through ablation studies on the weight (λ3) for the loss Lvir. Visualizations of proposal features highlight the improved discriminative capability of AlignZeg compared to baseline methods. AlignZeg demonstrates superior performance in complex scenarios on datasets such as COCO-Stuff 164K, accurately identifying both seen and unseen class regions across various settings. The reliance on fixed category prototypes in AlignZeg might limit the mitigation of feature entanglement for certain closely related categories. Future research could explore the adaptation of category prototypes to further enhance the model's generalization capabilities. zero-shot learning, semantic segmentation, computer vision, deep learning, clip
2404.05666 Report YaART: Yet Another ART Rendering Technology Sergey Kastryulin, Artem Konev, Alexander Shishenya, Eugene Lyapustin, Artem Khurshudov, Alexander Tselousov, Nikita Vinokurov, Denis Kuznedelev, Alexander Markovich, Grigoriy Livshits, Alexey Kirillov, Anastasiia Tabisheva, Liubov Chubarova, Marina Kaminskaia, Alexander Ustyuzhanin, Artemii Shvetsov, Daniil Shlenskii, Valerii Startsev, Dmitrii Kornilov, Mikhail Romanov, Artem Babenko, Sergei Ovcharenko, Valentin Khrulkov In the rapidly progressing field of generative models, the development of efficient and high-fidelity text-to-image diffusion systems represents a significant frontier. This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences using Reinforcement Learning from Human Feedback (RLHF). During the development of YaART, we especially focus on the choices of the model and training dataset sizes, the aspects that were not systematically investigated for text-to-image cascaded diffusion models before. In particular, we comprehensively analyze how these choices affect both the efficiency of the training process and the quality of the generated images, which are highly important in practice. Furthermore, we demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets, establishing a more efficient scenario of diffusion models training. From the quality perspective, YaART is consistently preferred by users over many existing state-of-the-art models. Introduces YaART, a production-grade text-to-image cascaded diffusion model fine-tuned with RLHF, emphasizing efficient data and computational resource usage. Addresses the challenge of balancing model scale, data size, and computational cost in achieving high-fidelity text-to-image generation. Employs a cascaded diffusion model architecture with three stages (GEN64, SR256, SR1024), trained using a large, high-quality dataset filtered with a Sample Fidelity Classifier. The model is further enhanced using supervised fine-tuning and RLHF for improved aesthetics and reduced defects. Scaling model size improves training efficiency and generation quality in cascaded diffusion models. Training on a small, high-quality dataset can achieve comparable results to training on a larger dataset. RLHF significantly improves image aesthetics and reduces defects while preserving image-text relevance. Current diffusion models require substantial human supervision for optimal results (prompt engineering, parameter tuning, post-filtering). Text generation quality is currently insufficient for practical use, leading to the exclusion of text-containing images from the training dataset. diffusion models, text-to-image generation, cascaded diffusion, rlhf, dataset scaling
2404.05662 Report BinaryDM: Towards Accurate Binarization of Diffusion Model Xingyu Zheng, Haotong Qin, Xudong Ma, Mingyuan Zhang, Haojie Hao, Jiakai Wang, Zixiang Zhao, Jinyang Guo, Xianglong Liu With the advancement of diffusion models (DMs) and the substantially increased computational requirements, quantization emerges as a practical solution to obtain compact and efficient low-bit DMs. However, the highly discrete representation leads to severe accuracy degradation, hindering the quantization of diffusion models to ultra-low bit-widths. In this paper, we propose BinaryDM, a novel accurate quantization-aware training approach to push the weights of diffusion models towards the limit of 1-bit. Firstly, we present a Learnable Multi-basis Binarizer (LMB) to recover the representations generated by the binarized DM, which improves the information in details of representations crucial to the DM. Secondly, a Low-rank Representation Mimicking (LRM) is applied to enhance the binarization-aware optimization of the DM, alleviating the optimization direction ambiguity caused by fine-grained alignment. Moreover, a progressive initialization strategy is applied to training DMs to avoid convergence difficulties. Comprehensive experiments demonstrate that BinaryDM achieves significant accuracy and efficiency gains compared to SOTA quantization methods of DMs under ultra-low bit-widths. As the first binarization method for diffusion models, BinaryDM achieves impressive 16.0 times FLOPs and 27.1 times storage savings with 1-bit weight and 4-bit activation, showcasing its substantial advantages and potential for deploying DMs on resource-limited scenarios. This paper proposes BinaryDM, a novel quantization-aware training approach to achieve accurate 1-bit weight diffusion models. Quantization, especially binarization, is crucial for deploying diffusion models on resource-limited devices by offering compact model size and efficient inference. However, existing methods suffer severe accuracy degradation when applied to diffusion models with ultra-low bit-widths. BinaryDM introduces two key components: 1) a Learnable Multi-basis Binarizer (LMB) to recover rich representations from binarized weights, and 2) a Low-rank Representation Mimicking (LRM) strategy to enhance optimization stability and accuracy by aligning binarized models with full-precision counterparts in a low-rank space. Additionally, a progressive initialization strategy is employed to facilitate training convergence. BinaryDM achieves significant accuracy and efficiency gains compared to existing quantization methods for diffusion models under ultra-low bit-widths (1-bit weight and 4/8-bit activations). The proposed method demonstrates strong performance on both unconditional and conditional image generation tasks across various datasets, including CIFAR-10, LSUN, FFHQ, and ImageNet. BinaryDM achieves impressive compression and acceleration, yielding up to 16.0× FLOPs and 27.1× storage savings. The training process of BinaryDM is still computationally intensive compared to post-training quantization methods. Further research can explore extending BinaryDM to other diffusion model variants and exploring its performance in more downstream applications. diffusion models, model quantization, binarization, generative models, model compression
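The multi-basis idea can be illustrated with a residual two-basis binarizer: a second signed basis absorbs what the first one misses. The closed-form mean-absolute-value scales below are a simplification; in BinaryDM the scales are learnable (LMB) and trained jointly with the model.

```python
# Hedged sketch of a two-basis binarizer: approximate full-precision weights with
# two signed bases and per-basis scales (scales here use the closed-form mean(|.|)).
import torch

def two_basis_binarize(w):
    """Return (approx, (alpha1, b1, alpha2, b2)) with approx = alpha1*b1 + alpha2*b2."""
    b1 = torch.sign(w)
    alpha1 = w.abs().mean()              # L2-optimal scale for the first basis
    residual = w - alpha1 * b1
    b2 = torch.sign(residual)
    alpha2 = residual.abs().mean()       # second basis refines the residual detail
    approx = alpha1 * b1 + alpha2 * b2
    return approx, (alpha1, b1, alpha2, b2)

# toy check: the two-basis approximation should reduce the quantization error
w = torch.randn(256, 256)
one_basis = w.abs().mean() * torch.sign(w)
two_basis, _ = two_basis_binarize(w)
print((w - one_basis).pow(2).mean(), (w - two_basis).pow(2).mean())
```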
2404.05657 Report MLP Can Be A Good Transformer Learner Sihao Lin, Pumeng Lyu, Dongrui Liu, Tao Tang, Xiaodan Liang, Andy Song, Xiaojun Chang The self-attention mechanism is the key to the Transformer but is often criticized for its computation demands. Previous token pruning works motivate their methods from the view of computation redundancy but still need to load the full network and require the same memory costs. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers, guided by entropy considerations. We identify that, for the attention layers in bottom blocks, their subsequent MLP layers, i.e., two feed-forward layers, can elicit the same entropy quantity. Meanwhile, the accompanying MLPs are under-exploited since they exhibit smaller feature entropy compared to those MLPs in the top blocks. Therefore, we propose to integrate the uninformative attention layers into their subsequent counterparts by degenerating them into identity mappings, yielding only MLPs in certain transformer blocks. Experimental results on ImageNet-1k show that the proposed method can remove 40% of the attention layers of DeiT-B, improving throughput and memory bound without performance compromise. Code is available at https://github.com/sihaoevery/lambda_vit. This paper proposes a novel method for simplifying vision transformers by selectively removing non-essential attention layers based on entropy considerations, leading to reduced computational load and memory footprint without sacrificing performance. The self-attention mechanism in transformers, while powerful, is computationally demanding. Existing token pruning methods address computational redundancy but don't reduce memory costs. This work aims to push the memory bound by directly removing uninformative attention layers. The method leverages entropy to quantify the information carried by attention layers. It employs a novel Entropy-based Selection Strategy (NOSE) to identify combinations of attention layers with minimal impact on final output. A dilution learning technique then degenerates selected attention layers into identity mappings, effectively integrating them into subsequent MLP layers. The method can remove 40% of attention layers in DeiT-B without performance degradation on ImageNet-1k. It improves throughput by up to 36.5% and memory bound by over 20% compared to existing token pruning methods. The learned features exhibit superior transferability, outperforming competing methods in linear probing experiments on CIFAR-100. The paper primarily focuses on DeiT architecture; exploring other transformer variants could broaden applicability. The impact of removing attention layers on downstream tasks beyond classification and segmentation requires further investigation. vision transformer, attention pruning, entropy, model compression, memory bound
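A rough sketch of entropy-guided layer selection follows: estimate the entropy of each attention layer's output and treat the lowest-entropy layers as candidates to degenerate into identity mappings. The histogram entropy estimator and the greedy ranking below are assumptions standing in for the paper's entropy measure and its NOSE selection strategy.

```python
# Hedged sketch: rank attention layers by an entropy estimate of their outputs and
# flag the least informative ones for removal. The histogram entropy is a simple proxy.
import torch

def histogram_entropy(x, bins=128):
    """Shannon entropy (nats) of a histogram over all activation values in x."""
    hist = torch.histc(x.float().flatten(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * p.log()).sum()

def rank_attention_layers(layer_outputs):
    """layer_outputs: list of per-layer attention outputs; returns layer indices sorted by entropy (ascending)."""
    scores = [histogram_entropy(o) for o in layer_outputs]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order, scores          # low-entropy layers are candidates for removal

# toy usage: pretend 12 attention layers produced these token features
outputs = [torch.randn(197, 768) * (0.1 + 0.1 * i) for i in range(12)]
order, _ = rank_attention_layers(outputs)
print("candidate layers to degenerate into identity:", order[:4])
```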
2404.05626 Report Learning a Category-level Object Pose Estimator without Pose Annotations Fengrui Tian, Yaoyao Liu, Adam Kortylewski, Yueqi Duan, Shaoyi Du, Alan Yuille, Angtian Wang 3D object pose estimation is a challenging task. Previous works always require thousands of object images with annotated poses for learning the 3D pose correspondence, which is laborious and time-consuming for labeling. In this paper, we propose to learn a category-level 3D object pose estimator without pose annotations. Instead of using manually annotated images, we leverage diffusion models (e.g., Zero-1-to-3) to generate a set of images under controlled pose differences and propose to learn our object pose estimator with those images. Directly using the original diffusion model leads to images with noisy poses and artifacts. To tackle this issue, firstly, we exploit an image encoder, which is learned from a specially designed contrastive pose learning, to filter the unreasonable details and extract image feature maps. Additionally, we propose a novel learning strategy that allows the model to learn object poses from those generated image sets without knowing the alignment of their canonical poses. Experimental results show that our method has the capability of category-level object pose estimation from a single shot setting (as pose definition), while significantly outperforming other state-of-the-art methods on the few-shot category-level object pose estimation benchmarks. This paper presents detailed results of a new method for 3D pose estimation on the PASCAL3D+ dataset, particularly focusing on few-shot learning scenarios. Accurately estimating 3D object pose from a single image is crucial in various applications but remains challenging, especially with limited training data. The methodology leverages a diffusion model (Zero123) to generate multiple views of an object with varying poses, which are then used to optimize neural meshes for pose estimation. The method achieves promising results even when only one annotated instance per object category is available (e.g., 87.4% accuracy for buses). Performance improves with more annotated instances, demonstrating the effectiveness of the few-shot learning approach. Detailed results are provided for seven object categories, including bus, car, boat, motorbike, bicycle, and aeroplane. The paper currently lacks details about the neural mesh optimization process. Future work will focus on publicly releasing the implementation code. 3d pose estimation, few-shot learning, diffusion models, neural meshes, pascal3d+
2404.05621 Report MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning Matteo Farina, Massimiliano Mancini, Elia Cunegatti, Gaowen Liu, Giovanni Iacca, Elisa Ricci While excellent in transfer learning, Vision-Language models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a first, gradient-free, pruning framework for TA-VLP where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent sophisticated, combinatorial competitors in the vast majority of the cases, paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow. This paper proposes Task-Agnostic Vision-Language Model Pruning (TA-VLP), aiming to prune a VLM once while maintaining transferability to unknown downstream tasks. Current VLM pruning methods are task-specific, demanding costly re-pruning for each new task. TA-VLP addresses this by enabling a single pruning step for multiple unknown downstream tasks. The paper introduces Multimodal Flow Pruning (MFP), a gradient-free method for TA-VLP. MFP models each layer as a bipartite graph, where a parameter's importance is determined by its magnitude and the saliency of the neurons it connects. It also incorporates the multimodal distribution of pretrained VLM parameters to avoid biases. MFP outperforms or matches state-of-the-art pruning methods on various vision-language tasks (Image-Text Retrieval, Image Captioning, Visual Question Answering) across different pruning ratios. Different VLMs (BLIP, XVLM) and tasks exhibit varying degrees of 'prunability'. MFP demonstrates robustness even at extreme sparsity (90%) with XVLM, highlighting its effectiveness for aggressive compression. The paper focuses on unstructured pruning, limiting its immediate impact on reducing FLOPs and runtime. Future work could explore extending MFP to structured pruning, leveraging its neuron-level importance formulation. vision-language models, model pruning, transfer learning, multimodality, information flow
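The magnitude-plus-information-flow scoring can be sketched on a single linear layer as below: each weight's score combines its own magnitude with a saliency for the input and output neurons it connects. Defining neuron saliency as the mean absolute incident weight is an illustrative assumption, and MULTIFLOW's multimodal prior over the pruning budget is omitted.

```python
# Hedged sketch of a magnitude-times-flow pruning score on one linear layer.
import torch

def flow_scores(weight):
    """weight: (out_features, in_features). Returns an importance score per weight."""
    in_saliency = weight.abs().mean(dim=0)    # one value per input neuron (column)
    out_saliency = weight.abs().mean(dim=1)   # one value per output neuron (row)
    return weight.abs() * out_saliency[:, None] * in_saliency[None, :]

def prune_by_ratio(weight, ratio):
    """Zero out the lowest-scoring `ratio` fraction of weights (unstructured pruning)."""
    scores = flow_scores(weight)
    k = int(ratio * weight.numel())
    threshold = scores.flatten().kthvalue(k).values
    mask = (scores > threshold).float()
    return weight * mask, mask

# toy usage: prune 75% of a layer's weights
w = torch.randn(64, 128)
pruned, mask = prune_by_ratio(w, ratio=0.75)
print(mask.mean())    # ~0.25 of the weights survive
```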
2404.05607 Report A Training-Free Plug-and-Play Watermark Framework for Stable Diffusion Guokai Zhang, Lanjun Wang, Yuting Su, An-An Liu Nowadays, the family of Stable Diffusion (SD) models has gained prominence for its high quality outputs and scalability. This has also raised security concerns on social media, as malicious users can create and disseminate harmful content. Existing approaches involve training components or entire SDs to embed a watermark in generated images for traceability and responsibility attribution. However, in the era of AI-generated content (AIGC), the rapid iteration of SDs renders retraining with watermark models costly. To address this, we propose a training-free plug-and-play watermark framework for SDs. Without modifying any components of SDs, we embed diverse watermarks in the latent space, adapting to the denoising process. Our experimental findings reveal that our method effectively harmonizes image quality and watermark invisibility. Furthermore, it performs robustly under various attacks. We also have validated that our method is generalized to multiple versions of SDs, even without retraining the watermark model. This paper proposes a training-free plug-and-play watermark framework for Stable Diffusion models, enabling embedding diverse watermarks in the latent space without retraining. The rapid evolution of SD models makes retraining watermark models costly and impractical. This framework offers a flexible and efficient alternative for embedding traceable watermarks in generated images. The method involves training a watermark encoder-decoder architecture using a frozen VAE encoder-decoder from SD. The compressed watermark is embedded in the latent code after denoising, minimizing impact on image quality. Achieves high watermark invisibility, evidenced by high PSNR and SSIM scores. Maintains good watermark extraction quality, with high NC and low CER values. Demonstrates generalization across different SD versions without retraining. Watermark robustness against high-angle rotations requires further investigation. Localized pixel variations may occur in specific samples after watermark embedding. watermarking, stable diffusion, text-to-image synthesis, training-free, plug-and-play
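A minimal sketch of the plug-and-play embedding step is shown below: a small watermark encoder maps a bit string to a latent-shaped residual that is added to the final denoised latent before the frozen VAE decodes it. The `WatermarkEncoder` architecture, the shapes, and the strength term are assumptions; the paper also trains a paired extractor, which is omitted here.

```python
# Hedged sketch: embed a watermark in the SD latent space after denoising,
# without touching any Stable Diffusion component.
import torch
import torch.nn as nn

class WatermarkEncoder(nn.Module):
    def __init__(self, n_bits=48, latent_shape=(4, 64, 64)):
        super().__init__()
        c, h, w = latent_shape
        self.fc = nn.Linear(n_bits, c * h * w)
        self.latent_shape = latent_shape

    def forward(self, bits):
        # map the bit string to a latent-shaped residual
        return self.fc(bits).view(-1, *self.latent_shape)

def embed_watermark(latent, bits, encoder, strength=0.05):
    """Add a low-magnitude watermark residual to the final denoised latent."""
    return latent + strength * encoder(bits)

# toy usage: stand-ins for a real SD latent and VAE decoder
latent = torch.randn(1, 4, 64, 64)                  # latent after the last denoising step
bits = torch.randint(0, 2, (1, 48)).float()
encoder = WatermarkEncoder()
wm_latent = embed_watermark(latent, bits, encoder)  # pass this to the frozen VAE decoder
```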
2404.05603 Report Self-Explainable Affordance Learning with Embodied Caption Zhipeng Zhang, Zhimin Wei, Guolei Sun, Peng Wang, Luc Van Gool In the field of visual affordance learning, previous methods mainly used abundant images or videos that delineate human behavior patterns to identify action possibility regions for object manipulation, with a variety of applications in robotic tasks. However, they encounter a main challenge of action ambiguity, illustrated by the vagueness like whether to beat or carry a drum, and the complexities involved in processing intricate scenes. Moreover, it is important for human intervention to rectify robot errors in time. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied caption. This innovation enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning. Due to a lack of appropriate dataset, we unveil a pioneering dataset and metrics tailored for this task, which integrates images, heatmaps, and embodied captions. Furthermore, we propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner. Extensive quantitative and qualitative experiments demonstrate our method's effectiveness. This paper introduces Self-Explainable Affordance Learning (SEA), a new paradigm for robots to not only learn touchable regions for object manipulation, but also generate embodied captions explaining their intended actions and target objects. Existing visual affordance learning methods suffer from action ambiguity in complex scenes, lacking interpretability for human understanding and potential error correction. A novel SEA dataset with embodied captions is created based on AGD20K. The proposed SEA model utilizes DINO-ViT and CLIP for visual and multimodal embedding respectively, a Self-Explainable Former for action and object classification, and a Pixel-level Fusion architecture for affordance map localization. SEA effectively combines affordance grounding with self-explanation, outperforming baselines on affordance grounding metrics (KLD, SIM, NSS). The model successfully generates self-explainable captions, achieving high accuracy in object-action identification using top-k metrics. Qualitative results showcase reduced ambiguity and improved interpretability in predicting touchable regions and associated actions. Limitations exist in handling complex, open-world scenarios with intricate sentence structures. Future work will explore advanced language models and feedback mechanisms for enhanced human-robot interaction. embodied caption, visual affordance learning, self-explainable ai, robotics, vision-language
2404.05595 Report UniFL: Improve Stable Diffusion via Unified Feedback Learning Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Min Zheng, Lean Fu, Guanbin Li Diffusion models have revolutionized the field of image generation, leading to the proliferation of high-quality models and diverse downstream applications. However, despite these significant advancements, the current competitive solutions still suffer from several limitations, including inferior visual quality, a lack of aesthetic appeal, and inefficient inference, without a comprehensive solution in sight. To address these challenges, we present UniFL, a unified framework that leverages feedback learning to enhance diffusion models comprehensively. UniFL stands out as a universal, effective, and generalizable solution applicable to various diffusion models, such as SD1.5 and SDXL. Notably, UniFL incorporates three key components: perceptual feedback learning, which enhances visual quality; decoupled feedback learning, which improves aesthetic appeal; and adversarial feedback learning, which optimizes inference speed. In-depth experiments and extensive user studies validate the superior performance of our proposed method in enhancing both the quality of generated models and their acceleration. For instance, UniFL surpasses ImageReward by 17% user preference in terms of generation quality and outperforms LCM and SDXL Turbo by 57% and 20% in 4-step inference. Moreover, we have verified the efficacy of our approach in downstream tasks, including Lora, ControlNet, and AnimateDiff. UniFL, a unified feedback learning framework for improving visual quality, aesthetics, and inference speed of diffusion models, particularly stable diffusion. Existing diffusion models suffer from inferior visual quality, lack of aesthetic appeal, and inefficient inference. UniFL offers a comprehensive solution addressing these challenges simultaneously. UniFL leverages three key components: 1) Perceptual Feedback Learning (PeFL) using existing perceptual models for visual quality; 2) Decoupled Feedback Learning with dimension-specific reward models for aesthetics; 3) Adversarial Feedback Learning for inference acceleration. UniFL significantly enhances generation quality, outperforming ImageReward by 17% in user preference. UniFL achieves superior acceleration, surpassing LCM by 57% in a 4-step inference user study. UniFL demonstrates strong generalization across downstream tasks like LoRA, ControlNet, and AnimateDiff. Exploring larger visual perception models for enhanced supervision in PeFL. Investigating extreme acceleration possibilities, particularly towards 1-step inference. diffusion models, feedback learning, text-to-image generation, inference acceleration, aesthetic quality
2404.05580 Report Responsible Visual Editing Minheng Ni, Yeli Shen, Lei Zhang, Wangmeng Zuo With recent advancements in visual synthesis, there is a growing risk of encountering images with detrimental effects, such as hate, discrimination, or privacy violations. The research on transforming harmful images into responsible ones remains unexplored. In this paper, we formulate a new task, responsible visual editing, which entails modifying specific concepts within an image to render it more responsible while minimizing changes. However, the concept that needs to be edited is often abstract, making it challenging to locate what needs to be modified and plan how to modify it. To tackle these challenges, we propose a Cognitive Editor (CoEditor) that harnesses the large multimodal model through a two-stage cognitive process: (1) a perceptual cognitive process to focus on what needs to be modified and (2) a behavioral cognitive process to strategize how to modify. To mitigate the negative implications of harmful images on research, we create a transparent and public dataset, AltBear, which expresses harmful information using teddy bears instead of humans. Experiments demonstrate that CoEditor can effectively comprehend abstract concepts within complex scenes and significantly surpass the performance of baseline models for responsible visual editing. We find that the AltBear dataset corresponds well to the harmful content found in real images, offering a consistent experimental evaluation, thereby providing a safer benchmark for future research. Moreover, CoEditor also shows great results in general editing. We release our code and dataset at https://github.com/kodenii/Responsible-Visual-Editing. Introduces 'responsible visual editing', modifying specific concepts in images to make them safer, fairer, or more privacy-conscious. Addresses the growing risk of harmful images created by advanced visual synthesis technology. Proposes CoEditor, a model leveraging large multimodal models (LMMs) with a two-stage cognitive process: (1) perceptual cognition to identify regions needing modification and (2) behavioral cognition to strategize the modification. CoEditor significantly outperforms baseline models in responsible image editing. The proposed AltBear dataset, using teddy bears to depict harmful content, shows high consistency with real data while mitigating ethical risks. CoEditor demonstrates strong performance in general image editing as well. High computational cost due to reliance on LMMs. GPT API's non-deterministic nature poses reproducibility challenges. responsible visual editing, image editing, large multimodal model, responsible ai, altbear dataset
2404.05578 Report Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning Mahsa Ehsanpour, Ian Reid, Hamid Rezatofighi For a complete comprehension of multi-person scenes, it is essential to go beyond basic tasks like detection and tracking. Higher-level tasks, such as understanding the interactions and social activities among individuals, are also crucial. Progress towards models that can fully understand scenes involving multiple people is hindered by a lack of sufficient annotated data for such high-level tasks. To address this challenge, we introduce Social-MAE, a simple yet effective transformer-based masked autoencoder framework for multi-person human motion data. The framework uses masked modeling to pre-train the encoder to reconstruct masked human joint trajectories, enabling it to learn generalizable and data efficient representations of motion in human crowded scenes. Social-MAE comprises a transformer as the MAE encoder and a lighter-weight transformer as the MAE decoder which operates on multi-person joints' trajectory in the frequency domain. After the reconstruction task, the MAE decoder is replaced with a task-specific decoder and the model is fine-tuned end-to-end for a variety of high-level social tasks. Our proposed model combined with our pre-training approach achieves the state-of-the-art results on various high-level social tasks, including multi-person pose forecasting, social grouping, and social action understanding. These improvements are demonstrated across four popular multi-person datasets encompassing both human 2D and 3D body pose. This paper presents Social-MAE, a transformer-based masked autoencoder framework for learning representations of multi-person motion data by reconstructing masked human joint trajectories. Existing methods for high-level social tasks in multi-person scenes suffer from a lack of sufficient annotated data. Social-MAE addresses this challenge using self-supervised pre-training via masked modeling, enabling the learning of generalizable and data-efficient motion representations. Social-MAE consists of a transformer encoder and a lighter-weight transformer decoder operating on multi-person joint trajectories in the frequency domain. The model is pre-trained by masking and reconstructing joint trajectories. For specific downstream tasks, the decoder is replaced with a task-specific one and fine-tuned with full supervision. Social-MAE achieves state-of-the-art results on the SoMoF benchmark for multi-person pose forecasting. Social-MAE outperforms previous methods on CMU-Mocap and MuPoTS-3D datasets for multi-person pose forecasting. Social-MAE sets new state-of-the-art results on JRDB-Act for social grouping and action detection using only pose data as input. Social-MAE currently only uses pose data and could benefit from incorporating visual features for richer context. The model's performance on social grouping could be further improved by incorporating 3D information. social masked autoencoder, unsupervised pre-training, multi-person motion representation, pose forecasting, social grouping, action understanding
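To make the masked-modeling recipe above concrete, here is a minimal, hypothetical sketch of the pretraining objective: mask a fraction of joint-trajectory tokens and reconstruct them with a transformer encoder and a lighter decoder. Module names, token shapes, and the MSE loss are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of masked-trajectory pretraining:
# mask a fraction of per-person joint-trajectory tokens and train a transformer
# encoder plus a lighter decoder to reconstruct them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTrajectoryAE(nn.Module):
    def __init__(self, dim=256, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)  # heavier encoder
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)  # lighter decoder

    def forward(self, tokens):
        # tokens: (B, N, dim) embeddings of multi-person joint trajectories,
        # e.g. frequency-domain coefficients projected to `dim` (assumed shape).
        B, N, _ = tokens.shape
        num_mask = int(N * self.mask_ratio)
        masked_idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_mask]
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, masked_idx, True)
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        recon = self.decoder(self.encoder(corrupted))
        return F.mse_loss(recon[mask], tokens[mask])  # reconstruct only masked tokens
```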
2404.05519 Report Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models Saman Motamed, Wouter Van Gansbeke, Luc Van Gool With recent advances in image and video diffusion models for content creation, a plethora of techniques have been proposed for customizing their generated content. In particular, manipulating the cross-attention layers of Text-to-Image (T2I) diffusion models has shown great promise in controlling the shape and location of objects in the scene. Transferring image-editing techniques to the video domain, however, is extremely challenging as object motion and temporal consistency are difficult to capture accurately. In this work, we take a first look at the role of cross-attention in Text-to-Video (T2V) diffusion models for zero-shot video editing. While one-shot models have shown potential in controlling motion and camera movement, we demonstrate zero-shot control over object shape, position and movement in T2V models. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos. This paper presents an initial exploration of using cross-attention layers in Text-to-Video (T2V) diffusion models for zero-shot video editing, focusing on controlling object size, position, and motion. Enabling zero-shot editing in T2V models provides greater flexibility and user control over generated video content without requiring additional training data or resources. The authors investigate two approaches: 1) **Forward Guidance**: Directly manipulating cross-attentions during the denoising process (similar to Prompt-to-Prompt), and 2) **Backward Guidance**: Using an energy-based loss function to guide the model towards desired cross-attention maps for specific tokens (inspired by Diffusion self-guidance and Training-Free Layout Control). Forward guidance in T2V models faces similar limitations as in image editing, such as size/shape mismatches and cross-attention overlap, hindering its effectiveness. Backward guidance shows promise for zero-shot editing in T2V models, successfully enabling control over object size and motion by manipulating cross-attention maps. Current T2V models exhibit limitations like noisy cross-attention maps compared to T2I models, necessitating further research to improve their quality and enable more robust editing techniques. Current limitations in T2V models, particularly noisy cross-attention maps, restrict the effectiveness of the proposed editing techniques. The study focuses on manipulating object size and motion, leaving other editing aspects like background control and maintaining fidelity for future work. text-to-video synthesis, video editing, diffusion models, cross-attention, zero-shot learning
2404.05384 Report Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, Yu Liu Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees into a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. our codes are available at https://github.com/SmilesDZgk/S-CFG. This paper introduces Semantic-aware Classifier-Free Guidance (S-CFG), a novel approach to enhance text-to-image diffusion models by customizing guidance degrees for different semantic units within an image. Existing Classifier-Free Guidance (CFG) methods apply a global scale, leading to spatial inconsistency in varying semantic strengths and potentially suboptimal image quality. S-CFG leverages attention maps from the diffusion model's U-net backbone to segment the latent image into semantic regions. It then adaptively adjusts CFG scales across these regions, aiming to unify the classifier score norm and balance semantic information. S-CFG consistently outperforms the original CFG strategy across various text-to-image diffusion models, including Stable Diffusion and DeepFloyd IF, as evidenced by FID-30K and CLIP Score metrics. Human evaluation confirms the superiority of S-CFG in terms of both image quality and image-text alignment. Qualitative analysis reveals notable improvements in generated samples, showcasing enhanced semantic expressiveness, entity portrayal, and fine-grained structure completion. The assumption of complete independence among different semantic units may not always hold true in practice. Current evaluation metrics might not fully capture all aspects of image quality improvement achieved by S-CFG. text-to-image generation, diffusion models, classifier-free guidance, semantic segmentation, attention mechanisms
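The core mechanism, a spatially varying guidance scale, can be sketched as follows. This shows only the final rescaling step and assumes the semantic region map and per-region scales have already been derived from the attention maps as the paper describes; the argument names and shapes are illustrative.

```python
# Minimal sketch of the core idea behind semantic-aware CFG: replace the single
# scalar guidance scale with a per-region scale map. How region labels and
# per-region scales are computed is specific to the paper; they are assumed given.
import torch

def spatial_cfg(eps_uncond, eps_cond, region_labels, region_scales):
    """
    eps_uncond, eps_cond : (B, C, H, W) conditional/unconditional noise predictions
    region_labels        : (B, H, W) integer semantic-region id per latent patch
    region_scales        : (B, R) guidance scale chosen for each of R regions
    """
    # Look up a per-pixel scale from the per-region scales.
    scale_map = torch.gather(region_scales, 1, region_labels.flatten(1))  # (B, H*W)
    scale_map = scale_map.view_as(region_labels).unsqueeze(1)             # (B, 1, H, W)
    # Standard classifier-free guidance, but with a spatially varying scale.
    return eps_uncond + scale_map * (eps_cond - eps_uncond)
```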
2404.05331 Report Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt Zhiqi Huang, Huixin Xiong, Haoyu Wang, Longguang Wang, Zhiheng Li Text-to-image generation has witnessed great progress, especially with the recent advancements in diffusion models. Since texts cannot provide detailed conditions like object appearance, reference images are usually leveraged for the control of objects in the generated images. However, existing methods still suffer limited accuracy when the relationship between the foreground and background is complicated. To address this issue, we develop a framework termed Mask-ControlNet by introducing an additional mask prompt. Specifically, we first employ large vision models to obtain masks to segment the objects of interest in the reference image. Then, the object images are employed as additional prompts to facilitate the diffusion model to better understand the relationship between foreground and background regions during image generation. Experiments show that the mask prompts enhance the controllability of the diffusion model to maintain higher fidelity to the reference image while achieving better image quality. Comparison with previous text-to-image generation methods demonstrates our method's superior quantitative and qualitative performance on the benchmark datasets. This paper proposes Mask-ControlNet, a framework for higher-quality image generation by introducing an additional mask prompt to decouple and model the foreground and background relationship in reference images. Existing text-to-image generation methods struggle to accurately control object appearance and maintain fidelity to reference images, particularly in complex compositions. The method uses large vision models (SAM) to obtain object masks from reference images. These masks, along with the reference image and text prompts, are used as conditional information for a diffusion model during image synthesis. Mask-ControlNet generates higher-quality images with fewer artifacts compared to previous methods (DreamBooth, ControlNet+LoRA, Outpainting). The method shows superior quantitative performance in FID, PSNR, SSIM, LPIPS, and user studies. Mask prompts effectively decouple foreground and background, leading to better object fidelity, reduced background overfitting, and improved foreground-background harmony. The reliance on pre-trained large vision models like SAM might limit generalizability. Future work includes exploring the impact of mask quality and different mask generation techniques on the generated images. image generation, diffusion model, controllable image synthesis, object fidelity, mask prompt
2404.05268 Report MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation Jiaxiu Jiang, Yabo Zhang, Kailai Feng, Xiaohe Wu, Wangmeng Zuo Customized text-to-image generation aims to synthesize instantiations of user-specified concepts and has achieved unprecedented progress in handling individual concept. However, when extending to multiple customized concepts, existing methods exhibit limitations in terms of flexibility and fidelity, only accommodating the combination of limited types of models and potentially resulting in a mix of characteristics from different concepts. In this paper, we introduce the Multi-concept guidance for Multi-concept customization, termed MC$^2$, for improved flexibility and fidelity. MC$^2$ decouples the requirements for model architecture via inference time optimization, allowing the integration of various heterogeneous single-concept customized models. It adaptively refines the attention weights between visual and textual tokens, directing image regions to focus on their associated words while diminishing the impact of irrelevant ones. Extensive experiments demonstrate that MC$^2$ even surpasses previous methods that require additional training in terms of consistency with input prompt and reference images. Moreover, MC$^2$ can be extended to elevate the compositional capabilities of text-to-image generation, yielding appealing results. Code will be publicly available at https://github.com/JIANGJiaXiu/MC-2. MC$^2$ is proposed as a novel method to synthesize compositions of multiple customized concepts by integrating separately trained single-concept customized models, without joint training, model merging, or extra conditioning information like bounding boxes. Existing methods for multi-concept customization in text-to-image generation lack flexibility and fidelity, limiting the types of models that can be combined and potentially leading to mixed concept characteristics. MC$^2$ leverages inference time optimization with multi-concept guidance (MCG). It analyzes cross-attention maps during diffusion to identify regions activated by different concepts, then refines attention weights to spatially disentangle them, promoting proper attribute binding. MC$^2$ demonstrates higher fidelity to reference images compared to baselines, even surpassing methods requiring additional training. Quantitative metrics on CustomConcept101 and a compositional generation benchmark show superior performance in subject/prompt fidelity and object/prompt similarity. User studies confirm MC$^2$'s effectiveness, with users finding its outputs more aligned with prompts and reference concepts. The current implementation using parallel diffusion models is memory intensive. The composed customized models are limited to those trained from the same diffusion model. text-to-image generation, customized multi-concept generation, compositional generation, diffusion models, cross-attention
2404.05236 Report Stylizing Sparse-View 3D Scenes with Hierarchical Neural Representation Y. Wang, A. Gao, Y. Gong, Y. Zeng Recently, a surge of 3D style transfer methods has been proposed that leverage the scene reconstruction power of a pre-trained neural radiance field (NeRF). To successfully stylize a scene this way, one must first reconstruct a photo-realistic radiance field from collected images of the scene. However, when only sparse input views are available, pre-trained few-shot NeRFs often suffer from high-frequency artifacts, which are generated as a by-product of high-frequency details for improving reconstruction quality. Is it possible to generate more faithful stylized scenes from sparse inputs by directly optimizing encoding-based scene representation with target style? In this paper, we consider the stylization of sparse-view scenes in terms of disentangling content semantics and style textures. We propose a coarse-to-fine sparse-view scene stylization framework, where a novel hierarchical encoding-based neural representation is designed to generate high-quality stylized scenes directly from implicit scene representations. We also propose a new optimization strategy with content strength annealing to achieve realistic stylization and better content preservation. Extensive experiments demonstrate that our method can achieve high-quality stylization of sparse-view scenes and outperforms fine-tuning-based baselines in terms of stylization quality and efficiency. This paper proposes a novel coarse-to-fine 3D scene stylization framework for generating high-quality stylized scenes from sparse input views. Existing style transfer methods struggle to produce high-quality stylized 3D scenes from sparse inputs due to the difficulty in reconstructing accurate high-frequency details. The method uses a hierarchical encoding-based neural representation to disentangle content semantics and style textures. It first reconstructs coarse geometry from sparse inputs and then utilizes a multi-resolution feature grid to generate stylized details guided by the coarse geometry. The method generates high-quality stylized scenes with multi-view consistency from sparse inputs. It outperforms state-of-the-art methods both quantitatively and qualitatively in terms of stylization quality and efficiency. The proposed content strength annealing strategy effectively balances content preservation and style matching. The method's performance depends on the quality of the coarse geometry reconstruction. The current implementation focuses on style transfer of static scenes. 3d style transfer, neural radiance fields, sparse-view synthesis, hierarchical representation, content annealing
2404.05225 Report LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, Cong Yao Recently, leveraging large language models (LLMs) or multimodal large language models (MLLMs) for document understanding has been proven very promising. However, previous works that employ LLMs/MLLMs for document understanding have not fully explored and utilized the document layout information, which is vital for precise document understanding. In this paper, we propose LayoutLLM, an LLM/MLLM based method for document understanding. The core of LayoutLLM is a layout instruction tuning strategy, which is specially designed to enhance the comprehension and utilization of document layouts. The proposed layout instruction tuning strategy consists of two components: Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture the characteristics of document layout in Layout-aware Pre-training, three groups of pre-training tasks, corresponding to document-level, region-level and segment-level information, are introduced. Furthermore, a novel module called layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on regions relevant to the question and generate accurate answers. LayoutCoT is effective for boosting the performance of document understanding. Meanwhile, it brings a certain degree of interpretability, which could facilitate manual inspection and correction. Experiments on standard benchmarks show that the proposed LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding. The training data of the LayoutLLM is publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM This paper introduces LayoutLLM, an LLM/MLLM-based document understanding method enhanced with a novel layout instruction tuning strategy. Existing LLM/MLLM document understanding approaches fail to effectively utilize crucial document layout information, limiting their accuracy and interpretability. LayoutLLM integrates a document pre-trained model encoder and employs a two-stage layout instruction tuning strategy: 1) Layout-aware Pre-training with document, region, and segment-level tasks. 2) Layout-aware Supervised Fine-tuning using a novel LayoutCoT module for interpretable, step-by-step reasoning. LayoutLLM significantly outperforms existing zero-shot LLM/MLLM methods on document understanding benchmarks. Layout-aware pre-training significantly enhances the model's understanding of document layouts at different levels. The LayoutCoT module effectively boosts performance, particularly for complex tasks, and provides interpretability. LayoutLLM currently lacks the ability to refuse false-positive outputs and provide hints. Despite improvements from layout-aware pre-training, precisely understanding complex region-level relationships remains challenging. document understanding, large language models, multimodal learning, document layout analysis, instruction tuning
2404.05188 Report Have You Merged My Model? On The Robustness of Large Language Model IP Protection Methods Against Model Merging Tianshuo Cong, Delong Ran, Zesen Liu, Xinlei He, Jinyuan Liu, Yichen Gong, Qi Li, Anyu Wang, Xiaoyun Wang Model merging is a promising lightweight model empowerment technique that does not rely on expensive computing devices (e.g., GPUs) or require the collection of specific training data. Instead, it involves editing different upstream model parameters to absorb their downstream task capabilities. However, uncertified model merging can infringe upon the Intellectual Property (IP) rights of the original upstream models. In this paper, we conduct the first study on the robustness of IP protection methods in model merging scenarios. We investigate two state-of-the-art IP protection techniques: Quantization Watermarking and Instructional Fingerprint, along with various advanced model merging technologies, such as Task Arithmetic, TIES-MERGING, and so on. Experimental results indicate that current Large Language Model (LLM) watermarking techniques cannot survive in the merged models, whereas model fingerprinting techniques can. Our research aims to highlight that model merging should be an indispensable consideration in the robustness assessment of model IP protection techniques, thereby promoting the healthy development of the open-source LLM community. This paper presents the first robustness analysis of Large Language Model (LLM) Intellectual Property (IP) protection methods against model merging attacks. Unauthorized model merging can infringe on the IP rights of upstream LLM developers, hindering the open-source LLM community's growth. The authors evaluate the robustness of Quantization Watermarking and Instructional Fingerprinting techniques against four model merging algorithms: Model Soups, Task Arithmetic, TIES-MERGING, and DARE. Model merging successfully combines the functionalities of different LLMs, creating a composite model with multifunctionality. Existing LLM watermarking techniques are vulnerable to model merging attacks, with the watermark information being effectively removed during the merging process. LLM fingerprinting, specifically Instructional Fingerprinting, demonstrates stronger robustness against model merging compared to watermarking, successfully retaining fingerprint information in the merged models. The paper focuses on merging only two models, leaving the exploration of more complex merging scenarios involving multiple models as future work. Further investigation into more advanced merging algorithms and their impact on IP protection methods is needed. large language models, model merging, ip protection, watermarking, fingerprinting
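For reference, two of the merging schemes evaluated (model soups and task arithmetic) reduce to simple arithmetic over model weights. The sketch below is an illustrative rendering of that idea, not the paper's code, and assumes all models share an identical architecture and state-dict layout.

```python
# Illustrative sketch of two common merging schemes applied to state dicts.
import torch

def model_soup(state_dicts):
    """Uniform average of parameters across models with identical architecture."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

def task_arithmetic(base_sd, finetuned_sds, lam=0.5):
    """merged = base + lam * sum_i (finetuned_i - base): add up the 'task vectors'."""
    merged = {}
    for key in base_sd:
        task_vec = sum(sd[key].float() - base_sd[key].float() for sd in finetuned_sds)
        merged[key] = base_sd[key].float() + lam * task_vec
    return merged
```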
2404.05072 Report Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind Chiara Plizzari, Shubham Goel, Toby Perrett, Jacob Chalk, Angjoo Kanazawa, Dima Damen As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind - 3D tracking active objects using observations captured through an egocentric camera. We introduce Lift, Match and Keep (LMK), a method which lifts partial 2D observations to 3D world coordinates, matches them over time using visual appearance, 3D location and interactions to form object tracks, and keeps these object tracks even when they go out-of-view of the camera - hence keeping in mind what is out of sight. We test LMK on 100 long videos from EPIC-KITCHENS. Our results demonstrate that spatial cognition is critical for correctly locating objects over short and long time scales. E.g., for one long egocentric video, we estimate the 3D location of 50 active objects. Of these, 60% can be correctly positioned in 3D after 2 minutes of leaving the camera view. Introduces "Out of Sight, Not Out of Mind" (OSNOM) task and the Lift, Match, and Keep (LMK) method for 3D tracking of active objects in egocentric videos, even when out-of-view. Spatial cognition, the ability to track objects even when unseen, is crucial for humans and essential for building AI agents that can understand and interact with the world like humans do. LMK lifts 2D object observations to 3D using scene geometry and depth, matches these 3D observations over time using appearance and location, and maintains object tracks even when objects are out-of-sight. LMK significantly outperforms baselines, demonstrating the importance of 3D tracking and object permanence for this task. The method achieves 64% accuracy in locating objects after 1 minute and 37% accuracy after 10 minutes of being out-of-view. Combining visual appearance and 3D location information is crucial for robust tracking, especially in cluttered scenes. The current work relies on ground-truth masks; future work will focus on incorporating object detectors for end-to-end learning. Future research will explore extending OSNOM to multiple videos over longer timescales and investigate its applicability in real-world assistive scenarios. egocentric vision, object tracking, 3d understanding, spatial cognition, object permanence
2404.05014 Report MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions. A largely overlooked problem in T2V is that existing models have not adequately encoded physical knowledge of the real world, thus generated videos tend to have limited motion and poor variations. In this paper, we propose MagicTime, a metamorphic time-lapse video generation model, which learns real-world physics knowledge from time-lapse videos and implements metamorphic generation. First, we design a MagicAdapter scheme to decouple spatial and temporal training, encode more physical knowledge from metamorphic videos, and transform pre-trained T2V models to generate metamorphic videos. Second, we introduce a Dynamic Frames Extraction strategy to adapt to metamorphic time-lapse videos, which have a wider variation range and cover dramatic object metamorphic processes, thus embodying more physical knowledge than general videos. Finally, we introduce a Magic Text-Encoder to improve the understanding of metamorphic video prompts. Furthermore, we create a time-lapse video-text dataset called ChronoMagic, specifically curated to unlock the metamorphic video generation ability. Extensive experiments demonstrate the superiority and effectiveness of MagicTime for generating high-quality and dynamic metamorphic videos, suggesting time-lapse video generation is a promising path toward building metamorphic simulators of the physical world. This paper introduces MagicTime, a novel framework designed to enhance text-to-video generation models by incorporating physical world knowledge, specifically enabling them to generate metamorphic time-lapse videos. Existing T2V models struggle to produce videos that accurately depict complex real-world processes like melting or blooming due to limited encoding of physical knowledge. MagicTime aims to address this limitation by leveraging the characteristics of metamorphic videos, which comprehensively capture object transformation. MagicTime utilizes a MagicAdapter scheme for decoupled spatial and temporal training, leveraging time-lapse videos. It introduces Dynamic Frames Extraction to prioritize metamorphic features and a Magic Text-Encoder for better understanding metamorphic prompts. A new time-lapse dataset, ChronoMagic, is also created for training and evaluation. MagicTime successfully generates high-quality metamorphic videos demonstrating a clear understanding of physical processes like melting and blooming. Quantitative analysis shows MagicTime outperforms existing T2V methods in metrics such as FID, FVD, and CLIP Similarity, indicating superior video quality and text-alignment. Human evaluation confirms MagicTime's superiority in producing visually appealing and semantically accurate metamorphic videos compared to baseline models. Current evaluation metrics for T2V models may not fully encapsulate the nuances of metamorphic video generation, necessitating more robust evaluation methods. Expanding the ChronoMagic dataset with diverse scenarios and complexities can further enhance the generalization capabilities of MagicTime. text-to-video generation, metamorphic videos, time-lapse videos, physical knowledge encoding, diffusion models
2404.04956 Report Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models Zijin Yang, Kai Zeng, Kejiang Chen, Han Fang, Weiming Zhang, Nenghai Yu Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. However, existing methods often compromise the model performance or require additional training, which is undesirable for operators and users. To address this issue, we propose Gaussian Shading, a diffusion model watermarking technique that is both performance-lossless and training-free, while serving the dual purpose of copyright protection and tracing of offending content. Our watermark embedding is free of model parameter modifications and thus is plug-and-play. We map the watermark to latent representations following a standard Gaussian distribution, which is indistinguishable from latent representations obtained from the non-watermarked diffusion model. Therefore we can achieve watermark embedding with lossless performance, for which we also provide theoretical proof. Furthermore, since the watermark is intricately linked with image semantics, it exhibits resilience to lossy processing and erasure attempts. The watermark can be extracted by Denoising Diffusion Implicit Models (DDIM) inversion and inverse sampling. We evaluate Gaussian Shading on multiple versions of Stable Diffusion, and the results demonstrate that Gaussian Shading not only is performance-lossless but also outperforms existing methods in terms of robustness. This paper introduces Gaussian Shading, a novel watermarking technique for diffusion models that is both performance-lossless and training-free. This technique addresses the critical need for copyright protection and content authentication of AI-generated images without compromising the quality of the generated output. The method maps a watermark to latent representations following a standard Gaussian distribution, ensuring it remains indistinguishable from non-watermarked representations. This watermark is then diffused throughout the image semantics during the generation process, making it robust to alterations. Gaussian Shading maintains high true positive rates (over 99%) in watermark detection even under significant noise perturbation. The method exhibits superior robustness in traceability tasks, achieving over 97% bit accuracy under various attacks. Unlike existing methods, Gaussian Shading demonstrates statistically insignificant impact on the visual quality and image-text similarity of generated images, effectively preserving model performance. The current implementation relies on DDIM inversion, limiting its applicability to diffusion models utilizing continuous-time samplers based on ODE solvers. The method's reliance on stream ciphers necessitates secure key management and assumes the model is not publicly accessible to prevent forgery attacks. watermarking, diffusion models, copyright protection, content authentication, ai-generated images
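The "performance-lossless" embedding can be illustrated with a much-simplified sketch: each watermark bit picks one half of the standard Gaussian, and the latent entry is drawn from that half by inverse-CDF sampling, so the watermarked latent still follows N(0, 1) marginally. The actual method additionally encrypts the bits with a stream cipher and recovers the latent via DDIM inversion; the functions below are hypothetical.

```python
# Heavily simplified sketch of distribution-preserving watermark embedding
# (not the paper's exact construction).
import torch

_normal = torch.distributions.Normal(0.0, 1.0)

def embed_bits(bits):
    """bits: 0/1 tensor shaped like the initial latent to be generated."""
    u = torch.rand(bits.shape)                        # uniform in [0, 1)
    u = (bits.float() + u) / 2.0                      # bit 0 -> [0, 0.5), bit 1 -> [0.5, 1)
    return _normal.icdf(u.clamp(1e-6, 1 - 1e-6))      # inverse CDF: marginally N(0, 1)

def extract_bits(recovered_latent):
    """Recover bits from a latent obtained by inverting the sampling process."""
    return (recovered_latent >= 0).long()             # sign tells which half was used
```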
2404.04946 Report AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment Yuanfeng Xu, Yuhao Chen, Zhongzhan Huang, Zijian He, Guangrun Wang, Philip Torr, Liang Lin Recent video editing advancements rely on accurate pose sequences to animate subjects. However, these efforts are not suitable for cross-species animation due to pose misalignment between species (for example, the poses of a cat differs greatly from that of a pig due to differences in body structure). In this paper, we present AnimateZoo, a zero-shot diffusion-based video generator to address this challenging cross-species animation issue, aiming to accurately produce animal animations while preserving the background. The key technique used in our AnimateZoo is subject alignment, which includes two steps. First, we improve appearance feature extraction by integrating a Laplacian detail booster and a prompt-tuning identity extractor. These components are specifically designed to capture essential appearance information, including identity and fine details. Second, we align shape features and address conflicts from differing subjects by introducing a scale-information remover. This ensures accurate cross-species animation. Moreover, we introduce two high-quality animal video datasets featuring a wide variety of species. Trained on these extensive datasets, our model is capable of generating videos characterized by accurate movements, consistent appearance, and high-fidelity frames, without the need for the pre-inference fine-tuning that prior arts required. Extensive experiments showcase the outstanding performance of our method in cross-species action following tasks, demonstrating exceptional shape adaptation capability. The project page is available at https://justinxu0.github.io/AnimateZoo/. AnimateZoo, a zero-shot diffusion-based video generator for cross-species animation using pose sequences from different animals, preserving background and enabling accurate action inheritance. Addresses the limitations of existing intra-species animation methods that struggle with pose misalignment between species, enabling cross-species animation with accurate movements, consistent appearance, and high-fidelity frames. Employs subject alignment through: 1) Laplacian detail booster and prompt-tuning identity extractor for appearance feature extraction, 2) Scale-information remover to align shape features and address conflicts from differing subjects. Generates videos with accurate movements and consistent appearance across different animal species. Preserves background information from the source video while animating the target subject. Outperforms existing methods in cross-species animation tasks, demonstrating superior shape adaptability. Struggles with accurately depicting interactions between multiple objects, particularly in cases of occlusion. Reliance on accurate segmentation of the reference subject, which may be challenging in complex scenes. video editing, cross-species animation, subject alignment, diffusion models, zero-shot learning
2404.04913 Report CodecNeRF: Toward Fast Encoding and Decoding, Compact, and High-quality Novel-view Synthesis Gyeongjin Kang, Younggeun Lee, Eunbyung Park Neural Radiance Fields (NeRF) have achieved huge success in effectively capturing and representing 3D objects and scenes. However, several factors have impeded its further proliferation as next-generation 3D media. To establish a ubiquitous presence in everyday media formats, such as images and videos, it is imperative to devise a solution that effectively fulfills three key objectives: fast encoding and decoding time, compact model sizes, and high-quality renderings. Despite significant advancements, a comprehensive algorithm that adequately addresses all objectives has yet to be fully realized. In this work, we present CodecNeRF, a neural codec for NeRF representations, consisting of a novel encoder and decoder architecture that can generate a NeRF representation in a single forward pass. Furthermore, inspired by the recent parameter-efficient finetuning approaches, we develop a novel finetuning method to efficiently adapt the generated NeRF representations to a new test instance, leading to high-quality image renderings and compact code sizes. The proposed CodecNeRF, a newly suggested encoding-decoding-finetuning pipeline for NeRF, achieved unprecedented compression performance of more than 150x and 20x reduction in encoding time while maintaining (or improving) the image quality on widely used 3D object datasets, such as ShapeNet and Objaverse. CodecNeRF, a neural codec for NeRF representation with novel encoder and decoder architectures for fast encoding/decoding and compact model size, and a parameter-efficient finetuning method for high-quality rendering. To enable ubiquitous presence of NeRF in everyday media formats, addressing the need for fast encoding/decoding, compact model sizes, and high-quality renderings. An encoder-decoder architecture generates a NeRF representation in a single forward pass. Parameter-efficient finetuning adapts the representation to new instances using low-rank adaptation and tensor factorization. Entropy coding further compresses finetuned parameters. Achieved over 150x compression and 20x encoding speedup compared to per-scene optimization. Maintains or improves image quality on ShapeNet and Objaverse datasets. Demonstrated superior generalization performance and fast convergence compared to existing methods. Extending to complex scenes like large-scale environments requires further exploration. Supporting other NeRF representations, such as instant NGP, necessitates architectural modifications. neural radiance fields, nerf compression, neural codec, parameter-efficient finetuning, novel view synthesis
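The parameter-efficient finetuning step can be illustrated with a generic low-rank adapter around a frozen linear layer. This is a standard LoRA-style sketch offered under the assumption that CodecNeRF's per-instance updates take a comparable low-rank form; it is not the paper's exact module.

```python
# Generic low-rank adaptation sketch: a frozen linear layer plus a trainable
# low-rank residual, the kind of update used for per-instance finetuning.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # frozen backbone weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                 # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```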
2404.04908 Report Dual-Camera Smooth Zoom on Mobile Phones Renlong Wu, Zhilu Zhang, Yu Yang, Wangmeng Zuo When zooming between dual cameras on a mobile, noticeable jumps in geometric content and image color occur in the preview, inevitably affecting the user's zoom experience. In this work, we introduce a new task, i.e., dual-camera smooth zoom (DCSZ), to achieve a smooth zoom preview. The frame interpolation (FI) technique is a potential solution but struggles with ground-truth collection. To address the issue, we suggest a data factory solution where continuous virtual cameras are assembled to generate DCSZ data by rendering reconstructed 3D models of the scene. In particular, we propose a novel dual-camera smooth zoom Gaussian Splatting (ZoomGS), where a camera-specific encoding is introduced to construct a specific 3D model for each virtual camera. With the proposed data factory, we construct a synthetic dataset for DCSZ, and we utilize it to fine-tune FI models. In addition, we collect real-world dual-zoom images without ground-truth for evaluation. Extensive experiments are conducted with multiple FI methods. The results show that the fine-tuned FI models achieve a significant performance improvement over the original ones on the DCSZ task. The datasets, codes, and pre-trained models will be publicly available. This paper proposes Dual-Camera Smooth Zoom (DCSZ) to generate a fluid zoom preview on mobile phones, addressing the issue of jumps in geometric content and color when switching between dual cameras. The jumps that occur during zoom preview on smartphones with dual cameras significantly impact user experience. This work provides a solution to create a smoother and more visually appealing zoom transition. The proposed method uses a data factory approach. It leverages 3D reconstruction to generate virtual camera views between the actual ultra-wide and wide cameras. This synthetic data is then used to fine-tune existing frame interpolation models for smoother zoom transitions. Fine-tuned FI models show significant performance improvement over pre-trained models on DCSZ. The proposed ZoomGS method for constructing camera-specific 3D models outperforms standard 3DGS. The synthetic data generated by the data factory effectively improves FI model performance in real-world scenarios. The generalization of the fine-tuned FI model to other mobile devices with different dual-camera setups needs further investigation. Future work can explore optimizing the data factory for real-time performance on mobile devices. dual-camera zoom, frame interpolation, 3d reconstruction, gaussian splatting, smooth zoom
2404.04875 Report NeRF2Points: Large-Scale Point Cloud Generation From Street Views' Radiance Field Optimization Peng Tu, Xun Zhou, Mingming Wang, Xiaojun Yang, Bo Peng, Ping Chen, Xiu Su, Yawen Huang, Yefeng Zheng, Chang Xu Neural Radiance Fields (NeRF) have emerged as a paradigm-shifting methodology for the photorealistic rendering of objects and environments, enabling the synthesis of novel viewpoints with remarkable fidelity. This is accomplished through the strategic utilization of object-centric camera poses characterized by significant inter-frame overlap. This paper explores a compelling, alternative utility of NeRF: the derivation of point clouds from aggregated urban landscape imagery. The transmutation of street-view data into point clouds is fraught with complexities, attributable to a nexus of interdependent variables. First, high-quality point cloud generation hinges on precise camera poses, yet many datasets suffer from inaccuracies in pose metadata. Also, the standard approach of NeRF is ill-suited for the distinct characteristics of street-view data from autonomous vehicles in vast, open settings. Autonomous vehicle cameras often record with limited overlap, leading to blurring, artifacts, and compromised pavement representation in NeRF-based point clouds. In this paper, we present NeRF2Points, a tailored NeRF variant for urban point cloud synthesis, notable for its high-quality output from RGB inputs alone. Our paper is supported by a bespoke, high-resolution 20-kilometer urban street dataset, designed for point cloud generation and evaluation. NeRF2Points adeptly navigates the inherent challenges of NeRF-based point cloud synthesis through the implementation of the following strategic innovations: (1) Integration of Weighted Iterative Geometric Optimization (WIGO) and Structure from Motion (SfM) for enhanced camera pose accuracy, elevating street-view data precision. (2) Layered Perception and Integrated Modeling (LPiM) is designed for distinct radiance field modeling in urban environments, resulting in coherent point cloud representations. This paper presents NeRF2Points, a novel NeRF-based framework designed for generating high-quality, dense point clouds from street-view imagery, offering a cost-effective alternative to lidar systems. Generating point clouds from street-view imagery is crucial for autonomous navigation, enhancing driving recognition algorithms, and improving simulation and data annotation. Existing methods struggle with inaccuracies in camera poses and the unique characteristics of street-view data. NeRF2Points addresses challenges by: (1) Using a combination of WIGO and SfM for enhanced camera pose accuracy. (2) Implementing Layered Perception and Integrated Modeling (LPiM) to model road and street scene point clouds separately and then merge them. (3) Introducing geometric-aware consistency regularization (spatial dynamic and temporal invariant consistency) to address artifacts caused by sparse viewpoints. NeRF2Points outperforms state-of-the-art NeRF methods in terms of point cloud accuracy (Chamfer Distance) and image quality (PSNR and SSIM) on a new 20km street-view dataset. The LPiM strategy effectively addresses pavement collapse, a common issue when generating point clouds from street-view data. Geometric-aware consistency regularization significantly reduces artifacts like floaters, blurriness, and geometric inconsistencies. The impact of temporal invariant consistency regularization, while positive, is relatively small compared to other components. Future work will explore 4D point cloud reconstruction using NeRF2Points. neural radiance fields, point cloud generation, street views, self-driving, 3d reconstruction
2404.04860 Report ByteEdit: Boost, Comply and Accelerate Generative Image Editing Yuxi Ren, Jie Wu, Yanzuo Lu, Huafeng Kuang, Xin Xia, Xionghui Wang, Qianqian Wang, Yixing Zhu, Pan Xie, Shiyin Wang, Xuefeng Xiao, Yitong Wang, Min Zheng, Lean Fu Recent advancements in diffusion-based generative image editing have sparked a profound revolution, reshaping the landscape of image outpainting and inpainting tasks. Despite these strides, the field grapples with inherent challenges, including: i) inferior quality; ii) poor consistency; iii) insufficient instruction adherence; iv) suboptimal generation efficiency. To address these obstacles, we present ByteEdit, an innovative feedback learning framework meticulously designed to Boost, Comply, and Accelerate Generative Image Editing tasks. ByteEdit seamlessly integrates image reward models dedicated to enhancing aesthetics and image-text alignment, while also introducing a dense, pixel-level reward model tailored to foster coherence in the output. Furthermore, we propose a pioneering adversarial and progressive feedback learning strategy to expedite the model's inference speed. Through extensive large-scale user evaluations, we demonstrate that ByteEdit surpasses leading generative image editing products, including Adobe, Canva, and MeiTu, in both generation quality and consistency. ByteEdit-Outpainting exhibits a remarkable enhancement of 388% and 135% in quality and consistency, respectively, when compared to the baseline model. Experiments also verified that our acceleration models maintain excellent performance in terms of quality and consistency. Introduces ByteEdit, a novel feedback learning framework to enhance diffusion-based generative image editing in terms of generation quality, consistency, instruction adherence, and speed. Addresses the limitations of existing diffusion-based image editing methods, which often suffer from inferior quality, poor consistency, weak instruction adherence, and slow generation speed. Utilizes perceptual feedback learning (PeFL) with aesthetic, alignment, and coherent reward models trained on large datasets with human feedback. It also introduces adversarial and progressive training strategies to accelerate the generation process. Significantly outperforms existing state-of-the-art generative image editing products like Adobe, Canva, and MeiTu in terms of generation quality and consistency. Demonstrates a remarkable improvement of 388% and 135% in quality and consistency for outpainting compared to the baseline model. Achieves acceleration in inference speed while maintaining excellent performance in terms of quality and consistency. Exploring more targeted reward models tailored to specific editing tasks to enhance performance. Investigating further integration with advanced techniques like LCM and SDXL-turbo for even faster processing speeds. image editing, generative models, diffusion models, feedback learning, image outpainting, image inpainting
2404.04828 Report Strictly-ID-Preserved and Controllable Accessory Advertising Image Generation Youze Xue, Binghui Chen, Yifeng Geng, Xuansong Xie, Jiansheng Chen, Hongbing Ma Customized generative text-to-image models have the ability to produce images that closely resemble a given subject. However, in the context of generating advertising images for e-commerce scenarios, it is crucial that the generated subject's identity aligns perfectly with the product being advertised. In order to address the need for strictly-ID preserved advertising image generation, we have developed a Control-Net based customized image generation pipeline and have taken earring model advertising as an example. Our approach facilitates a seamless interaction between the earrings and the model's face, while ensuring that the identity of the earrings remains intact. Furthermore, to achieve a diverse and controllable display, we have proposed a multi-branch cross-attention architecture, which allows for control over the scale, pose, and appearance of the model, going beyond the limitations of text prompts. Our method manages to achieve fine-grained control of the generated model's face, resulting in controllable and captivating advertising effects. This paper proposes a Control-Net based pipeline for generating advertising images of accessories (specifically earrings) that strictly preserves the product's identity while offering fine-grained control over the model wearing them. Existing customized text-to-image models struggle to strictly maintain product identity, crucial for e-commerce advertising. Current methods either fail to perfectly preserve product appearance or lack control over the model's pose, scale, and appearance for optimal advertising impact. The pipeline leverages Control-Net with the earring image as conditioning, training on earring-model images to generate a contextually appropriate model face. It employs a multi-branch cross-attention architecture to control the model's scale, pose, and appearance independently. A standard-deviation based normalization (STD-Norm) mechanism and a time-dependent weighting (TDW) strategy balance the influence of different control branches. The method generates strictly ID-preserved earring-model images, accurately retaining earring shape, size, and appearance. It achieves fine-grained control over the model's scale, pose, and appearance, surpassing textual control methods in accuracy and flexibility. Quantitative experiments and user studies confirm the superiority of the method in ID preservation, control effectiveness, and overall image quality compared to existing alternatives. Current implementation relies on copying the earring image for strict ID-preservation, limiting automatic adjustments to earring rotation and lighting. Future work will explore techniques to enable automatic adaptation of the earring image while maintaining strict ID-preservation. generative models, control-net, strictly-id-preservation, advertising image generation, e-commerce
2404.04650 Report InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, Di Huang Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, the imperative of achieving a seamless alignment between the generated image and the provided prompt persists as a formidable challenge. This paper traces the root of these difficulties to invalid initial noise, and proposes a solution in the form of Initial Noise Optimization (InitNO), a paradigm that refines this noise. Considering text prompts, not all random noises are effective in synthesizing semantically-faithful images. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise, bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at https://github.com/xiefan-guo/initno. This paper proposes Initial Noise Optimization (InitNO) to improve the semantic fidelity of text-to-image diffusion models by optimizing the initial latent noise. Existing diffusion models often produce images misaligned with text prompts, exhibiting subject neglect, mixing, and incorrect attribute binding. InitNO partitions the initial latent space into valid/invalid regions based on cross-attention response and self-attention conflict scores. Then, it optimizes the noise distribution to steer it into the valid region while maintaining consistency with the standard Gaussian distribution. InitNO significantly improves image-text alignment compared to state-of-the-art methods as measured by CLIP similarity scores. User studies confirm that InitNO generates more visually appealing and semantically accurate images. InitNO is a plug-and-play method, seamlessly integrating into existing diffusion models for training-free controllable generation tasks like layout-to-image synthesis. InitNO incurs higher computational cost than the baseline Stable Diffusion model due to the noise optimization procedure. The selection of target tokens currently relies on manual input or external language models, which could be automated in future work. text-to-image synthesis, diffusion models, latent space optimization, semantic fidelity, attention mechanisms
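As an illustration of the overall recipe (with the paper's specific scores abstracted away), initial noise optimization amounts to gradient descent on an attention-derived loss plus a regularizer that keeps the noise statistics close to a standard Gaussian. In the sketch below, `attention_loss` is a hypothetical placeholder assumed to run a denoising step and return the combined cross-attention response and self-attention conflict score.

```python
# Hedged sketch of the general recipe, not the exact InitNO objective.
import torch

def optimize_initial_noise(latent, attention_loss, steps=10, lr=1e-2, reg_weight=0.1):
    latent = latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        score = attention_loss(latent)                 # lower = prompt better reflected
        # Keep the optimized noise statistically close to N(0, I).
        reg = latent.mean().pow(2) + (latent.var() - 1.0).pow(2)
        loss = score + reg_weight * reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach()
```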
2404.04617 Report Empowering Image Recovery: A Multi-Attention Approach Juan Wen, Yawei Li, Chao Zhang, Weiyan Hou, Radu Timofte, Luc Van Gool We propose Diverse Restormer (DART), a novel image restoration method that effectively integrates information from various sources (long sequences, local and global regions, feature dimensions, and positional dimensions) to address restoration challenges. While Transformer models have demonstrated excellent performance in image restoration due to their self-attention mechanism, they face limitations in complex scenarios. Leveraging recent advancements in Transformers and various attention mechanisms, our method utilizes customized attention mechanisms to enhance overall performance. DART, our novel network architecture, employs windowed attention to mimic the selective focusing mechanism of human eyes. By dynamically adjusting receptive fields, it optimally captures the fundamental features crucial for image resolution reconstruction. Efficiency and performance balance are achieved through the LongIR attention mechanism for long sequence image restoration. Integration of attention mechanisms across feature and positional dimensions further enhances the recovery of fine details. Evaluation across five restoration tasks consistently positions DART at the forefront. Upon acceptance, we commit to providing publicly accessible code and models to ensure reproducibility and facilitate further research. The paper presents Diverse Restormer (DART), a novel image restoration method using a multi-attention transformer to integrate information from various sources. Existing transformer models, while effective, face limitations in handling complex scenarios and high-resolution images. This method aims to enhance image restoration by leveraging customized attention mechanisms. DART employs a SwinIR backbone with key additions: 1) LongIR attention for efficient long sequence processing, 2) feature dimension attention for emphasizing relevant features, and 3) positional dimension attention for focusing on specific image regions. DART achieves state-of-the-art results on synthetic data benchmarks for denoising and super-resolution, outperforming existing methods with fewer parameters. On real image restoration tasks like motion and defocus deblurring, DART surpasses current best-performing methods, showcasing its effectiveness on real-world challenges. Ablation studies confirm the contribution of each attention mechanism, and analysis highlights DART's ability to utilize a wider pixel range for superior reconstruction. The paper acknowledges the potential for further efficiency improvements in the DART model. Future work may explore extending DART to other image restoration tasks beyond the ones evaluated. image restoration, transformer, attention mechanism, deep learning, computer vision
2404.04562 Report Diffusion Time-step Curriculum for One Image to 3D Generation Xuanyu Yi, Zike Wu, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Hanwang Zhang Score distillation sampling (SDS) has been widely adopted to overcome the absence of unseen views in reconstructing 3D objects from a single image. It leverages pre-trained 2D diffusion models as teacher to guide the reconstruction of student 3D models. Despite their remarkable success, SDS-based methods often encounter geometric artifacts and texture saturation. We find out the crux is the overlooked indiscriminate treatment of diffusion time-steps during optimization: it unreasonably treats the student-teacher knowledge distillation to be equal at all time-steps and thus entangles coarse-grained and fine-grained modeling. Therefore, we propose the Diffusion Time-step Curriculum one-image-to-3D pipeline (DTC123), which involves both the teacher and student models collaborating with the time-step curriculum in a coarse-to-fine manner. Extensive experiments on NeRF4, RealFusion15, GSO and Level50 benchmark demonstrate that DTC123 can produce multi-view consistent, high-quality, and diverse 3D assets. Codes and more generation demos will be released in https://github.com/yxymessi/DTC123. This paper proposes DTC123, a novel diffusion time-step curriculum-based pipeline for enhancing the quality and consistency of single-image 3D generation using Score Distillation Sampling. Existing SDS-based methods for single-image 3D generation suffer from geometric artifacts and texture saturation due to the indiscriminate treatment of diffusion time-steps during optimization. DTC123 implements a coarse-to-fine optimization strategy guided by a diffusion time-step curriculum. This includes an annealed time-step schedule, progressive student representation (using NeRF and DMTet), and coarse-to-fine teacher guidance (combining Zero-1-to-3 and Stable Diffusion). DTC123 generates multi-view consistent, high-fidelity 3D assets, outperforming state-of-the-art methods on benchmarks like NeRF4, RealFusion15, and GSO. The proposed time-step curriculum significantly improves the robustness of the generation process, reducing failures like Janus faces and geometric distortions. DTC123 enables multi-instance generation and 3D editing through user-specified prompts. The current implementation relies on a two-stage approach with different 3D representations for efficiency, which can be further explored for end-to-end generation. Exploring more advanced teacher diffusion models and student 3D representations could further improve generation quality. 3d generation, score distillation sampling, diffusion models, single-image 3d reconstruction, time-step curriculum
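The time-step curriculum can be illustrated by annealing the range from which the diffusion time-step is sampled during score distillation, from large time-steps (coarse geometry) to small ones (fine texture). The schedule below is an assumed example, not the exact one used in DTC123, and `sds_step` is a hypothetical stand-in for one SDS update.

```python
# Minimal sketch of an annealed time-step schedule for SDS optimization.
import random

def annealed_timestep(iteration, total_iters, t_max=980, t_min=20):
    frac = iteration / max(total_iters - 1, 1)
    hi = int(t_max - frac * (t_max - t_min))   # upper bound shrinks as training proceeds
    lo = max(t_min, int(hi * 0.5))             # keep a sampling window below the bound
    return random.randint(lo, hi)

# Usage inside an SDS loop (sds_step is hypothetical):
# for it in range(total_iters):
#     t = annealed_timestep(it, total_iters)
#     sds_step(student_3d_model, teacher_diffusion, t)
```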
2404.04544 Report BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion Gwanghyun Kim, Hayeon Kim, Hoigi Seo, Dong Un Kang, Se Young Chun Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods have attempted to address only the training-size limit, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond the token limit of the diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. Project page: https://janeyeon.github.io/beyond-scene. This supplementary material details BeyondScene, a novel framework for generating high-resolution, human-centric scenes from text descriptions and poses. The method excels in capturing fine details and ensuring text-image correspondence. Existing methods for generating large-scale human-centric scenes struggle with limitations in capturing fine details, maintaining text-image consistency, and controlling human instance generation. BeyondScene addresses these challenges, presenting a solution for producing high-fidelity images that accurately represent complex scenes. BeyondScene employs a two-stage process: (1) Detailed Base Image Generation: Human instances are generated based on text and pose using SDXL-ControlNet-Openpose, segmented, and then seamlessly integrated into a background. (2) Instance-Aware Hierarchical Enlargement: The base image is progressively upsampled using High Frequency-Injected Forward Diffusion and Adaptive Joint Diffusion, employing adaptive stride and conditioning for detailed refinement. BeyondScene outperforms baselines in generating high-resolution (up to 8K) human-centric scenes with superior text-image correspondence and naturalness, as evidenced by user studies and MLLM-based evaluations. The method demonstrates superior performance in capturing fine details and anatomical accuracy compared to combining ControlNet with super-resolution techniques. BeyondScene achieves comparable or greater efficiency in terms of GPU memory usage and FLOPs compared to existing joint diffusion methods. While BeyondScene shows promising results, it currently relies on pretrained models like SDXL, potentially inheriting some of their limitations. Further research could explore expanding the framework to incorporate diverse human appearances, encompassing different ethnicities and body types. image generation, human-centric scene synthesis, high-resolution images, text-to-image generation, diffusion models
2404.04526 Report DATENeRF: Depth-Aware Text-based Editing of NeRFs Sara Rojas, Julien Philip, Kai Zhang, Sai Bi, Fujun Luan, Bernard Ghanem, Kalyan Sunkavalli Recent advancements in diffusion models have shown remarkable proficiency in editing 2D images based on text prompts. However, extending these techniques to edit scenes in Neural Radiance Fields (NeRF) is complex, as editing individual 2D frames can result in inconsistencies across multiple views. Our crucial insight is that a NeRF scene's geometry can serve as a bridge to integrate these 2D edits. Utilizing this geometry, we employ a depth-conditioned ControlNet to enhance the coherence of each 2D image modification. Moreover, we introduce an inpainting approach that leverages the depth information of NeRF scenes to distribute 2D edits across different images, ensuring robustness against errors and resampling challenges. Our results reveal that this methodology achieves more consistent, lifelike, and detailed edits than existing leading methods for text-driven NeRF scene editing. Presents DATENeRF, a method for consistent text-driven editing of NeRF scenes, leveraging depth-aware ControlNet and a novel projection inpainting scheme. Existing methods struggle to maintain consistency and quality when editing NeRFs using text prompts, leading to blurry textures and geometric artifacts. DATENeRF addresses these challenges by explicitly using depth information for consistent 2D edits, ultimately leading to higher-quality NeRF editing. DATENeRF utilizes a depth-conditioned ControlNet for inpainting masked regions of individual input images. To ensure consistency, the method projects edited pixels from a reference view to other views using the NeRF depth, employing a hybrid inpainting scheme to refine the results and mitigate reprojection artifacts. Finally, an edited NeRF is optimized using the consistent 2D edits. DATENeRF generates more realistic and detailed edits than previous methods, accurately reflecting the input text prompts. The use of depth information significantly improves the consistency of edits across multiple views. The method demonstrates faster convergence compared to state-of-the-art techniques like Instruct-NeRF2NeRF. Limited to edits that do not involve significant geometric changes in the scene. Performance depends on the accuracy of the NeRF geometry and the editing model's ability to generate content consistent with the depth maps, particularly in complex, large-scale scenes. nerf editing, text-based editing, diffusion models, controlnet, 3d scene editing
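The projection inpainting step rests on standard depth-based reprojection: edited reference-view pixels are lifted to 3D with the NeRF depth and mapped into other views. A NumPy sketch is below; the pinhole camera convention (+z forward, depth along the camera z-axis) and the absence of occlusion/visibility checks are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def reproject(depth_ref, K, c2w_ref, c2w_tgt):
    """Lift each reference-view pixel to 3D using its NeRF depth, then project
    it into a target view. depth_ref: (H, W); K: (3, 3) intrinsics;
    c2w_ref, c2w_tgt: (4, 4) camera-to-world poses.
    Returns target pixel coordinates of shape (H, W, 2)."""
    H, W = depth_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # back-project to reference camera space, then to world space
    cam_ref = (np.linalg.inv(K) @ pix.T) * depth_ref.reshape(1, -1)
    world = c2w_ref[:3, :3] @ cam_ref + c2w_ref[:3, 3:4]
    # transform into the target camera and project with K
    w2c_tgt = np.linalg.inv(c2w_tgt)
    cam_tgt = w2c_tgt[:3, :3] @ world + w2c_tgt[:3, 3:4]
    proj = K @ cam_tgt
    return (proj[:2] / proj[2:3]).T.reshape(H, W, 2)
```

In the full method, the reprojected edits seed other views and a hybrid inpainting pass cleans up resampling artifacts before the NeRF is re-optimized.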
2404.04478 Report Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang Transformers have catalyzed advancements in computer vision and natural language processing (NLP) fields. However, substantial computational complexity poses limitations for their application in long-context tasks, such as high-resolution image generation. This paper introduces a series of architectures adapted from the RWKV model used in NLP, with requisite modifications tailored for diffusion models applied to image generation tasks, referred to as Diffusion-RWKV. Similar to diffusion models built on Transformers, our model is designed to efficiently handle patchnified inputs in a sequence with extra conditions, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage manifests in its reduced spatial aggregation complexity, rendering it exceptionally adept at processing high-resolution images, thereby eliminating the necessity for windowing or group cached operations. Experimental results on both conditional and unconditional image generation tasks demonstrate that Diffusion-RWKV achieves performance on par with or surpasses existing CNN or Transformer-based diffusion models in FID and IS metrics while significantly reducing total FLOP usage. This paper introduces Diffusion-RWKV, adapting the RWKV architecture from NLP for image generation tasks using diffusion models. Diffusion-RWKV efficiently handles long-range dependencies in image data while maintaining linear computational complexity, making it a computationally efficient alternative to Transformer-based diffusion models. Transformers, while powerful, face limitations in high-resolution image generation due to their quadratic computational complexity, especially with long sequences. This necessitates exploring alternative architectures that offer comparable performance with reduced computational demands. Diffusion-RWKV leverages a bidirectional RWKV (Bi-RWKV) backbone for sequential image data processing. It incorporates modifications like image patchnification, skip connections between Bi-RWKV blocks, and different conditional information incorporation techniques (in-context, adaLN, adaLN-Zero). The study analyzes the computational complexity of Diffusion-RWKV and explores various model configurations and scaling options. Diffusion-RWKV achieves FID scores comparable to Transformer-based diffusion models (like DiT and U-ViT) on CIFAR-10 and CelebA datasets while using fewer parameters. Ablation studies demonstrate the impact of patch size, skip connections, and conditioning methods on model performance, with smaller patch sizes and the adaLN-Zero block proving beneficial. On ImageNet, Diffusion-RWKV exhibits strong performance for class-conditional image generation at resolutions of 256x256 and 512x512, achieving competitive FID scores with reduced computational cost compared to DiT. Future work could explore integrating advanced strategies from transformer-based models (e.g., from SiT) into the Diffusion-RWKV backbone. Further investigation into optimizing the model for even higher-resolution image generation is warranted. image generation, diffusion models, rwkv, linear complexity, transformer alternative
2404.04474 Report RoNet: Rotation-oriented Continuous Image Translation Yi Li, Xin Xie, Lina Lei, Haiyan Fu, Yanqing Guo The generation of smooth and continuous images between domains has recently drawn much attention in image-to-image (I2I) translation. A linear relationship is the basic assumption in most existing approaches, whether applied to features, models, or labels. However, the linear assumption becomes hard to satisfy as the element dimension increases, and it suffers from the limitation of having to obtain both ends of the line. In this paper, we propose a novel rotation-oriented solution and model the continuous generation with an in-plane rotation over the style representation of an image, achieving a network named RoNet. A rotation module is implanted in the generation network to automatically learn the proper plane while disentangling the content and the style of an image. To encourage realistic texture, we also design a patch-based semantic style loss that learns the different styles of similar objects in different domains. We conduct experiments on forest scenes (where the complex texture makes the generation very challenging), faces, streetscapes, and the iphone2dslr task. The results validate the superiority of our method in terms of visual quality and continuity. This paper proposes RoNet, a novel rotation-oriented network for continuous image-to-image translation that overcomes limitations of linear interpolation methods. Continuous image translation with smooth transitions between domains is challenging, and existing linear interpolation methods suffer from limitations like requiring both source and target domain data and struggling with high-dimensional data. RoNet uses a rotation module to learn a rotation plane for style representations, enabling continuous generation by rotating the representation within this plane. It also introduces a patch-based semantic style loss to improve texture realism. RoNet generates realistic and continuous image translations across various domains like seasons, time of day, and artistic styles. It outperforms existing methods in both visual quality and quantitative metrics like LPIPS, FID, and KID. Ablation studies demonstrate the effectiveness of each component, particularly the rotation module and the semantic style loss. The paper mainly focuses on cyclic translations and could explore more complex manifold representations. Further research on automatically handling imbalanced datasets in continuous image translation is promising. image-to-image translation, continuous generation, style representation, rotation module, semantic style loss
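The core idea, continuous translation as an in-plane rotation of the style code, can be sketched as follows. Here the plane basis b1, b2 is assumed to be given as two orthonormal vectors; in RoNet a rotation module learns the plane jointly with the disentangled content/style encoders.

```python
import math
import torch

def rotate_style(s: torch.Tensor, b1: torch.Tensor, b2: torch.Tensor, theta: float) -> torch.Tensor:
    """Rotate a style vector s (D,) by angle theta inside the plane spanned by
    orthonormal basis vectors b1, b2 (D,). The component of s orthogonal to
    the plane is kept fixed, so only the domain-related part of the style
    changes continuously."""
    a1, a2 = s @ b1, s @ b2                  # in-plane coordinates of s
    residual = s - a1 * b1 - a2 * b2         # out-of-plane component
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    a1_rot = a1 * cos_t - a2 * sin_t
    a2_rot = a1 * sin_t + a2 * cos_t
    return residual + a1_rot * b1 + a2_rot * b2
```

Sweeping theta from 0 to 2*pi traces a closed loop of styles, which is what makes cyclic translations (e.g. across the four seasons) possible without requiring "both ends of a line".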
2404.04469 Report Mixed-Query Transformer: A Unified Image Segmentation Architecture Pei Wang, Zhaowei Cai, Hao Yang, Ashwin Swaminathan, R. Manmatha, Stefano Soatto Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task. In this paper, we introduce the Mixed-Query Transformer (MQ-Former), a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights. To enable this, we propose a mixed query strategy, which can effectively and dynamically accommodate different types of objects without heuristic designs. In addition, the unified architecture allows us to use data augmentation with synthetic masks and captions to further improve model generalization. Experiments demonstrate that MQ-Former can not only effectively handle multiple segmentation datasets and tasks compared to specialized state-of-the-art models with competitive performance, but also generalize better to open-set segmentation tasks, evidenced by over 7 points higher performance than the prior art on the open-vocabulary SeginW benchmark. This paper introduces MQ-Former, a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights, enabled by a novel mixed query strategy. Existing unified segmentation models are limited to either separate weights for each dataset or a single task, hindering their ability to leverage diverse information across tasks and datasets for real-world open-world applications. MQ-Former employs a mixed query strategy combining learnable and conditional queries, dynamically matched to objects via Hungarian matching, eliminating the need for heuristic thing/stuff class distinction. It is trained jointly on multiple datasets and tasks, further enhanced by incorporating synthetic masks and captions. MQ-Former effectively handles multiple segmentation tasks and datasets with competitive performance compared to specialized models. It demonstrates superior generalization to open-set segmentation, outperforming the state-of-the-art by over 7 points on the SeginW benchmark. The use of synthetic data significantly improves performance, highlighting its potential for addressing data scarcity in segmentation. MQ-Former currently lacks explicit support for reasoning segmentation tasks requiring complex reasoning abilities. The paper doesn't explore cross-modality feature fusion, which could further enhance performance but at the cost of increased computational resources. image segmentation, unified architecture, multi-task learning, multi-dataset training, synthetic data
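Query-to-object assignment via Hungarian matching is the standard component the summary refers to; a compact SciPy sketch is below. The classification/L1-mask cost mix and the weights are illustrative stand-ins for the mask and dice costs typically used in mask-transformer matchers, not MQ-Former's exact cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(pred_cls_prob, pred_masks, gt_labels, gt_masks,
                  w_cls=2.0, w_mask=5.0):
    """Hungarian matching between N object queries and M ground-truth objects.
    pred_cls_prob: (N, num_classes) softmax scores; pred_masks: (N, H, W) in [0, 1];
    gt_labels: (M,) int class ids; gt_masks: (M, H, W) in {0, 1}.
    Returns (query_idx, gt_idx) arrays of matched pairs."""
    cls_cost = -pred_cls_prob[:, gt_labels]                                      # (N, M)
    mask_cost = np.abs(pred_masks[:, None] - gt_masks[None]).mean(axis=(2, 3))   # (N, M)
    cost = w_cls * cls_cost + w_mask * mask_cost
    return linear_sum_assignment(cost)
```

Because the assignment is computed per image without a thing/stuff split, both learnable and conditional queries can be matched to any object, which is the point of the mixed query strategy.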
2404.04465 Report Aligning Diffusion Models by Optimizing Human Utility Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka We present Diffusion-KTO, a novel approach for aligning text-to-image diffusion models by formulating the alignment objective as the maximization of expected human utility. Since this objective applies to each generation independently, Diffusion-KTO does not require collecting costly pairwise preference data nor training a complex reward model. Instead, our objective requires simple per-image binary feedback signals, e.g. likes or dislikes, which are abundantly available. After fine-tuning using Diffusion-KTO, text-to-image diffusion models exhibit superior performance compared to existing techniques, including supervised fine-tuning and Diffusion-DPO, both in terms of human judgment and automatic evaluation metrics such as PickScore and ImageReward. Overall, Diffusion-KTO unlocks the potential of leveraging readily available per-image binary signals and broadens the applicability of aligning text-to-image diffusion models with human preferences. Presents Diffusion-KTO, a novel approach for aligning text-to-image diffusion models with human preference using per-sample binary feedback (likes or dislikes) and without training a reward model. Addresses the limitations of existing alignment methods that require expensive pairwise preference data or complex reward model training. Leverages readily available binary feedback to improve the alignment and applicability of text-to-image models. Extends the human utility maximization framework to diffusion models by optimizing a utility function based on the implicit reward of each step in the denoising process. Explores various utility functions, finding the Kahneman & Tversky model most effective. Diffusion-KTO significantly improves image quality and alignment with human preferences, as judged by both human evaluation and automated metrics. Outperforms existing alignment methods, including supervised fine-tuning, Diffusion-DPO, and AlignProp, across various metrics like PickScore, HPS v2, and ImageReward. Demonstrates potential for aligning models with individual user preferences through synthetic experiments simulating custom heuristics. Inherits potential biases and limitations present in the training data and the base text-to-image model. Exploration of alternative utility functions and their impact on alignment remains an open question. text-to-image synthesis, diffusion models, human preference learning, utility maximization, binary feedback
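A minimal sketch of a Kahneman-Tversky-style utility objective on per-sample binary feedback is given below. It is deliberately simplified: the actual Diffusion-KTO objective works with per-denoising-step implicit rewards and a reference-point term, whereas here `logp_theta` and `logp_ref` are treated as single per-sample log-likelihood surrogates.

```python
import torch

def kto_style_loss(logp_theta: torch.Tensor, logp_ref: torch.Tensor,
                   liked: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Utility-maximization loss from binary feedback. The implicit reward is
    the log-ratio between the fine-tuned and frozen reference models; liked
    samples (liked=1) should receive high utility, disliked ones (liked=0) low.
    logp_theta, logp_ref, liked: tensors of shape (B,)."""
    reward = beta * (logp_theta - logp_ref)
    sign = liked.float() * 2.0 - 1.0          # +1 for likes, -1 for dislikes
    utility = torch.sigmoid(sign * reward)    # bounded, KTO-style utility
    return (1.0 - utility).mean()             # minimizing this maximizes expected utility
```

Because each sample contributes independently, no pairwise preference data or learned reward model is needed, which is the practical appeal highlighted in the summary.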
2404.04421 Report PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations Yang Zheng, Qingqing Zhao, Guandao Yang, Wang Yifan, Donglai Xiang, Florian Dubost, Dmitry Lagun, Thabo Beeler, Federico Tombari, Leonidas Guibas, Gordon Wetzstein Modeling and rendering photorealistic avatars is of crucial importance in many applications. Existing methods that build a 3D avatar from visual observations, however, struggle to reconstruct clothed humans. We introduce PhysAvatar, a novel framework that combines inverse rendering with inverse physics to automatically estimate the shape and appearance of a human from multi-view video data along with the physical parameters of the fabric of their clothes. For this purpose, we adopt a mesh-aligned 4D Gaussian technique for spatio-temporal mesh tracking as well as a physically based inverse renderer to estimate the intrinsic material properties. PhysAvatar integrates a physics simulator to estimate the physical parameters of the garments using gradient-based optimization in a principled manner. These novel capabilities enable PhysAvatar to create high-quality novel-view renderings of avatars dressed in loose-fitting clothes under motions and lighting conditions not seen in the training data. This marks a significant advancement towards modeling photorealistic digital humans using physically based inverse rendering with physics in the loop. Our project website is at: https://qingqing-zhao.github.io/PhysAvatar Introduces PhysAvatar, a novel framework that reconstructs dressed 3D avatars from multi-view video, accurately modeling garment physics and appearance. Existing methods struggle to realistically model loose-fitting clothes, neglecting physically accurate garment dynamics. Combines mesh tracking with 4D Gaussians, physics-based parameter optimization using a simulator, and appearance refinement via inverse rendering. Outperforms state-of-the-art methods in geometry accuracy, capturing fine wrinkle details. Achieves superior appearance quality, particularly in capturing high-frequency details. Enables animation, relighting, and redressing, and is compatible with traditional graphics pipelines. Relies on manual garment segmentation and mesh UV unwrapping. Limited by the accuracy of the SMPL-X body model used as a collider. neural rendering, physics simulation, 3d avatar, inverse rendering, garment modeling
2404.04376 Report ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing Alec Helbling, Seongmin Lee, Polo Chau Recently, researchers have proposed powerful systems for generating and manipulating images using natural language instructions. However, it is difficult to precisely specify many common classes of image transformations with text alone. For example, a user may wish to change the location and breed of a particular dog in an image with several similar dogs. This task is quite difficult with natural language alone, and would require a user to write a laboriously complex prompt that both disambiguates the target dog and describes the destination. We propose ClickDiffusion, a system for precise image manipulation and generation that combines natural language instructions with visual feedback provided by the user through a direct manipulation interface. We demonstrate that by serializing both an image and a multi-modal instruction into a textual representation it is possible to leverage LLMs to perform precise transformations of the layout and appearance of an image. Code available at https://github.com/poloclub/ClickDiffusion. ClickDiffusion, an interactive image editing system that combines natural language instructions with visual feedback through direct manipulation for precise image editing. Existing text-based image editing methods lack precision for complex tasks that require object disambiguation and specific location editing. Direct manipulation alone is inflexible and limited to predefined operations. The system serializes multi-modal instructions (text + bounding boxes) into a textual format, processes them using an LLM with in-context learning, manipulates an intermediate image layout, and generates the final edited image using layout-based image generation (GLIGEN). Enables precise object manipulation and appearance changes within complex scenes. Simplifies complex editing tasks compared to text-only methods. Leverages LLMs' few-shot learning abilities for generalization to diverse editing operations. Limited user study and quantitative evaluation. Reliance on layout-based image generation may inherit limitations of such methods. image editing, direct manipulation, natural language processing, large language models, human-computer interaction
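Serializing the image layout plus the multi-modal instruction into text is what lets an off-the-shelf LLM perform the edit via in-context learning. The JSON schema and prompt wording below are assumptions for illustration, not the exact format used by ClickDiffusion.

```python
import json

def serialize_instruction(objects, instruction):
    """Serialize an image layout and a user instruction into a single text
    prompt for an LLM. `objects` is a list of {"label": str, "box": [x0, y0, x1, y1]}
    entries with coordinates normalized to [0, 1]; objects the user clicked are
    referenced by their boxes inside the instruction string."""
    layout = {"objects": objects}
    return (
        "Current image layout (coordinates normalized to [0, 1]):\n"
        + json.dumps(layout, indent=2)
        + "\n\nUser instruction: " + instruction
        + "\n\nReturn the edited layout as JSON with the same schema."
    )

prompt = serialize_instruction(
    [{"label": "corgi", "box": [0.10, 0.55, 0.35, 0.90]},
     {"label": "husky", "box": [0.60, 0.50, 0.90, 0.95]}],
    "Move the dog at [0.10, 0.55, 0.35, 0.90] to the right and make it a golden retriever.",
)
```

The LLM's edited layout is then handed to a layout-conditioned generator such as GLIGEN to produce the final image.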
2404.04363 Report Idea-2-3D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs Junhao Chen, Xiang Li, Xiaojun Ye, Chao Li, Zhaoxin Fan, Hao Zhao In this paper, we pursue a novel 3D AIGC setting: generating 3D content from IDEAs. The definition of an IDEA is the composition of multimodal inputs including text, image, and 3D models. To our knowledge, this challenging and appealing 3D AIGC setting has not been studied before. We propose the novel framework called Idea-2-3D to achieve this goal, which consists of three agents based upon large multimodel models (LMMs) and several existing algorithmic tools for them to invoke. Specifically, these three LMM-based agents are prompted to do the jobs of prompt generation, model selection and feedback reflection. They work in a cycle that involves both mutual collaboration and criticism. Note that this cycle is done in a fully automatic manner, without any human intervention. The framework then outputs a text prompt to generate 3D models that well align with input IDEAs. We show impressive 3D AIGC results that are beyond any previous methods can achieve. For quantitative comparisons, we construct caption-based baselines using a whole bunch of state-of-the-art 3D AIGC models and demonstrate Idea-2-3D out-performs significantly. In 94.2% of cases, Idea-2-3D meets users' requirements, marking a degree of match between IDEA and 3D models that is 2.3 times higher than baselines. Moreover, in 93.5% of the cases, users agreed that Idea-2-3D was better than baselines. Codes, data and models will made publicly available. This paper introduces Idea-2-3D, a novel framework employing LMM (Large Multimodal Model) agents to automatically generate 3D models from interleaved multimodal inputs called IDEAs (combinations of text, images, and 3D models). Existing 3D AIGC models primarily rely on single-modality inputs (text or image) and struggle to capture the complexity of human creative ideas often expressed through a blend of modalities. Idea-2-3D bridges this gap, enabling a more natural and expressive way to design in 3D. Idea-2-3D leverages three LMM agents powered by GPT-4V for prompt generation, 3D model selection, and feedback reflection. It iteratively refines the generated 3D model by converting it into multi-view images, feeding them back to the LMM agents, and leveraging a memory module to learn from previous iterations. Idea-2-3D significantly outperforms caption-based T-2-3D baselines in user preference studies, demonstrating higher alignment with user IDEAs. The iterative self-refinement process in Idea-2-3D leads to incremental improvements in the generated 3D models, effectively capturing and translating complex multimodal design concepts. Ablation studies highlight the importance of the memory module, feedback mechanism, and storage of previous models in achieving high-quality and convergent 3D model generation. The reliance on closed-source LMMs like GPT-4V poses limitations in terms of accessibility and reproducibility. Future work could explore alternative open-source LMMs or investigate methods to reduce the dependency on proprietary models while maintaining performance. lmm agents, 3d aigc, automated 3d design, multimodal learning, generative ai
2404.04346 Report Koala: Key frame-conditioned long video-LLM Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition. Introduces Koala, a lightweight approach to adapt pretrained short video-LLMs to understand and answer questions about minutes-long videos by conditioning on sparsely sampled key frames. Existing video-LLMs, despite being trained on millions of short videos, struggle to understand and answer questions about minutes-long videos. Introduces Conditioned Segment (CS) and Conditioned Video (CV) tokenizer functions that leverage global video context from coarsely sampled key frames to aggregate fine-grained spatiotemporal information from local video segments. Outperforms state-of-the-art large models by 3-6% in absolute accuracy on zero-shot long video understanding benchmarks (EgoSchema and Seed-Bench). Demonstrates improved accuracy on short-term action recognition, suggesting benefits for both short and long-term video understanding. Empirical analysis highlights the importance of global context conditioning and spatiotemporal queries in the proposed tokenizer functions. Limited scalability to extremely long videos (e.g., movies) due to the maximum input token limit of pretrained LLMs. Potential for further improvement by incorporating curated descriptive annotations during the final finetuning stage. video-llm, long-form video understanding, spatiotemporal reasoning, key frame conditioning, multimodal learning
2404.04319 Report SpatialTracker: Tracking Any 2D Pixels in 3D Space Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, Xiaowei Zhou Recovering dense and long-range pixel motion in videos is a challenging problem. Part of the difficulty arises from the 3D-to-2D projection process, leading to occlusions and discontinuities in the 2D motion domain. While 2D motion can be intricate, we posit that the underlying 3D motion can often be simple and low-dimensional. In this work, we propose to estimate point trajectories in 3D space to mitigate the issues caused by image projection. Our method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth estimators, represents the 3D content of each frame efficiently using a triplane representation, and performs iterative updates using a transformer to estimate 3D trajectories. Tracking in 3D allows us to leverage as-rigid-as-possible (ARAP) constraints while simultaneously learning a rigidity embedding that clusters pixels into different rigid parts. Extensive evaluation shows that our approach achieves state-of-the-art tracking performance both qualitatively and quantitatively, particularly in challenging scenarios such as out-of-plane rotation. Presents SpatialTracker, a novel method for dense, long-range 2D motion tracking in videos by lifting pixels to 3D and performing tracking with a triplane representation and a learned as-rigid-as-possible (ARAP) constraint. Existing 2D tracking methods struggle with occlusions and complex deformations, issues which are mitigated by leveraging the inherent 3D nature of motion and 3D motion priors like the ARAP constraint. Uses monocular depth estimation to lift 2D pixels to 3D, represents each frame's 3D content with triplane feature maps, iteratively predicts 3D trajectories using a transformer, and enforces ARAP regularization with a learned rigidity embedding. Achieves state-of-the-art 2D tracking performance on TAP-Vid, BADJA, and PointOdyssey benchmarks. Shows superior qualitative results in handling complex motion and occlusions on challenging videos. Demonstrates accurate 3D trajectory estimation when ground truth depth is available. Performance relies on the accuracy of off-the-shelf monocular depth estimators. Future work can explore joint learning of depth and motion to further enhance tracking. motion tracking, 3d reconstruction, triplane representation, arap constraint, monocular depth estimation
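The ARAP regularizer can be sketched as a penalty on changes in pairwise 3D distances between points that the rigidity embedding groups together. Deriving affinities via a softmax over embedding dot products is an assumption made for this sketch; the paper learns the rigidity embedding jointly with the tracker.

```python
import torch

def arap_loss(tracks: torch.Tensor, rigidity: torch.Tensor) -> torch.Tensor:
    """As-rigid-as-possible regularizer for 3D trajectories.
    tracks: (T, N, 3) estimated 3D positions of N points over T frames.
    rigidity: (N, D) per-point rigidity embeddings; points in the same rigid
    part should keep their pairwise distances constant over time."""
    d = torch.cdist(tracks, tracks)                          # (T, N, N) pairwise distances
    affinity = torch.softmax(rigidity @ rigidity.T, dim=-1)  # (N, N) soft rigid grouping
    drift = (d - d[:1]).abs()                                # deviation from frame-0 distances
    return (affinity.unsqueeze(0) * drift).mean()
```

Points with high mutual affinity are pushed to move rigidly together, while points assigned to different parts are free to move independently.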
2404.04256 Report Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation Zifu Wan, Yuhao Wang, Silong Yong, Pingping Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable segmentation. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation, utilizing the Selective Structured State Space Model, Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields coverage with linear complexity. By employing a Siamese encoder and innovating a Mamba fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our method, Sigma, is rigorously evaluated on both RGB-Thermal and RGB-Depth segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma. Introduces Sigma, a Siamese Mamba network for multi-modal semantic segmentation using the Selective Structured State Space Model (Mamba) for efficient global receptive field coverage with linear complexity. Multi-modal semantic segmentation is crucial for AI agents in challenging conditions, but existing CNN and ViT-based methods have limitations in receptive field size and complexity. Mamba offers a solution with global receptive fields and linear complexity. Sigma uses a Siamese encoder with cascaded Visual State Space (VSS) Blocks for feature extraction. A fusion module with Cross and Concat Mamba Blocks aggregates information from different modalities. A channel-aware Mamba decoder refines the features for segmentation. Sigma outperforms state-of-the-art models on RGB-Thermal and RGB-Depth semantic segmentation tasks in terms of accuracy and efficiency. The proposed Mamba fusion mechanism effectively integrates multi-modal information while significantly reducing computational demand compared to Transformer-based fusion. Ablation studies validate the contribution of each component, especially the fusion module and the channel-aware decoder. Current implementation focuses on two modalities, potentially underutilizing Mamba's capacity for longer sequences. Memory consumption in the Mamba encoder, specifically the four-directional scanning, poses challenges for deployment on lightweight edge devices. multi-modal learning, semantic segmentation, state space models, vision mamba, siamese networks
2404.04242 Report Physical Property Understanding from Language-Embedded Feature Fields Albert J. Zhai, Yuan Shen, Emily Y. Chen, Gloria X. Wang, Xinlei Wang, Sheng Wang, Kaiyu Guan, Shenlong Wang Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper, we present a novel approach for dense prediction of the physical properties of objects using a collection of images. Inspired by how humans reason about physics through vision, we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate, annotation-free, and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks, such as estimating the mass of common objects, as well as other properties like friction and hardness. Presents NeRF2Physics, a training-free approach for uncertainty-aware dense prediction of physical properties from images. Crucial for various applications like robotics, agriculture, urban planning, and graphics to perceive physics from visual data. 1. Extracts a language-embedded point cloud from a neural radiance field fused with CLIP features. 2. Prompts an LLM to propose candidate materials and their properties. 3. Employs zero-shot CLIP-based kernel regression for per-point property estimation. 4. Aggregates properties for object-level estimates like mass using LLM-based thickness estimations. Outperforms baselines on mass estimation using the ABO dataset. Produces reasonable predictions for diverse physical properties like friction and hardness on a real-world dataset. Enables creation of physically realistic digital twins for various applications. Limited ability to reason about occluded object parts. Potential for material recognition errors when local appearances are ambiguous. physical property estimation, vision-language models, neural radiance fields, zero-shot learning, digital twins
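The zero-shot kernel regression step reduces to a similarity-weighted average over LLM-proposed candidate materials. A sketch assuming L2-normalized CLIP features, a single scalar property per material, and a fixed softmax temperature (all illustrative choices):

```python
import torch

def kernel_regress_property(point_feats: torch.Tensor,
                            material_text_feats: torch.Tensor,
                            material_values: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    """Zero-shot kernel regression of a physical property per 3D point.
    point_feats: (P, D) CLIP features fused onto the point cloud (L2-normalized).
    material_text_feats: (M, D) CLIP text embeddings of the LLM-proposed
    candidate materials (L2-normalized).
    material_values: (M,) property value assigned to each material
    (e.g. density in kg/m^3). Returns a (P,) per-point estimate."""
    sims = point_feats @ material_text_feats.T            # (P, M) cosine similarities
    weights = torch.softmax(sims / temperature, dim=-1)   # kernel weights per point
    return weights @ material_values                      # similarity-weighted average
```

Object-level quantities such as mass are then obtained by integrating the per-point estimates over the object volume, with thickness estimated by the LLM.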
2404.04211 Report Robust Gaussian Splatting François Darmon, Lorenzo Porzi, Samuel Rota-Bulò, Peter Kontschieder In this paper, we address common error sources for 3D Gaussian Splatting (3DGS) including blur, imperfect camera poses, and color inconsistencies, with the goal of improving its robustness for practical applications like reconstructions from handheld phone captures. Our main contribution involves modeling motion blur as a Gaussian distribution over camera poses, allowing us to address both camera pose refinement and motion blur correction in a unified way. Additionally, we propose mechanisms for defocus blur compensation and for addressing color inconsistencies caused by ambient light, shadows, or due to camera-related factors like varying white balancing settings. Our proposed solutions integrate in a seamless way with the 3DGS formulation while maintaining its benefits in terms of training efficiency and rendering speed. We experimentally validate our contributions on relevant benchmark datasets including Scannet++ and Deblur-NeRF, obtaining state-of-the-art results and thus consistent improvements over relevant baselines. The paper proposes a robust Gaussian Splatting (3DGS) method resilient to blur, imperfect camera poses, and color inconsistencies common in real-world captures. Existing neural rendering methods, including 3DGS, often falter with real-world data exhibiting blur, inaccurate camera poses, and color inconsistencies, limiting their practicality. The method models motion blur as a Gaussian distribution over camera poses, introduces a defocus blur compensation mechanism using an offset correction to the 2D Gaussian covariances, and addresses color inconsistencies via an RGB decoder function with per-image affine color transformations. The proposed approach surpasses baselines in perceptual metrics (SSIM, LPIPS) on a real-world benchmark derived from the ScanNet++ dataset. Ablation studies validate the individual contributions of color transformation, pose optimization, and blur modeling. While achieving competitive performance on the synthetic DeblurNeRF dataset, the method lags behind state-of-the-art NeRF-based methods, potentially due to their stronger regularization. The method doesn't address dynamic blur from non-static objects. The issue of poor 3DGS generalization to viewpoints far from the training trajectory remains unaddressed. neural rendering, 3d gaussian splatting, motion blur, defocus blur, color consistency
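The per-image color handling can be illustrated with a simple affine transform on the rendered RGB. This is a simplified stand-in for the paper's RGB decoder, with A and b assumed to be optimized per training image jointly with the Gaussians.

```python
import torch

def apply_affine_color(rendered: torch.Tensor, A: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Per-image affine color correction for rendered RGB values, compensating
    for exposure / white-balance differences between training photos.
    rendered: (H, W, 3); A: (3, 3); b: (3,)."""
    out = rendered.reshape(-1, 3) @ A.T + b
    return out.reshape(rendered.shape).clamp(0.0, 1.0)

# A is initialized to the identity and b to zero so the transform starts as a
# no-op; at test time the per-image transform can simply be dropped.
```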
2404.04057 Report Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, Hai Huang We introduce Score identity Distillation (SiD), an innovative data-free method that distills the generative capabilities of pretrained diffusion models into a single-step generator. SiD not only facilitates an exponentially fast reduction in Fréchet inception distance (FID) during distillation but also approaches or even exceeds the FID performance of the original teacher diffusion models. By reformulating forward diffusion processes as semi-implicit distributions, we leverage three score-related identities to create an innovative loss mechanism. This mechanism achieves rapid FID reduction by training the generator using its own synthesized images, eliminating the need for real data or reverse-diffusion-based generation, all accomplished within significantly shortened generation time. Upon evaluation across four benchmark datasets, the SiD algorithm demonstrates high iteration efficiency during distillation and surpasses competing distillation approaches, whether they are one-step or few-step, data-free, or dependent on training data, in terms of generation quality. This achievement not only redefines the benchmarks for efficiency and effectiveness in diffusion distillation but also in the broader field of diffusion-based generation. The PyTorch implementation is available at https://github.com/mingyuanzhou/SiD This paper introduces Score identity Distillation (SiD), a novel data-free method for distilling pretrained diffusion models into single-step generators, achieving fast distillation and high-quality generation exceeding the original model. Diffusion models, while powerful, suffer from slow multi-step generation. SiD addresses this by enabling single-step generation while maintaining or improving upon the original model's quality, offering significant efficiency gains. SiD leverages a novel perspective of forward diffusion as semi-implicit distributions. It introduces three score-related identities to formulate a loss mechanism, approximating a model-based score-matching loss using score estimation and Monte Carlo methods. SiD achieves exponentially fast reduction in FID during distillation, surpassing competing methods in efficiency. The SiD-trained single-step generator approaches or surpasses the FID performance of the original multi-step teacher diffusion model. Evaluations on four benchmark datasets show SiD's superior performance over existing single and multi-step generators, both data-free and data-dependent. SiD requires managing three networks during distillation, leading to higher memory demands than traditional diffusion model training. While SiD outperforms competitors on most benchmarks, further investigation is needed for scenarios like ImageNet 64x64, where it currently lags behind the teacher model. diffusion distillation, score matching, deep generative models, single-step generation, semi-implicit distributions
2404.04037 Report InstructHumans: Editing Animated 3D Human Textures with Instructions Jiayin Zhu, Linlin Yang, Angela Yao We present InstructHumans, a novel framework for instruction-driven 3D human texture editing. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models. This work shows that naively using such scores is harmful to editing as they destroy consistency with the source avatar. Instead, we propose an alternate SDS for Editing (SDS-E) that selectively incorporates subterms of SDS across diffusion timesteps. We further enhance SDS-E with spatial smoothness regularization and gradient-based viewpoint sampling to achieve high-quality edits with sharp and high-fidelity detailing. InstructHumans significantly outperforms existing 3D editing methods, consistent with the initial avatar while faithful to the textual instructions. Project page: https://jyzhu.top/instruct-humans . This paper introduces InstructHumans, a novel framework for instruction-driven 3D human texture editing, enabling users to modify animatable human avatars using text instructions. Existing text-driven 3D human editing methods are limited to non-animatable avatars or suffer from poor texture quality, failing to balance consistency with the original avatar and adherence to textual instructions. This work addresses these limitations by proposing a new method specifically tailored for editing animatable 3D humans with high fidelity and faithfulness to instructions. The paper proposes SDS for Editing (SDS-E), a customized score distillation sampling method for 3D editing. SDS-E selectively incorporates subterms of SDS across different diffusion timesteps, addressing the limitations of naive SDS application. The framework further enhances editing quality and efficiency using a Laplacian smoothness regularizer to maintain texture coherence and a gradient-aware viewpoint sampling strategy to optimize editing efforts. InstructHumans effectively edits 3D human textures based on text instructions while preserving the original avatar's identity and animation capability. The proposed SDS-E method successfully distills editing guidance by selectively applying SDS terms across timesteps, outperforming naive SDS in editing quality. The Laplacian smoothness regularizer and gradient-aware viewpoint sampling further improve the editing outcome and efficiency, respectively. The framework depends on a hybrid human representation that might limit capturing high-frequency details, leading to potential artifacts. Adopting higher-resolution mesh templates and training with larger datasets are suggested as solutions. The disentanglement of textural and geometric changes with 2D guidance remains a challenge. Future research could focus on addressing the texture-geometry ambiguity for more comprehensive 3D human editing. 3d human texture editing, text-guided editing, score distillation sampling, animatable avatars, diffusion models
2404.03836 Report PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model Amrin Kareem, Jean Lahoud, Hisham Cholakkal Recent advancements in 3D perception systems have significantly improved their ability to perform visual recognition tasks such as segmentation. However, these systems still heavily rely on explicit human instruction to identify target objects or categories, lacking the capability to actively reason and comprehend implicit user intentions. We introduce a novel segmentation task known as reasoning part segmentation for 3D objects, aiming to output a segmentation mask based on complex and implicit textual queries about specific parts of a 3D object. To facilitate evaluation and benchmarking, we present a large 3D dataset comprising over 60k instructions paired with corresponding ground-truth part segmentation annotations specifically curated for reasoning-based 3D part segmentation. We propose a model that is capable of segmenting parts of 3D objects based on implicit textual queries and generating natural language explanations corresponding to 3D object segmentation requests. Experiments show that our method achieves competitive performance to models that use explicit queries, with the additional abilities to identify part concepts, reason about them, and complement them with world knowledge. Our source code, dataset, and trained models are available at https://github.com/AmrinKareem/PARIS3D. This paper introduces a new 3D segmentation task called reasoning part segmentation, demanding models to understand implicit textual queries and reason about 3D object parts for segmentation. This task is important for developing intelligent perception systems capable of understanding implicit user intentions and reasoning in 3D contexts, crucial for applications like robotics and human-robot interaction. A new dataset, RPSeg3D, is created with reasoning-based instructions and corresponding part segmentation masks for 3D objects. PARIS3D, a multimodal LLM-based model, is proposed, which takes multi-view images of a 3D object and a textual query as input, leverages a vision backbone and LLM for reasoning and explanation generation, and outputs a segmentation mask and textual explanation. PARIS3D achieves competitive performance on RPSeg3D, demonstrating its ability to reason and segment based on implicit queries. The model outperforms baselines in 3D semantic segmentation, particularly when provided with 3D information in the queries. PARIS3D generalizes to real-world point clouds captured with smartphone LiDAR sensors, showing its practical applicability. The current model does not handle instance segmentation, presenting a direction for future work. Expanding the dataset to include more complex scenes and object interactions would further enhance the model's capabilities. 3d vision-language models, reasoning, 3d part segmentation, multimodal learning, dataset
2404.03799 Report Language-Guided Instance-Aware Domain-Adaptive Panoptic Segmentation Elham Amin Mansour, Ozan Unal, Suman Saha, Benjamin Bejar, Luc Van Gool The increasing relevance of panoptic segmentation is tied to the advancements in autonomous driving and AR/VR applications. However, the deployment of such models has been limited due to the expensive nature of dense data annotation, giving rise to unsupervised domain adaptation (UDA). A key challenge in panoptic UDA is reducing the domain gap between a labeled source and an unlabeled target domain while harmonizing the subtasks of semantic and instance segmentation to limit catastrophic interference. While considerable progress has been achieved, existing approaches mainly focus on the adaptation of semantic segmentation. In this work, we focus on incorporating instance-level adaptation via a novel instance-aware cross-domain mixing strategy IMix. IMix significantly enhances the panoptic quality by improving instance segmentation performance. Specifically, we propose inserting high-confidence predicted instances from the target domain onto source images, retaining the exhaustiveness of the resulting pseudo-labels while reducing the injected confirmation bias. Nevertheless, such an enhancement comes at the cost of degraded semantic performance, attributed to catastrophic forgetting. To mitigate this issue, we regularize our semantic branch by employing CLIP-based domain alignment (CDA), exploiting the domain-robustness of natural language prompts. Finally, we present an end-to-end model incorporating these two mechanisms called LIDAPS, achieving state-of-the-art results on all popular panoptic UDA benchmarks. This paper proposes LIDAPS, a novel language-guided instance-aware domain-adaptive panoptic segmentation model. It introduces two key components: IMix, an instance-aware cross-domain mixing strategy, and CDA, a CLIP-based domain alignment mechanism, to enhance panoptic segmentation performance in unsupervised domain adaptation. The deployment of panoptic segmentation models is limited by the expensive nature of dense data annotation and the domain gap between datasets. This work addresses these challenges by proposing a novel approach to adapt both semantic and instance segmentation tasks, improving the model's ability to generalize to new domains. The study introduces IMix, which pastes high-confidence predicted instances from the target domain onto source images, improving instance segmentation while reducing confirmation bias. To mitigate catastrophic forgetting in the semantic branch, CDA aligns both domains with a pre-trained CLIP model using per-pixel text similarity maps. LIDAPS achieves state-of-the-art results on popular panoptic UDA benchmarks, surpassing previous methods by up to +3.6 mPQ. IMix significantly improves instance segmentation performance by simplifying the recognition of target objects and ensuring exhaustive pseudo-label generation. CDA effectively mitigates catastrophic forgetting in the semantic branch by aligning source and target domains with the CLIP embedding space. The confidence threshold for pseudo-mask filtering in IMix needs to be manually determined for different source-target domain pairs. The refinement phase with IMix introduces additional computational overhead during training. unsupervised domain adaptation, panoptic segmentation, instance segmentation, semantic segmentation, clip
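IMix itself is conceptually simple: paste confident target-domain instance predictions onto a labeled source image so that the mixed image still has exhaustive labels. The sketch below handles only the image and the semantic map; building the matching instance-level pseudo-labels is omitted, and the confidence threshold is an assumed value rather than the one used in the paper.

```python
import numpy as np

def imix(source_img, source_sem, target_img, target_instances, conf_thresh=0.85):
    """Instance-aware cross-domain mixing: paste high-confidence predicted
    target-domain instances onto a source image. `target_instances` is a list
    of (mask (H, W) bool, class_id, confidence) triples from the teacher model.
    Returns the mixed image and an updated semantic map."""
    mixed_img = source_img.copy()
    mixed_sem = source_sem.copy()
    for mask, cls_id, conf in target_instances:
        if conf < conf_thresh:       # reject low-confidence instances
            continue                 # to limit injected confirmation bias
        mixed_img[mask] = target_img[mask]
        mixed_sem[mask] = cls_id
    return mixed_img, mixed_sem
```

Pasting target instances onto source backgrounds keeps every object in the mixed image labeled, which is what keeps the instance pseudo-labels exhaustive.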
2404.03736 Report SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer Zijie Wu, Chaohui Yu, Yanqin Jiang, Chenjie Cao, Fan Wang, Xiang Bai Recent advances in 2D/3D generative models enable the generation of dynamic 3D objects from a single-view video. Existing approaches utilize score distillation sampling to form the dynamic scene as dynamic NeRF or dense 3D Gaussians. However, these methods struggle to strike a balance among reference view alignment, spatio-temporal consistency, and motion fidelity under single-view conditions due to the implicit nature of NeRF or the intricate dense Gaussian motion prediction. To address these issues, this paper proposes an efficient, sparse-controlled video-to-4D framework named SC4D, that decouples motion and appearance to achieve superior video-to-4D generation. Moreover, we introduce Adaptive Gaussian (AG) initialization and Gaussian Alignment (GA) loss to mitigate shape degeneration issue, ensuring the fidelity of the learned motion and shape. Comprehensive experimental results demonstrate that our method surpasses existing methods in both quality and efficiency. In addition, facilitated by the disentangled modeling of motion and appearance of SC4D, we devise a novel application that seamlessly transfers the learned motion onto a diverse array of 4D entities according to textual descriptions. This paper introduces SC4D, a novel video-to-4D generation framework that leverages sparse control points and dense 3D Gaussians to disentangle motion and appearance for improved 4D object generation from single-view videos. Existing video-to-4D methods struggle to balance reference view alignment, spatio-temporal consistency, and motion fidelity due to limitations in representing dynamic 3D objects. SC4D employs a two-stage approach: a coarse stage to initialize sparse control points and their motion, followed by a fine stage that optimizes dense Gaussians guided by the control points using Linear Blend Skinning. Adaptive Gaussian initialization and Gaussian Alignment loss are introduced to mitigate shape degeneration during training. SC4D outperforms state-of-the-art methods in both qualitative and quantitative evaluations, demonstrating superior reference view alignment, spatio-temporal consistency, and motion fidelity. SC4D exhibits efficiency in training, requiring significantly less time compared to existing approaches. The disentangled motion representation enables a novel application for motion transfer, allowing the generation of diverse 4D entities with consistent motion based on text descriptions. SC4D's reliance on novel view synthesis models like Zero123 limits its performance on complex objects where such models may struggle. The current framework is restricted to static camera scenarios and does not account for moving camera viewpoints, presenting an area for future exploration. video-to-4d generation, dynamic gaussian splatting, motion transfer, sparse control points, shape degeneration
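A sketch of driving dense Gaussian centers with sparse control points via Linear Blend Skinning is below. Inverse-distance weights over the K nearest control points are an assumption made for the sketch; SC4D's actual skinning weights and per-control-point transforms come from its learned motion representation.

```python
import torch

def lbs_drive_gaussians(gauss_xyz, ctrl_xyz, ctrl_rot, ctrl_trans, K=4):
    """Deform dense Gaussian centers with sparse control points via LBS.
    gauss_xyz: (G, 3) canonical Gaussian centers; ctrl_xyz: (C, 3) canonical
    control points; ctrl_rot: (C, 3, 3) and ctrl_trans: (C, 3) per-frame rigid
    transforms of each control point. Returns (G, 3) deformed centers."""
    d = torch.cdist(gauss_xyz, ctrl_xyz)                    # (G, C) distances
    knn_d, knn_idx = d.topk(K, dim=1, largest=False)        # K nearest controls
    w = 1.0 / (knn_d + 1e-6)
    w = w / w.sum(dim=1, keepdim=True)                      # (G, K) skinning weights
    rel = gauss_xyz[:, None] - ctrl_xyz[knn_idx]            # (G, K, 3) canonical offsets
    moved = torch.einsum('gkij,gkj->gki', ctrl_rot[knn_idx], rel) \
        + ctrl_xyz[knn_idx] + ctrl_trans[knn_idx]           # (G, K, 3) candidate positions
    return (w[..., None] * moved).sum(dim=1)                # weighted blend
```

Because appearance lives in the dense Gaussians and motion in the control points, the same control-point trajectories can be re-bound to new Gaussians, which is what enables the motion-transfer application.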
2404.03658 Report Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning Rui Li, Tobias Fischer, Mattia Segu, Marc Pollefeys, Luc Van Gool, Federico Tombari Recovering the 3D scene geometry from a single view is a fundamental yet ill-posed problem in computer vision. While classical depth estimation methods infer only a 2.5D scene representation limited to the image plane, recent approaches based on radiance fields reconstruct a full 3D representation. However, these methods still struggle with occluded regions since inferring geometry without visual observation requires (i) semantic knowledge of the surroundings, and (ii) reasoning about spatial context. We propose KYN, a novel method for single-view scene reconstruction that reasons about semantic and spatial context to predict each point's density. We introduce a vision-language modulation module to enrich point features with fine-grained semantic information. We aggregate point representations across the scene through a language-guided spatial attention mechanism to yield per-point density predictions aware of the 3D semantic context. We show that KYN improves 3D shape recovery compared to predicting density for each 3D point in isolation. We achieve state-of-the-art results in scene and object reconstruction on KITTI-360, and show improved zero-shot generalization compared to prior work. Project page: https://ruili3.github.io/kyn. Proposes KYN, a single-view scene reconstruction method that leverages semantic and spatial context to predict 3D point density. Existing methods struggle to accurately reconstruct occluded regions due to a lack of semantic understanding and spatial context. Introduces a vision-language modulation module to enrich point features with semantic information and a vision-language spatial attention mechanism to aggregate these features across the scene. KYN achieves state-of-the-art scene and object reconstruction results on KITTI-360. KYN exhibits superior performance in reconstructing occluded regions compared to previous methods. KYN demonstrates improved zero-shot generalization on the DDAD dataset. The performance of KYN is limited by the quality of the semantic segmentation. The memory footprint of the spatial attention mechanism can be further optimized. single-view reconstruction, 3d scene understanding, vision-language learning, spatial attention, semantic segmentation
2404.03657 Report OW-VISCap: Open-World Video Instance Segmentation and Captioning Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing Open-world video instance segmentation is an important video understanding task. Yet most methods either operate in a closed-world setting, require an additional user-input, or use classic region-based proposals to identify never before seen objects. Further, these methods only assign a one-word label to detected objects, and don't generate rich object-centric descriptions. They also often suffer from highly overlapping predictions. To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video. For this, we introduce open-world object queries to discover never before seen objects without additional user-input. We generate rich and descriptive object-centric captions for each detected object via a masked attention augmented LLM input. We introduce an inter-query contrastive loss to ensure that the object queries differ from one another. Our generalized approach matches or surpasses state-of-the-art on three tasks: open-world video instance segmentation on the BURST dataset, dense video object captioning on the VidSTG dataset, and closed-world video instance segmentation on the OVIS dataset. Proposes OW-VISCap, an approach to jointly segment, track, and caption both previously seen and unseen objects in a video, addressing limitations of existing open-world video instance segmentation methods. Open-world video instance segmentation is crucial for applications like autonomous systems and AR/VR, but existing methods have limitations in handling unseen objects and generating rich descriptions. Introduces open-world object queries to discover unseen objects, uses a masked attention augmented LLM for object-centric captioning, and implements an inter-query contrastive loss to ensure diverse object queries. Achieves state-of-the-art performance on uncommon categories in the BURST dataset for open-world video instance segmentation. Shows significant improvement in captioning accuracy for detected objects on the VidSTG dataset for dense video object captioning. Performs competitively on closed-world video instance segmentation on the OVIS dataset. Object detection and caption generation struggle with small objects or objects under prolonged occlusion. Future work includes exploring stronger object discovery strategies, improved caption generators, and integrating more robust object trackers. open-world video instance segmentation, object-centric captioning, open-world object queries, masked attention, contrastive loss
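The inter-query contrastive loss can be sketched as an InfoNCE objective in which each query is its own positive and every other query in the frame is a negative; the temperature and the L2 normalization are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def inter_query_contrastive_loss(queries: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Encourage object queries to differ from one another.
    queries: (Q, D) query embeddings for one frame. The diagonal of the
    similarity matrix is the positive for each query; off-diagonal entries
    (other queries) act as negatives and get pushed down."""
    q = F.normalize(queries, dim=-1)
    logits = q @ q.T / temperature                       # (Q, Q) pairwise similarities
    labels = torch.arange(q.shape[0], device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

Keeping queries well separated reduces the highly overlapping predictions the summary mentions, since near-duplicate queries would otherwise latch onto the same object.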
2404.03654 Report RaFE: Generative Radiance Fields Restoration Zhongkai Wu, Ziyu Wan, Jing Zhang, Jing Liao, Dong Xu NeRF (Neural Radiance Fields) has demonstrated tremendous potential in novel view synthesis and 3D reconstruction, but its performance is sensitive to input image quality, and it struggles to achieve high-fidelity rendering when provided with low-quality, sparse input viewpoints. Previous methods for NeRF restoration are tailored to a specific degradation type, ignoring the generality of restoration. To overcome this limitation, we propose a generic radiance fields restoration pipeline, named RaFE, which applies to various types of degradations, such as low resolution, blurriness, noise, compression artifacts, or their combinations. Our approach leverages the success of off-the-shelf 2D restoration methods to recover the multi-view images individually. Instead of reconstructing a blurred NeRF by averaging inconsistencies, we introduce a novel approach using Generative Adversarial Networks (GANs) for NeRF generation to better accommodate the geometric and appearance inconsistencies present in the multi-view images. Specifically, we adopt a two-level tri-plane architecture, where the coarse level remains fixed to represent the low-quality NeRF, and a fine-level residual tri-plane to be added to the coarse level is modeled as a distribution with a GAN to capture potential variations in restoration. We validate RaFE on both synthetic and real cases for various restoration tasks, demonstrating superior performance in both quantitative and qualitative evaluations, surpassing other 3D restoration methods specific to a single task. Please see our project website https://zkaiwu.github.io/RaFE-Project/. This supplementary material provides additional training details and experimental results for the paper 'RaFE: Generative Radiance Fields Restoration'. The main paper proposes a novel approach for restoring degraded images within a generative radiance field framework. This supplementary material aims to enhance the understanding and validity of the proposed method. The supplementary material details the training process for both the overall pipeline and the pre-trained coarse NeRF. It also presents ablation studies on the impact of the residual coarse NeRF, view direction conditioning, and patch sampling strategy. Additionally, the material analyzes the method's performance on NeRF-like degradation and elaborates on the calculation of the diversity score. Residual coarse NeRF aids in better geometry awareness and detailed rendering. View direction conditioning contributes to realistic reflections on non-Lambertian surfaces. The proposed Beta-based patch sampling strategy leads to more stable training and improved rendering quality compared to uniform sampling. The paper doesn't discuss the computational cost of the proposed method. The paper mainly focuses on visual quality; a quantitative analysis with different degradation levels could be beneficial. Exploring the generalization of RaFE to more complex real-world scenarios with severe degradations presents an exciting avenue for future work. generative radiance fields, image restoration, nerf, deep learning, computer vision
2404.03653 Report CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL in two text-to-image alignment benchmarks and achieves state-of-the-art performance. CoMat, an end-to-end fine-tuning strategy for text-to-image diffusion models, improves text-image alignment by incorporating an image-to-text concept matching mechanism. Existing diffusion models struggle to fully utilize text prompts, leading to misalignment between generated images and complex prompts. CoMat leverages a pre-trained image captioning model to guide the diffusion model during training. It identifies missing concepts and encourages the model to revisit ignored text tokens, improving alignment. It also includes an attribute concentration module to enhance attribute binding and a fidelity preservation component to maintain image quality. CoMat-SDXL significantly outperforms the SDXL baseline and commercial models in text-image alignment benchmarks. Both concept matching and attribute concentration modules contribute to performance gains. Using a pre-trained UNet as a discriminator effectively preserves generation fidelity during fine-tuning. Effectively incorporating Multimodal Large Language Models (MLLMs) for finer-grained alignment and fidelity remains under-explored. Adapting CoMat to the 3D domain for improved text-to-3D generation alignment is a potential future direction. text-to-image generation, diffusion model, text-image alignment, concept matching, attribute binding
2404.03652 Report The More You See in 2D, the More You Perceive in 3D Xinyang Han, Zelin Gao, Angjoo Kanazawa, Shubham Goel, Yossi Gandelsman Humans can infer 3D structure from 2D images of an object based on past experience and improve their 3D understanding as they see more images. Inspired by this behavior, we introduce SAP3D, a system for 3D reconstruction and novel view synthesis from an arbitrary number of unposed images. Given a few unposed images of an object, we adapt a pre-trained view-conditioned diffusion model together with the camera poses of the images via test-time fine-tuning. The adapted diffusion model and the obtained camera poses are then utilized as instance-specific priors for 3D reconstruction and novel view synthesis. We show that as the number of input images increases, the performance of our approach improves, bridging the gap between optimization-based prior-less 3D reconstruction methods and single-image-to-3D diffusion-based methods. We demonstrate our system on real images as well as standard synthetic benchmarks. Our ablation studies confirm that this adaption behavior is key for more accurate 3D understanding. SAP3D: a system for 3D object reconstruction and novel view synthesis from an arbitrary number of unposed images, improving its performance as the number of input images increases. Existing methods struggle to reconstruct accurate 3D from a few unposed images and cannot leverage additional views to improve. This system aims to bridge the gap between single-view and multi-view reconstruction by effectively utilizing any number of input images. The system uses a pre-trained view-conditioned diffusion model and camera pose estimator. It first estimates coarse camera poses, then jointly fine-tunes the diffusion model on input images and refines camera poses. Finally, it performs 3D reconstruction via a NeRF and novel view synthesis by sampling the adapted diffusion model. 3D reconstruction quality (geometry and appearance) improves as the number of input views grows. Novel view synthesis accuracy increases with more input images, demonstrating improved consistency and detail. Test-time adaptation and using a large-scale dataset for camera pose estimation are crucial for performance. Camera pose parametrization is limited by the pre-trained diffusion model. The system relies on an optimization stage when multiple input images are provided, hindering real-time applicability. 3d reconstruction, novel view synthesis, test-time adaptation, diffusion models, camera pose estimation
2404.03650 Report OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, Federico Tombari Large visual-language models (VLMs), like CLIP, enable open-set image segmentation to segment arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, i.e., where models can only segment classes from a pre-defined training set. More recently, first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of the visual-language models. Indeed, point clouds and 3D meshes typically have a lower resolution than images and the reconstructed 3D scene geometry might not project well to the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF, which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF; however, our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. Our OpenNeRF further leverages NeRF's ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU. This paper presents OpenNeRF, a novel neural radiance field (NeRF) based approach for open-set 3D scene segmentation, which distills pixel-aligned CLIP features into the NeRF and leverages NeRF's view synthesis capabilities to extract additional visual-language features from novel views. Open-set 3D scene segmentation is crucial for robots interacting with unseen environments or AR/VR applications where training labels are scarce, as it allows segmentation of arbitrary concepts beyond pre-defined classes. OpenNeRF utilizes a NeRF architecture to encode open-set features, trained on posed RGB images and supervised by pre-computed 2D open-set feature maps. It estimates confidence in existing features and leverages NeRF's view synthesis to generate novel views of low-confidence regions, extracting additional features. OpenNeRF achieves state-of-the-art performance on open-vocabulary 3D segmentation, outperforming mesh-based OpenScene and NeRF-based LERF by at least +4.9 mIoU on the Replica dataset. The study shows NeRF-based representations are better at detecting small, long-tail objects compared to mesh-based methods. Incorporating pixel-aligned CLIP features from novel views significantly improves segmentation performance. Many long-tail classes remain undetected, highlighting the difficulty of open-scene segmentation, especially for less frequent categories. Future work could explore alternative confidence estimation techniques and novel view selection strategies for further improvement. open-set 3d scene segmentation, neural radiance fields (nerf), vision-language models (vlms), clip, novel view synthesis
2404.03645 Report Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation Shuting He, Henghui Ding Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video-level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot well comprehend motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module to make static cues and motion cues perform their distinct role, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable $\textbf{9.2%}$ $\mathcal{J\&F}$ improvement on the challenging $\textbf{MeViS}$ dataset. Code is available at https://github.com/heshuting555/DsHmp. This paper presents a novel approach for referring video segmentation that decouples static and motion perception to enhance the understanding of temporal information. Existing methods struggle to accurately segment objects in videos based on natural language descriptions, especially when motion cues are crucial for identification. The approach decouples the input sentence into static and motion cues. It uses a hierarchical motion perception module to capture short-term and long-term motion patterns. It also employs contrastive learning to distinguish visually similar objects based on their motion. Achieves state-of-the-art performance on five referring video segmentation datasets (MeViS, Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences). Shows significant improvement on the challenging MeViS dataset, outperforming previous methods by a large margin (9.2% in J&F). Demonstrates the effectiveness of decoupling static and motion cues, hierarchical motion perception, and contrastive learning through ablation studies. The performance gain on datasets with less emphasis on motion is relatively small. Future work could explore incorporating more sophisticated language models for richer representation of motion cues. referring video segmentation, motion understanding, contrastive learning, hierarchical motion perception, computer vision
2404.03635 Report WorDepth: Variational Language Prior for Monocular Depth Estimation Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text description(s) is similarly ill-posed, i.e. spatial arrangements of objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we focus on monocular depth estimation, the problem of predicting a dense depth map from a single image, but with an additional text caption describing the scene. To this end, we begin by encoding the text caption as a mean and standard deviation; using a variational framework, we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map, we encode the given image through a conditional sampler that samples from the latent space of the variational text encoder, which is then decoded to the output depth map. Our approach is trained alternatingly between the text and image branches: in one optimization step, we predict the mean and standard deviation from the text description and sample from a standard Gaussian, and in the other, we sample using a (image) conditional sampler. Once trained, we directly predict depth from the encoded text using the conditional sampler. We demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where we show that language can consistently improve performance in both. WorDepth is a novel variational framework for monocular depth estimation that leverages language as a prior to improve metric-scaled depth prediction from single images. Monocular depth estimation suffers from inherent scale ambiguity. Language descriptions, while also ambiguous in spatial layout, provide strong priors on object scales, offering complementary information to resolve this ambiguity. The method uses a text-VAE to learn a latent distribution of plausible depth maps from text captions. An image-based conditional sampler then samples from this distribution, conditioned on the input image, to predict the most probable depth map. WorDepth achieves state-of-the-art results on NYU Depth V2 and KITTI datasets. The method shows significant improvement in threshold accuracy (δ<1.25), indicating better scale estimation. Qualitative results demonstrate improved depth prediction accuracy for objects mentioned in text descriptions. The method's performance relies on the accuracy and specificity of text captions, making it susceptible to inaccuracies from the image captioner. Vague or incorrect captions can lead to suboptimal depth predictions. monocular depth estimation, language prior, variational autoencoder (vae), conditional sampler, scale ambiguity
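The alternating training described above boils down to reparameterized sampling from the text-conditioned prior: in the text step the noise is standard Gaussian, and in the image step it comes from the image-conditional sampler. A minimal sketch, with shapes and names as assumptions (not the authors' code):

```python
from typing import Optional
import torch

def sample_depth_latent(mu: torch.Tensor, logvar: torch.Tensor,
                        z_img: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Reparameterized sample z = mu + eps * sigma from the text prior.
    Text branch: eps ~ N(0, I).  Image branch: eps is produced by an
    image-conditional sampler (z_img)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std) if z_img is None else z_img
    return mu + eps * std
```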
2404.03620 Report LCM-Lookahead for Encoder-based Text-to-Image Personalization Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by tuning them with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment. We further explore the use of attention sharing mechanisms and consistent data generation for the task of personalization, and find that encoder training can benefit from both. This paper introduces LCM-Lookahead, a novel mechanism using fast sampling models to apply image-space losses to diffusion models, leading to improved identity preservation in personalized text-to-image generation. Existing encoder-based text-to-image personalization methods struggle to balance prompt alignment and identity preservation, especially in challenging scenarios like stylization. The authors leverage a pretrained LCM model as a shortcut to create high-quality previews of denoised images. This allows them to apply an identity loss during encoder training, improving identity fidelity without sacrificing layout diversity or prompt alignment. Additionally, they propose using a consistent, synthetic dataset generated with SDXL-Turbo and integrating an extended self-attention mechanism into the encoder. The proposed method achieves superior identity preservation and prompt alignment compared to the baseline IP-Adapter. Using an LCM-based shortcut for identity loss leads to noticeable improvements over direct approximations. A novel consistent dataset generated with SDXL-Turbo significantly improves prompt alignment. The model may still exhibit biases present in the backbone model or the diffusion model itself. The performance of tuning-free encoders, while improved, still falls short of optimization-based methods, especially with out-of-domain inputs. text-to-image personalization, face generation, diffusion models, lcm, consistency models
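A minimal sketch of the lookahead mechanism, assuming generic placeholders for the fast denoiser, latent decoder, and identity encoder (none of these names come from the paper): the fast model previews the clean result so an image-space identity loss can be backpropagated.

```python
import torch
import torch.nn.functional as F

def lookahead_identity_loss(x_t, t, fast_denoiser, decode, id_encoder, ref_id_emb):
    # One/few-step preview of the denoised latent via a distilled (LCM-like) model.
    x0_preview = fast_denoiser(x_t, t)
    image = decode(x0_preview)                 # latent -> image space
    pred_id = id_encoder(image)                # e.g. a face-recognition embedding
    return 1.0 - F.cosine_similarity(pred_id, ref_id_emb, dim=-1).mean()
```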
2404.03613 Report Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting Jeongmin Bae, Seoha Kim, Youngsik Yun, Hahyun Lee, Gun Bang, Youngjung Uh As 3D Gaussian Splatting (3DGS) provides fast and high-quality novel view synthesis, it is a natural extension to deform a canonical 3DGS to multiple frames. However, previous works fail to accurately reconstruct dynamic scenes, especially 1) static parts moving along nearby dynamic parts, and 2) some dynamic areas are blurry. We attribute the failure to the wrong design of the deformation field, which is built as a coordinate-based function. This approach is problematic because 3DGS is a mixture of multiple fields centered at the Gaussians, not just a single coordinate-based framework. To resolve this problem, we define the deformation as a function of per-Gaussian embeddings and temporal embeddings. Moreover, we decompose deformations as coarse and fine deformations to model slow and fast movements, respectively. Also, we introduce an efficient training strategy for faster convergence and higher quality. Project page: https://jeongminb.github.io/e-d3dgs/ This paper introduces a novel per-Gaussian embedding-based deformation method for deformable 3D Gaussian Splatting, improving dynamic scene reconstruction. Existing field-based deformable 3D Gaussian Splatting methods suffer from entanglement of Gaussian deformations, leading to inaccurate reconstructions of dynamic scenes. The method defines deformation as a function of per-Gaussian and temporal embeddings, uses coarse-fine deformation to model different motion scales, and employs an efficient training strategy for faster convergence and better quality. The approach improves deformation quality and captures fine details in dynamic regions. It outperforms baselines on datasets like Neural 3D Video and Technicolor Light Field. The method demonstrates superior reconstruction quality even under challenging camera settings, as shown with the HyperNeRF dataset. The method may exhibit blurriness in areas with significant motion between frames. Rendering speed can be slower compared to existing Gaussian Splatting methods. 3d gaussian splatting, dynamic scene reconstruction, novel view synthesis, per-gaussian deformation, deformable neural radiance fields
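The sketch below illustrates the per-Gaussian idea: every Gaussian owns a learnable embedding that, together with a temporal embedding, is decoded into offsets for position, rotation, and scale. Dimensions, the output parametrization, and the single (rather than coarse/fine) decoder are simplifying assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PerGaussianDeformation(nn.Module):
    """Embedding-based deformation head (illustrative sketch)."""
    def __init__(self, num_gaussians: int, g_dim: int = 32, t_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.g_embed = nn.Embedding(num_gaussians, g_dim)
        self.decoder = nn.Sequential(
            nn.Linear(g_dim + t_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),   # d_xyz, d_rotation (quaternion), d_scale
        )

    def forward(self, gauss_ids: torch.Tensor, t_embed: torch.Tensor):
        # gauss_ids: (N,) Gaussian indices; t_embed: (N, t_dim) temporal embedding
        feat = torch.cat([self.g_embed(gauss_ids), t_embed], dim=-1)
        d = self.decoder(feat)
        return d[:, :3], d[:, 3:7], d[:, 7:]
```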
2404.03575 Report DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling Haoran Li, Haolin Shi, Wenli Zhang, Wenjun Wu, Yong Liao, Lin Wang, Lik-hang Lee, Pengyuan Zhou Text-to-3D scene generation holds immense potential for the gaming, film, and architecture sectors. Despite significant progress, existing methods struggle with maintaining high quality, consistency, and editing flexibility. In this paper, we propose DreamScene, a 3D Gaussian-based novel text-to-3D scene generation framework, to tackle the aforementioned three challenges mainly via two strategies. First, DreamScene employs Formation Pattern Sampling (FPS), a multi-timestep sampling strategy guided by the formation patterns of 3D objects, to form fast, semantically rich, and high-quality representations. FPS uses 3D Gaussian filtering for optimization stability, and leverages reconstruction techniques to generate plausible textures. Second, DreamScene employs a progressive three-stage camera sampling strategy, specifically designed for both indoor and outdoor settings, to effectively ensure object-environment integration and scene-wide 3D consistency. Last, DreamScene enhances scene editing flexibility by integrating objects and environments, enabling targeted adjustments. Extensive experiments validate DreamScene's superiority over current state-of-the-art techniques, heralding its wide-ranging potential for diverse applications. Code and demos will be released at https://dreamscene-project.github.io . DreamScene, a novel text-to-3D scene generation framework, leverages 3D Gaussians and Formation Pattern Sampling (FPS) to achieve high quality, consistency, and editing flexibility. Existing methods struggle with maintaining quality and consistency across viewpoints and lack flexibility in editing generated scenes. DreamScene uses FPS, a multi-timestep sampling method with 3D Gaussian filtering, to create semantically rich representations and plausible textures. A three-stage camera sampling strategy ensures scene-wide consistency. Objects and environments are integrated for flexible editing. DreamScene generates high-quality 3D scenes and objects comparable to or exceeding state-of-the-art methods. It achieves superior scene-wide 3D consistency compared to methods like Text2NeRF and Text2Room. DreamScene allows for flexible editing of object placement and style within the generated scene. Outdoor scene realism is currently limited compared to some inpainting-based methods. Future work will explore depth supervision to enhance realism in outdoor scene generation. text-to-3d, scene generation, 3d gaussian, formation pattern sampling, scene editing
2404.03566 Report PointInfinity: Resolution-Invariant Point Diffusion Models Zixuan Huang, Justin Johnson, Shoubhik Debnath, James M. Rehg, Chao-Yuan Wu We present PointInfinity, an efficient family of point cloud diffusion models. Our core idea is to use a transformer-based architecture with a fixed-size, resolution-invariant latent representation. This enables efficient training with low-resolution point clouds, while allowing high-resolution point clouds to be generated during inference. More importantly, we show that scaling the test-time resolution beyond the training resolution improves the fidelity of generated point clouds and surfaces. We analyze this phenomenon and draw a link to classifier-free guidance commonly used in diffusion models, demonstrating that both allow trading off fidelity and variability during inference. Experiments on CO3D show that PointInfinity can efficiently generate high-resolution point clouds (up to 131k points, 31 times more than Point-E) with state-of-the-art quality. This paper introduces PointInfinity, a resolution-invariant point cloud diffusion model that can be trained on low-resolution point clouds but generate high-resolution point clouds during inference. Current 3D point cloud diffusion models struggle to achieve the realism and diversity seen in 2D image generation due to the computational challenges posed by the large size of point cloud data. PointInfinity uses a two-stream transformer architecture with a fixed-size latent representation for modeling the underlying 3D shape and a variable-sized data representation for handling point clouds of different resolutions. This allows for efficient training on low-resolution data while enabling high-resolution generation. Scaling the test-time resolution beyond the training resolution improves the fidelity of the generated point clouds. PointInfinity outperforms previous state-of-the-art methods in terms of surface generation fidelity and texture quality. PointInfinity demonstrates significant computational efficiency, scaling linearly with input resolution during inference compared to the quadratic scaling of previous methods. The paper focuses on generating point clouds up to a specific resolution and does not explore generation at arbitrarily high resolutions. Future work could explore incorporating other conditioning signals, such as text, to enable more controllable point cloud generation. point cloud generation, diffusion models, resolution invariance, transformer, 3d deep learning
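A compact sketch of a two-stream "read-process-write" block: a fixed-size latent reads from the variable-size point tokens, is processed, and writes back, so compute on the latent stays constant regardless of point-cloud resolution. Layer sizes and the exact block layout are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class TwoStreamBlock(nn.Module):
    """Fixed-size latent <-> variable-size point tokens (illustrative sketch)."""
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.process = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, latent: torch.Tensor, points: torch.Tensor):
        # latent: (B, L, dim) with L fixed; points: (B, N, dim) with N arbitrary
        latent = latent + self.read(latent, points, points)[0]   # read
        latent = self.process(latent)                            # process
        points = points + self.write(points, latent, latent)[0]  # write
        return latent, points
```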
2404.03531 Report COMO: Compact Mapping and Odometry Eric Dexheimer, Andrew J. Davison We present COMO, a real-time monocular mapping and odometry system that encodes dense geometry via a compact set of 3D anchor points. Decoding anchor point projections into dense geometry via per-keyframe depth covariance functions guarantees that depth maps are joined together at visible anchor points. The representation enables joint optimization of camera poses and dense geometry, intrinsic 3D consistency, and efficient second-order inference. To maintain a compact yet expressive map, we introduce a frontend that leverages the covariance function for tracking and initializing potentially visually indistinct 3D points across frames. Altogether, we introduce a real-time system capable of estimating accurate poses and consistent geometry. COMO, a real-time monocular SLAM system that encodes dense geometry via a compact set of 3D anchor points and uses per-keyframe depth covariance functions for 3D consistency. Provides accurate and consistent poses and dense geometry for robotics and AR using the simplicity and efficiency of monocular cameras. A compact set of 3D points is projected into keyframes. Depth covariance functions generate dense depth maps conditioned on the sparse 3D points. Poses and 3D points are jointly optimized by minimizing photometric error. Outperforms state-of-the-art dense SLAM methods on TUM RGBD in terms of trajectory error. Achieves the lowest ATE and highest AUC on ScanNet trajectory estimation among both sparse and dense methods. Produces the most accurate depth maps on Replica and ScanNet while also demonstrating strong depth consistency across neighboring frames. Reliance on photometric error can limit accuracy in scenes with low texture, specularities, and dynamic lighting. The current depth covariance model was trained on ScanNet, and may not generalize to different environments. slam, dense geometry, depth covariance, monocular vision, 3d reconstruction
2404.03477 Report Towards Automated Movie Trailer Generation Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, Bernard Ghanem Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this process, we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. Our approach draws inspiration from machine translation techniques and models the movies and trailers as sequences of shots, thus formulating the trailer generation problem as a sequence-to-sequence task. We introduce Trailer Generation Transformer (TGT), a deep-learning framework utilizing an encoder-decoder architecture. TGT movie encoder is tasked with contextualizing each movie shot representation via self-attention, while the autoregressive trailer decoder predicts the feature representation of the next trailer shot, accounting for the relevance of shots' temporal order in trailers. Our TGT significantly outperforms previous methods on a comprehensive suite of metrics. This paper proposes Trailer Generation Transformer (TGT), a novel deep-learning framework for automatic movie trailer generation by formulating it as a sequence-to-sequence learning problem, addressing limitations of prior shot classification or ranking based approaches. Movie trailers are crucial for marketing but costly and time-consuming to create manually. Automating trailer generation can significantly streamline this process, benefiting both studios and audiences. TGT utilizes an encoder-decoder architecture. The encoder contextualizes movie shots using a trailerness encoder and a context encoder. The decoder autoregressively predicts trailer shot features, learning shot composition. A greedy algorithm then selects optimal shots from the movie based on predicted features. TGT significantly outperforms baselines on two new benchmarks built upon MAD and MovieNet datasets, using metrics like precision, recall, F1-score, Levenshtein distance, and sequence length difference. Ablation studies demonstrate the importance of both trailerness and context encoders, as well as the chosen loss functions (reconstruction, trailerness, and KL divergence). Adding text-based movie plot summaries as contextual input further improves the trailer generation performance. The current TGT model does not incorporate dialogue and sound modeling, which are important for fine-grained trailer editing. Future work can explore incorporating these elements and further enhance TGT's capabilities. trailer generation, sequence-to-sequence learning, transformer, video understanding, deep learning
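The final retrieval step can be pictured as a greedy nearest-neighbor search: each predicted trailer-shot feature is matched to the most similar unused movie shot. The sketch below shows that selection step only, under the assumption of cosine similarity; it is not the authors' exact algorithm.

```python
import torch
import torch.nn.functional as F

def greedy_shot_selection(pred_feats: torch.Tensor, movie_feats: torch.Tensor):
    # pred_feats: (T, d) decoder outputs; movie_feats: (M, d) encoded movie shots
    sim = F.normalize(pred_feats, dim=-1) @ F.normalize(movie_feats, dim=-1).T
    used = torch.zeros(sim.size(1), dtype=torch.bool)
    chosen = []
    for t in range(sim.size(0)):
        idx = int(sim[t].masked_fill(used, float("-inf")).argmax())
        chosen.append(idx)
        used[idx] = True            # each movie shot is used at most once
    return chosen
```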
2404.03421 Report Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View Andreea Dogaru, Mert Özer, Bernhard Egger Single-view 3D reconstruction is currently approached from two dominant perspectives: reconstruction of scenes with limited diversity using 3D data supervision or reconstruction of diverse singular objects using large image priors. However, real-world scenarios are far more complex and exceed the capabilities of these methods. We therefore propose a hybrid method following a divide-and-conquer strategy. We first process the scene holistically, extracting depth and semantic information, and then leverage a single-shot object-level method for the detailed reconstruction of individual components. By following a compositional processing approach, the overall framework achieves full reconstruction of complex 3D scenes from a single image. We purposely design our pipeline to be highly modular by carefully integrating specific procedures for each processing step, without requiring an end-to-end training of the whole system. This enables the pipeline to naturally improve as future methods can replace the individual modules. We demonstrate the reconstruction performance of our approach on both synthetic and real-world scenes, comparing favorable against prior works. Project page: https://andreeadogaru.github.io/Gen3DSR. This paper proposes a novel modular framework for generalizable 3D scene reconstruction from a single RGB image, leveraging a divide-and-conquer strategy with off-the-shelf models for depth estimation, entity segmentation, and object reconstruction. Reconstructing complex real-world scenes from a single view is a challenging task with significant implications for various applications. Existing methods are often limited in scope or generalization ability. This work addresses these limitations by proposing a more versatile and robust approach. The proposed method analyzes the input scene holistically, extracting depth, camera parameters, and segmenting entities. Foreground objects are processed individually, undergoing amodal completion and single-view reconstruction before being integrated into the final scene using depth guides. The background is modeled separately. The method achieves state-of-the-art results on benchmark datasets, outperforming existing approaches in terms of accuracy and visual quality. The modular design allows for incremental improvements as individual components can be easily replaced with more advanced models in the future. The framework exhibits strong generalization ability, effectively reconstructing scenes with diverse objects and layouts, even on real-world images. The method relies on accurate depth estimation, and errors in depth can propagate to the final reconstruction. The current implementation relies on a separate object reconstruction model that may not generalize to unseen object categories. 3d scene reconstruction, single-view, compositional, amodal completion, depth estimation
2404.03413 Report MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled in translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-video does not only consider visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks respectively. Our models and code have been made publicly available here https://vision-cair.github.io/MiniGPT4-video/ This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding, capable of processing both temporal visual and textual data to understand and answer queries about videos. Existing LLMs struggle to effectively capture temporal information from videos, limiting their ability to understand dynamic visual content. This work aims to adapt LLMs for comprehending the complexities of video sequences, combining visual and textual information for enhanced understanding. The methodology involves subsampling video frames, aligning them with textual descriptions using a pretrained EVA-CLIP model, and mapping them into the LLM space. By concatenating visual and text tokens for each frame, the LLM gains a comprehensive understanding of the video's content. The model is trained in three stages: image-text pair pretraining, video-text pair pretraining, and video question answering instruction finetuning. MiniGPT4-Video outperforms previous state-of-the-art methods on the Video-ChatGPT benchmark across all five evaluation dimensions when subtitles are provided. The model achieves significant improvements in zero-shot evaluations for open-ended questions on MSVD, MSRVTT, TGIF, and ActivityNet datasets, demonstrating its ability to effectively answer questions based on visual content. Integrating subtitle information significantly boosts performance on the TVQA benchmark for multiple-choice questions, highlighting the model's capacity to leverage both visual and textual cues for enhanced video understanding. MiniGPT4-Video currently faces limitations in processing long videos due to the LLM's context window constraint. Future work will focus on extending the model to handle longer video sequences, addressing this limitation. large language models, video understanding, multimodal learning, video question answering, temporal information processing
2404.03407 Report AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment Chunyi Li, Tengchuan Kou, Yixuan Gao, Yuqin Cao, Wei Sun, Zicheng Zhang, Yingjie Zhou, Zhichao Zhang, Weixia Zhang, Haoning Wu, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai With the rapid advancements in AI-Generated Content (AIGC), AI-Generated Images (AIGIs) have been widely applied in entertainment, education, and social media. However, due to the significant variance in quality among different AIGIs, there is an urgent need for models that consistently match human subjective ratings. To address this issue, we organized a challenge towards AIGC quality assessment on NTIRE 2024 that extensively considers 15 popular generative models, utilizing dynamic hyper-parameters (including classifier-free guidance, iteration epochs, and output image resolution), and gather subjective scores that consider perceptual quality and text-to-image alignment altogether comprehensively involving 21 subjects. This approach culminates in the creation of the largest fine-grained AIGI subjective quality database to date with 20,000 AIGIs and 420,000 subjective ratings, known as AIGIQA-20K. Furthermore, we conduct benchmark experiments on this database to assess the correspondence between 16 mainstream AIGI quality models and human perception. We anticipate that this large-scale quality database will inspire robust quality indicators for AIGIs and propel the evolution of AIGC for vision. The database is released on https://www.modelscope.cn/datasets/lcysyzxdxc/AIGCQA-30K-Image. This paper introduces AIGIQA-20K, the largest fine-grained database for AI-Generated Image Quality Assessment. A standardized quality assessment metric for AI-generated images is crucial due to the increasing prevalence of AIGIs and the variability in their quality. The database was constructed by generating 20,000 AIGIs using 15 T2I models with varying hyperparameters (CFG, iterations, resolution). 21 subjects then rated the perceptual quality and text-to-image alignment of each image, resulting in 420,000 subjective scores. AIGI quality is significantly impacted by the T2I model, prompt, and hyperparameters. Fine-tuning traditional IQA metrics on AIGIQA-20K considerably improves their performance, with some surpassing zero-shot alignment metrics. Zero-shot quality assessment models for AIGIs still require further development, as indicated by the superior performance of fine-tuned models and the leading accuracy of qalign despite its disregard for text-image alignment. The study primarily focuses on image quality and does not encompass other AIGC modalities like video, text, or audio. Future work could investigate the development of more robust zero-shot quality assessment models for AIGIs that can generalize across different generation models and hyperparameters without requiring fine-tuning. ai-generated images, image quality assessment, text-to-image synthesis, subjective quality evaluation, aigiqa-20k
2404.03392 Report Two Tricks to Improve Unsupervised Segmentation Learning Alp Eren Sari, Francesco Locatello, Paolo Favaro We present two practical improvement techniques for unsupervised segmentation learning. These techniques address limitations in the resolution and accuracy of predicted segmentation maps of recent state-of-the-art methods. Firstly, we leverage image post-processing techniques such as guided filtering to refine the output masks, improving accuracy while avoiding substantial computational costs. Secondly, we introduce a multi-scale consistency criterion, based on a teacher-student training scheme. This criterion matches segmentation masks predicted from regions of the input image extracted at different resolutions to each other. Experimental results on several benchmarks used in unsupervised segmentation learning demonstrate the effectiveness of our proposed techniques. This paper introduces two practical tricks to enhance the resolution and accuracy of predicted segmentation maps in unsupervised segmentation learning, specifically addressing limitations in recent state-of-the-art methods. Unsupervised segmentation learning has the potential to be scaled to very large datasets and multiple imaging modalities with limited human effort but current methods are limited in either resolution or rely on complex training schemes. 1. **Guided Filtering Post-Processing**: Refine output masks using guided filtering with input image luminance as guidance, improving accuracy without significant computational overhead. 2. **Multi-Scale Consistency Criterion**: Employ a teacher-student training scheme where the teacher network operates on zoomed-in image regions. The student network processes the whole image, and its predictions for corresponding regions are matched with the teacher's output, enhancing detail. Achieved state-of-the-art (SotA) results in unsupervised saliency segmentation on DUT-OMRON, DUTS-TE, and ECSSD datasets. Introduced two novel and general techniques to enhance resolution of segmentation masks, demonstrating computational efficiency. Showed consistent improvement in segmentation performance across different backbones and when combined with other recent methods. Challenges arise in scenarios with unambiguous saliency or multiple objects, particularly when the background and foreground share visual similarities. Future work could explore extensions for multi-object segmentation and refine object selection mechanisms in complex scenes. unsupervised learning, segmentation, self-supervised learning, guided filtering, multi-scale consistency
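The first trick is plain image post-processing, so it can be reproduced with an off-the-shelf guided filter. A minimal sketch using OpenCV's contrib module; the radius and epsilon values are illustrative, not the paper's settings.

```python
import cv2
import numpy as np

def refine_mask(image_bgr: np.ndarray, soft_mask: np.ndarray,
                radius: int = 8, eps: float = 1e-3) -> np.ndarray:
    """Upsample and refine a soft segmentation mask with a guided filter,
    guided by the image luminance. Requires opencv-contrib-python (cv2.ximgproc)."""
    guide = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    mask = cv2.resize(soft_mask.astype(np.float32), guide.shape[::-1],
                      interpolation=cv2.INTER_LINEAR)
    refined = cv2.ximgproc.guidedFilter(guide, mask, radius, eps)
    return (refined > 0.5).astype(np.uint8)
```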
2404.03349 Report VF-NeRF: Viewshed Fields for Rigid NeRF Registration Leo Segre, Shai Avidan 3D scene registration is a fundamental problem in computer vision that seeks the best 6-DoF alignment between two scenes. This problem was extensively investigated in the case of point clouds and meshes, but there has been relatively limited work regarding Neural Radiance Fields (NeRF). In this paper, we consider the problem of rigid registration between two NeRFs when the position of the original cameras is not given. Our key novelty is the introduction of Viewshed Fields (VF), an implicit function that determines, for each 3D point, how likely it is to be viewed by the original cameras. We demonstrate how VF can help in the various stages of NeRF registration, with an extensive evaluation showing that VF-NeRF achieves SOTA results on various datasets with different capturing approaches such as LLFF and Objaverse. This paper introduces Viewshed Fields (VF), a novel implicit function that identifies 3D points likely to be seen by the original cameras, for rigid registration of Neural Radiance Fields (NeRFs) without known camera positions. NeRF registration is crucial for various applications like 3D scene understanding and reconstruction, but existing methods struggle with finding good camera viewpoints for alignment. The method uses Normalizing Flows to learn a mapping between oriented points (location and viewing direction) and a latent Gaussian distribution during NeRF training. This enables sampling high-visibility points to generate novel views and guide registration optimization. VF-NeRF achieves state-of-the-art results on LLFF, casually captured scenes, and Objaverse datasets, outperforming point cloud and other NeRF registration methods. VF-based initialization using point clouds or photometric scores significantly improves registration accuracy. VF-NeRF demonstrates robustness to noise in oriented point positions, indicating its ability to generalize well for novel view generation. VF-NeRF's reliance on photometric loss can pose challenges in textureless or symmetric scenes. Future work could explore the application of VF-NeRF to dynamic scenes and improve handling of partial overlaps. neural radiance fields, 3d registration, normalizing flows, novel view synthesis, viewshed fields
2404.03242 Report Would Deep Generative Models Amplify Bias in Future Models? Tianwei Chen, Yusuke Hirota, Mayu Otani, Noa Garcia, Yuta Nakashima We investigate the impact of deep generative models on potential social biases in upcoming computer vision models. As the internet witnesses an increasing influx of AI-generated images, concerns arise regarding inherent biases that may accompany them, potentially leading to the dissemination of harmful content. This paper explores whether a detrimental feedback loop, resulting in bias amplification, would occur if generated images were used as the training data for future models. We conduct simulations by progressively substituting original images in COCO and CC3M datasets with images generated through Stable Diffusion. The modified datasets are used to train OpenCLIP and image captioning models, which we evaluate in terms of quality and bias. Contrary to expectations, our findings indicate that introducing generated images during training does not uniformly amplify bias. Instead, instances of bias mitigation across specific tasks are observed. We further explore the factors that may influence these phenomena, such as artifacts in image generation (e.g., blurry faces) or pre-existing biases in the original datasets. This paper investigates the impact of using synthetic images generated by deep generative models, specifically Stable Diffusion, on the potential for social bias amplification in future computer vision models. With the increasing prevalence of AI-generated images online, it's crucial to understand how their inherent biases might affect the training of future models and potentially lead to a harmful feedback loop. The authors progressively replaced original images in the COCO and CC3M datasets with Stable Diffusion generated images. They then trained OpenCLIP and image captioning models on these modified datasets and evaluated their performance and bias. Introducing generated images during training did not consistently amplify bias as initially hypothesized. Bias mitigation was observed in specific tasks, particularly related to gender. The impact of generated images on bias varied, showing amplification, mitigation, no effect, or ambiguous trends depending on the task and the type of bias. The experiments were limited to moderately sized datasets (COCO and CC3M) due to computational constraints, leaving the impact on larger datasets uncertain. Only Stable Diffusion was used for image generation, potentially overlooking insights from other generative models. social bias, deep generative models, dataset contamination, computer vision, stable diffusion
2404.03214 Report LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, Hilde Kuehne Vision Transformers (ViTs), with their ability to model long-range dependencies through self-attention mechanisms, have become a standard architecture in computer vision. However, the interpretability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of ViT layers, considering the gradient itself as the explainability signal. We aggregate the signal over all layers, combining the activations of the last as well as intermediate tokens to produce the merged explainability map. This makes LeGrad a conceptually simple and an easy-to-implement tool for enhancing the transparency of ViTs. We evaluate LeGrad in challenging segmentation, perturbation, and open-vocabulary settings, showcasing its versatility compared to other SotA explainability methods demonstrating its superior spatial fidelity and robustness to perturbations. A demo and the code is available at https://github.com/WalBouss/LeGrad. LeGrad, a novel gradient-based explainability method specifically designed for Vision Transformers (ViTs), leverages the gradient information with respect to the attention maps to provide insights into the model's decision-making process. Existing explainability methods designed for convolutional or feed-forward neural networks are not directly applicable to ViTs due to their architectural differences, making it crucial to develop methods specifically tailored for ViTs. LeGrad computes the gradient of the target class activation with respect to the attention maps of each ViT layer. It then aggregates these layer-wise gradients, discarding negative contributions and normalizing the result, to generate a final explainability heatmap. LeGrad outperforms state-of-the-art explainability methods in object segmentation tasks, achieving a mean Intersection over Union (mIoU) of 58.7% on the ImageNet-Segmentation dataset. In open-vocabulary scenarios, LeGrad demonstrates superior performance on the OpenImagesV7 dataset, with performance gains ranging from 2x to 5x compared to other methods. LeGrad exhibits robustness and adaptability across different ViT architectures, including those with attentional poolers like SigLIP, effectively highlighting relevant image regions for a wide range of object categories and concepts. The authors acknowledge the potential presence of biases and sensitive content within the datasets used to train the ViTs, emphasizing the need for ethical considerations in future research. Future work could explore extensions of LeGrad to other transformer-based architectures beyond ViTs, further broadening its applicability in the field of explainable AI. explainable ai, vision transformers, attention mechanisms, image segmentation, open-vocabulary object detection
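A minimal sketch of the gradient-over-attention recipe described above, assuming the model exposes its per-layer attention tensors (e.g. via forward hooks) and that a scalar score for the target class or text has already been computed; the tensor shapes and the CLS-row readout are assumptions about a typical ViT, not LeGrad's exact code.

```python
import torch

def legrad_style_map(score: torch.Tensor, attn_maps: list) -> torch.Tensor:
    """Gradient of a target score w.r.t. each layer's attention map, positive
    parts kept, averaged over heads and layers, then min-max normalized.
    attn_maps: list of (heads, tokens, tokens) tensors kept in the autograd graph."""
    grads = torch.autograd.grad(score, attn_maps, retain_graph=True)
    per_layer = [g.clamp(min=0).mean(dim=0) for g in grads]  # drop negatives, avg heads
    expl = torch.stack(per_layer).mean(dim=0)                # aggregate over layers
    heat = expl[0, 1:]                                       # CLS query row -> patch tokens
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
```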
2404.03202 Report OmniGS: Omnidirectional Gaussian Splatting for Fast Radiance Field Reconstruction using Omnidirectional Images Longwei Li, Huajian Huang, Sai-Kit Yeung, Hui Cheng Photorealistic reconstruction relying on 3D Gaussian Splatting has shown promising potential in robotics. However, the current 3D Gaussian Splatting system only supports radiance field reconstruction using undistorted perspective images. In this paper, we present OmniGS, a novel omnidirectional Gaussian splatting system, to take advantage of omnidirectional images for fast radiance field reconstruction. Specifically, we conduct a theoretical analysis of spherical camera model derivatives in 3D Gaussian Splatting. According to the derivatives, we then implement a new GPU-accelerated omnidirectional rasterizer that directly splats 3D Gaussians onto the equirectangular screen space for omnidirectional image rendering. As a result, we realize differentiable optimization of the radiance field without the requirement of cube-map rectification or tangent-plane approximation. Extensive experiments conducted in egocentric and roaming scenarios demonstrate that our method achieves state-of-the-art reconstruction quality and high rendering speed using omnidirectional images. To benefit the research community, the code will be made publicly available once the paper is published. This paper introduces OmniGS, a novel system leveraging omnidirectional Gaussian Splatting for fast and efficient radiance field reconstruction from omnidirectional images. Current 3D Gaussian Splatting methods are limited to perspective images. This work enables the use of information-rich omnidirectional images for real-time, high-fidelity 3D scene reconstruction, which is crucial for robotics applications. The authors theoretically analyze spherical camera model derivatives for 3D Gaussian splatting. Based on this, they develop a GPU-accelerated rasterizer directly splatting 3D Gaussians onto equirectangular images, enabling differentiable optimization without cube-map rectification or tangent-plane approximations. OmniGS achieves state-of-the-art photorealistic reconstruction quality on both roaming and egocentric scenes, outperforming NeRF-based methods. OmniGS exhibits superior rendering speed, significantly faster than NeRF-based approaches and previous 3D Gaussian Splatting methods. The method demonstrates strong scalability by producing high-quality perspective views from rendered omnidirectional images, surpassing perspective 3DGS in quality. The current implementation ignores the periodicity of trigonometric functions for speed, potentially limiting quality, which can be addressed in future work. Future work can explore integrating OmniGS with omnidirectional SLAM systems for real-time simultaneous localization and photorealistic mapping in robotics. omnidirectional vision, photorealistic mapping, 3d reconstruction, view synthesis, gaussian splatting
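The core geometric change is projecting 3D Gaussians with a spherical camera model instead of a pinhole one. The sketch below shows the standard point-to-equirectangular mapping such a rasterizer builds on; the axis convention (x right, y down, z forward) is an assumption, and the paper additionally derives the Jacobians needed to splat full Gaussians.

```python
import numpy as np

def project_equirectangular(p_cam: np.ndarray, width: int, height: int):
    """Map a 3D point in camera coordinates to equirectangular pixel coords."""
    x, y, z = p_cam
    r = np.linalg.norm(p_cam)
    lon = np.arctan2(x, z)            # longitude in [-pi, pi]
    lat = np.arcsin(y / r)            # latitude  in [-pi/2, pi/2]
    u = (lon / (2.0 * np.pi) + 0.5) * width
    v = (lat / np.pi + 0.5) * height
    return u, v
```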
2404.03109 Report Many-to-many Image Generation with Auto-regressive Diffusion Models Ying Shen, Yizhe Zhang, Shuangfei Zhai, Lifu Huang, Joshua M. Susskind, Jiatao Gu Recent advancements in image generation have made significant progress, yet existing models present limitations in perceiving and generating an arbitrary number of interrelated images within a broad context. This limitation becomes increasingly critical as the demand for multi-image scenarios, such as multi-view images and visual narratives, grows with the expansion of multimedia platforms. This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images, offering a scalable solution that obviates the need for task-specific solutions across different multi-image scenarios. To facilitate this, we present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images. Utilizing Stable Diffusion with varied latent noises, our method produces a set of interconnected images from a single caption. Leveraging MIS, we learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework. Throughout training on the synthetic MIS, the model excels in capturing style and content from preceding images - synthetic or real - and generates novel images following the captured patterns. Furthermore, through task-specific fine-tuning, our model demonstrates its adaptability to various multi-image generation tasks, including Novel View Synthesis and Visual Procedure Generation. This paper introduces a novel domain-general framework for many-to-many image generation, enabling the production of interrelated image series from a given set of images. Existing image generation models often struggle to perceive and generate an arbitrary number of interconnected images within a broader context, limiting their application in multi-image scenarios like multi-view synthesis and visual narratives. This paper addresses this limitation by proposing a scalable solution applicable across different multi-image generation tasks. The authors propose Many-to-Many Diffusion Model (M2M-DM), an autoregressive model trained on a new large-scale synthetic multi-image dataset named M2M-ImageSet. They explore two architectural variants: SD-M2M, which processes images using a shared U-Net, and DINO-M2M, which leverages DINOv2 for enhanced encoding of preceding images. M2M-DM demonstrates the ability to capture and maintain both content and style consistency across generated image sequences. The model exhibits strong zero-shot generalization capabilities, effectively generating coherent images even when conditioned on real-world images from the MSCOCO dataset, despite being trained solely on synthetic data. Through task-specific fine-tuning, M2M-DM adapts to diverse multi-image generation tasks, including Novel View Synthesis and Visual Procedure Generation, highlighting its versatility and potential for broader application. The model currently faces limitations in generating high-fidelity human faces, potentially due to limitations in the training dataset. A decline in image quality is observed during the generation of prolonged image sequences, suggesting an area for future optimization. image generation, diffusion models, multi-image generation, autoregressive models, novel view synthesis
2404.03042 Report AWOL: Analysis WithOut synthesis using Language Silvia Zuffi, Michael J. Black Many classical parametric 3D shape models exist, but creating novel shapes with such models requires expert knowledge of their parameters. For example, imagine creating a specific type of tree using procedural graphics or a new kind of animal from a statistical shape model. Our key idea is to leverage language to control such existing models to produce novel shapes. This involves learning a mapping between the latent space of a vision-language model and the parameter space of the 3D model, which we do using a small set of shape and text pairs. Our hypothesis is that mapping from language to parameters allows us to generate parameters for objects that were never seen during training. If the mapping between language and parameters is sufficiently smooth, then interpolation or generalization in language should translate appropriately into novel 3D shapes. We test our approach with two very different types of parametric shape models (quadrupeds and arboreal trees). We use a learned statistical shape model of quadrupeds and show that we can use text to generate new animals not present during training. In particular, we demonstrate state-of-the-art shape estimation of 3D dogs. This work also constitutes the first language-driven method for generating 3D trees. Finally, embedding images in the CLIP latent space enables us to generate animals and trees directly from images. Presents AWOL, a method leveraging language to generate novel 3D shapes from existing parametric models, achieving generalization beyond training data by mapping between vision-language models and model parameters. Addresses the limitations of classical parametric 3D shape models requiring expert knowledge for novel shape creation, enabling easy generation of new shapes (e.g., specific tree types, animal breeds) using language. Learns a mapping between CLIP's latent space and parameters of 3D models (e.g., SMAL for animals, TreeGen for trees) using a small dataset of shape and text pairs. Employs a RealNVP model with learned binary masks and a reconstruction loss for training. Generates new dog breeds by interpolating in the shape space, demonstrating realistic variations in size and age. Produces novel animal and tree species not present in the training data, showcasing generalization capabilities. Enables 3D shape generation from both text prompts and images, providing flexible control over the models. Limited to the diversity of the initial training data for the 3D models. Primarily explores qualitative evaluation for generated shapes, lacking extensive quantitative metrics. text-to-3d, 3d shape generation, vision-language models, parametric models, clip
2404.02948 Report PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models Fanxu Meng, Zhaohui Wang, Muhan Zhang As the parameters of LLMs expand, the computational cost of fine-tuning the entire model becomes prohibitive. To address this challenge, we introduce a PEFT method, Principal Singular values and Singular vectors Adaptation (PiSSA), which optimizes a significantly reduced parameter space while achieving or surpassing the performance of full-parameter fine-tuning. PiSSA is inspired by Intrinsic SAID, which suggests that pre-trained, over-parametrized models inhabit a space of low intrinsic dimension. Consequently, PiSSA represents a matrix W within the model by the product of two trainable matrices A and B, plus a residual matrix $W^{res}$ for error correction. SVD is employed to factorize W, and the principal singular values and vectors of W are utilized to initialize A and B. The residual singular values and vectors initialize the residual matrix $W^{res}$, which is kept frozen during fine-tuning. Notably, PiSSA shares the same architecture with LoRA. However, LoRA approximates $\Delta W$ through the product of two matrices, A, initialized with Gaussian noise, and B, initialized with zeros, while PiSSA initializes A and B with the principal singular values and vectors of the original matrix W. PiSSA can better approximate the outcomes of full-parameter fine-tuning at the beginning by changing the essential parts while freezing the "noisy" parts. In comparison, LoRA freezes the original matrix and updates the "noise". This distinction enables PiSSA to converge much faster than LoRA and also to achieve better performance in the end. Due to the same architecture, PiSSA inherits many of LoRA's advantages, such as parameter efficiency and compatibility with quantization. Leveraging a fast SVD method, the initialization of PiSSA takes only a few seconds, inducing negligible cost when switching from LoRA to PiSSA. This paper introduces PiSSA, a novel parameter-efficient fine-tuning (PEFT) method that leverages principal singular values and vectors for adapter initialization, outperforming existing techniques like LoRA. Fine-tuning large language models (LLMs) is computationally expensive. PiSSA addresses this by significantly reducing the number of trainable parameters while maintaining or exceeding the performance of full-parameter fine-tuning. PiSSA utilizes singular value decomposition (SVD) to extract principal components from pre-trained weight matrices. These components initialize adapters, effectively capturing essential model capabilities for efficient fine-tuning. PiSSA consistently outperforms LoRA in fine-tuning across various tasks, models (LLaMA 2-7B, Mistral-7B, Gemma-7B), and datasets (MetaMathQA, CodeFeedback, WizardLM). When combined with quantization techniques, PiSSA significantly reduces quantization errors compared to QLoRA and LoftQ, further improving efficiency. Experiments with different ranks demonstrate that PiSSA converges faster and achieves better performance with fewer trainable parameters than LoRA, highlighting its superior efficiency. Further investigation is needed to assess PiSSA's performance on a broader range of tasks and larger models. Future work includes exploring the combination of PiSSA with LoRA successors and providing a theoretical explanation for its advantages. parameter-efficient fine-tuning, large language models, singular value decomposition, low-rank adaptation, quantization
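A minimal sketch of the SVD-based initialization described above, written in plain PyTorch rather than taken from the released code; the weight size and rank are illustrative:

```python
import torch

def pissa_init(W: torch.Tensor, r: int = 16):
    """Split a frozen pretrained weight W into trainable low-rank factors
    (A, B) initialized from its principal singular components, plus a
    frozen residual built from the remaining components (PiSSA-style)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r].sqrt()                 # (out_features, r)
    B = S[:r].sqrt().unsqueeze(1) * Vh[:r, :]   # (r, in_features)
    W_res = U[:, r:] @ torch.diag(S[r:]) @ Vh[r:, :]   # kept frozen
    return A, B, W_res

# The forward pass becomes x @ (W_res + A @ B).T with only A, B trainable;
# W_res + A @ B reconstructs the original W up to floating-point error.
W = torch.randn(512, 512)
A, B, W_res = pissa_init(W, r=16)
assert torch.allclose(W, W_res + A @ B, atol=1e-3)
```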
2404.02905 Report Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline, improving Frechet inception distance (FID) from 18.65 to 1.80 and inception score (IS) from 80.4 to 356.4, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: Scaling Laws and zero-shot task generalization. We have released all models and codes to promote the exploration of AR/VAR models for visual generation and unified learning. This paper introduces Visual AutoRegressive (VAR) modeling, a novel image generation paradigm that redefines autoregressive learning on images as “next-scale prediction” or “next-resolution prediction”, departing from the conventional “next-token prediction” approach. Existing autoregressive image generation models, while theoretically sound, suffer from limitations such as violation of unidirectional dependencies, disruption of spatial locality, and inefficiency. This hinders their performance and scalability compared to diffusion models. VAR addresses these limitations, enabling autoregressive models to surpass diffusion models in image generation for the first time. VAR employs a multi-scale approach. It first encodes an image into multi-scale token maps using a novel multi-scale VQVAE. Then, a GPT-style transformer generates images autoregressively from coarse to fine scales, predicting the next higher-resolution token map at each step, conditioned on all previous scales. VAR achieves state-of-the-art FID/IS scores on ImageNet 256x256 and 512x512 benchmarks, outperforming both traditional autoregressive models and diffusion models, including DiT. VAR exhibits strong power-law scaling laws similar to LLMs, demonstrating consistent performance improvements with increased model size and computational budget. VAR shows promising zero-shot generalization capabilities for downstream tasks such as image in-painting, out-painting, and editing. The study primarily focuses on the learning paradigm, with the VQVAE architecture and training adopted from a baseline. Exploring advanced VQVAE architectures could further enhance VAR's performance. While demonstrating promise in zero-shot generalization, the current study doesn't delve into text-prompt-based generation. Extending VAR for text-to-image synthesis is a high priority. image generation, autoregressive models, visual transformers, scaling laws, zero-shot learning
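A schematic generation loop for the coarse-to-fine next-scale prediction described above; the interfaces (`transformer`, `vqvae_decoder`) and the scale schedule are assumptions for illustration, not the released VAR code:

```python
import torch

@torch.no_grad()
def generate_next_scale(transformer, vqvae_decoder, class_label,
                        scales=(1, 2, 3, 4, 5, 6, 8, 10, 13, 16)):
    """Autoregression over scales rather than tokens: each step predicts all
    tokens of the next, higher-resolution token map in parallel, conditioned
    on the class label and every coarser map produced so far."""
    token_maps = []
    for side in scales:
        logits = transformer(class_label, token_maps)          # (side*side, vocab)
        probs = torch.softmax(logits, dim=-1)
        tokens = torch.multinomial(probs, num_samples=1).view(side, side)
        token_maps.append(tokens)
    return vqvae_decoder(token_maps)   # multi-scale VQVAE decodes the image
```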
2404.02889 Report Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining Qi Cui, Ruohan Meng, Chaohui Xu, Chip-Hong Chang Ensuring the legal usage of deep models is crucial to promoting trustable, accountable, and responsible artificial intelligence innovation. Current passport-based methods that obfuscate model functionality for license-to-use and ownership verifications suffer from capacity and quality constraints, as they require retraining the owner model for new users. They are also vulnerable to advanced Expanded Residual Block ambiguity attacks. We propose Steganographic Passport, which uses an invertible steganographic network to decouple license-to-use from ownership verification by hiding the user's identity images into the owner-side passport and recovering them from their respective user-side passports. An irreversible and collision-resistant hash function is used to avoid exposing the owner-side passport from the derived user-side passports and increase the uniqueness of the model signature. To safeguard both the passport and model's weights against advanced ambiguity attacks, an activation-level obfuscation is proposed for the verification branch of the owner's model. By jointly training the verification and deployment branches, their weights become tightly coupled. The proposed method supports agile licensing of deep models by providing a strong ownership proof and license accountability without requiring a separate model retraining for the admission of every new user. Experiment results show that our Steganographic Passport outperforms other passport-based deep model protection methods in robustness against various known attacks. This paper proposes Steganographic Passport, a novel method for protecting deep model intellectual property (IP) that allows verification of both model ownership and individual user licenses without retraining. Current passport-based methods for deep model protection require retraining for each new user, limiting their scalability and practicality for licensing scenarios. The method uses an invertible steganographic network to hide user IDs in user-side passports, decoupling license verification from ownership verification. It also employs activation-level obfuscation and a balance loss function to enhance security against attacks. The method achieves high accuracy in both ownership and license verification. It exhibits strong robustness against various attacks, including ownership ambiguity attacks, license ambiguity attacks, and removal attacks. Experimental results demonstrate its superior performance compared to existing passport-based methods. The impact of the choice of activation function on the method's security requires further investigation. Exploring more advanced steganography techniques to enhance the hiding capacity and imperceptibility of user IDs in passports is a promising direction. deep model protection, intellectual property, steganography, passport-based verification, license verification
2404.02883 Report On the Scalability of Diffusion-based Text-to-Image Generation Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for the diffusion based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. The different training settings and expensive training cost make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion based T2I models by performing extensive and rigorous ablations on scaling both denoising backbones and training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets up to 600M images. For model scaling, we find the location and amount of cross attention distinguish the performance of existing UNet designs. Increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel numbers. We then identify an efficient UNet variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side, we show the quality and diversity of the training set matter more than simply dataset size. Increasing caption density and diversity improves text-image alignment performance and the learning efficiency. Finally, we provide scaling functions to predict the text-image alignment performance as functions of the scale of model size, compute and dataset size. This paper investigates the scaling properties of diffusion-based text-to-image models, focusing on the impact of scaling denoising backbones (UNet and Transformer) and training datasets on model performance. Understanding how to effectively scale these models is crucial for improving image generation quality, text-image alignment, and training efficiency. The authors conducted extensive, controlled experiments, training various UNet and Transformer architectures (ranging from 0.4B to 4B parameters) on datasets of up to 600M images. They rigorously ablated model architectures and dataset properties, evaluating performance with metrics like TIFA, ImageReward, CLIP score, FID, and HPSv2. The design of the denoising backbone significantly influences the performance, with SDXL's UNet outperforming others. Increasing transformer blocks in UNet is more parameter-efficient than increasing channel numbers for text-image alignment. Scaling the training data with synthetic captions improves image quality and speeds up convergence. Larger, well-designed models benefit more from increased dataset size. The study provides scaling functions that predict text-image alignment performance based on model size, compute, and dataset size, demonstrating power-law relationships similar to those observed in LLMs. Training Transformers from scratch for image generation is challenging due to the lack of inductive bias compared to UNets, suggesting further research in this area. While the study focuses on scaling existing architectures, exploring novel architectural designs could further improve scaling efficiency. text-to-image synthesis, diffusion models, unet, transformer, scaling laws
2404.02790 Report MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, Sarah Parisot Text-to-image generation has achieved astonishing results, yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering, scene layout conditioning, or image editing techniques which often require hand-drawn masks. Nonetheless, pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards addressing this challenge, we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multilayer, instance-wise RGBA decompositions, and over 100K instance images. To build MuLAn, we developed a training-free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising background and isolated instances. We achieve this through the use of pretrained general-purpose models, and by developing three modules: image decomposition for instance discovery and extraction, instance completion to reconstruct occluded areas, and image re-assembly. We use our pipeline to create MuLAn-COCO and MuLAn-LAION datasets, which contain a variety of image decompositions in terms of style, composition and complexity. With MuLAn, we provide the first photorealistic resource providing instance decomposition and occlusion information for high quality images, opening up new avenues for text-to-image generative AI research. With this, we aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions. MuLAn data resources are available at https://MuLAn-dataset.github.io/. This paper introduces MuLAn, a novel dataset of over 44K images with multi-layer RGBA decompositions, designed to facilitate research in compositional text-to-image generation. Precise controllability and prompt fidelity in text-to-image generation remain challenging. The flat nature of RGB images hinders leveraging the natural instance-level compositionality of scenes. MuLAn addresses this by providing instance decomposition and occlusion information. A novel, three-module pipeline decomposes RGB images into instance-wise RGBA stacks. The modules are: 1) Decomposition: Instance discovery and extraction using object detection, segmentation, and depth estimation. 2) Instance Completion: Reconstruction of occluded areas by leveraging depth, relative occlusion, and text-to-image inpainting. 3) Image Reassembly: Generation of occlusion-aware Alpha layers to build the final RGBA stack. MuLAn is the first dataset of its kind, providing instance decomposition and occlusion information for a large variety of photorealistic scenes and object types. A robust, modular, and training-free pipeline is developed, capable of decomposing single RGB images into instance-wise RGBA stacks. MuLAn's potential is showcased through two applications: RGBA image generation and instance addition image editing, demonstrating superior performance compared to existing methods. The pipeline's performance is limited by the accuracy of current object detection, segmentation, and inpainting models. Future work will focus on improving pipeline performance, increasing MuLAn's size, and exploring human-in-the-loop extensions. text-to-image generation, image decomposition, rgba images, instance segmentation, image editing
2404.02788 Report GenN2N: Generative NeRF2NeRF Translation Xiangyue Liu, Han Xue, Kunming Luo, Ping Tan, Li Yi We present GenN2N, a unified NeRF-to-NeRF translation framework for various NeRF translation tasks such as text-driven NeRF editing, colorization, super-resolution, inpainting, etc. Unlike previous methods designed for individual translation tasks with task-specific schemes, GenN2N achieves all these NeRF editing tasks by employing a plug-and-play image-to-image translator to perform editing in the 2D domain and lifting 2D edits into the 3D NeRF space. Since the 3D consistency of 2D edits may not be assured, we propose to model the distribution of the underlying 3D edits through a generative model that can cover all possible edited NeRFs. To model the distribution of 3D edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes images while decoding NeRFs. The latent space is trained to align with a Gaussian distribution and the NeRFs are supervised through an adversarial loss on its renderings. To ensure the latent code does not depend on 2D viewpoints but truly reflects the 3D edits, we also regularize the latent code through a contrastive learning scheme. Extensive experiments on various editing tasks show GenN2N, as a universal framework, performs as well or better than task-specific specialists while possessing flexible generative power. More results on our project page: https://xiangyueliu.github.io/GenN2N/ GenN2N, a unified NeRF-to-NeRF translation framework for diverse NeRF editing tasks (text-driven editing, colorization, super-resolution, inpainting). Existing NeRF editing methods are task-specific and lack flexibility. This work proposes a universal editing framework leveraging 2D image editing tools while maintaining 3D consistency. The method uses a plug-and-play 2D image-to-image translator for editing. It then trains a 3D VAE-GAN model to capture the distribution of possible 3D edits from the inconsistent 2D results. Contrastive learning is used to disentangle viewpoint from editing. GenN2N achieves state-of-the-art performance on various editing tasks, surpassing task-specific methods. The framework demonstrates good 3D consistency, generating plausible edits across different viewpoints. It shows strong generative capability, enabling diverse editing outcomes from a single input NeRF. The reliance on 2D image editing tools might limit the complexity of achievable 3D edits. The method's performance heavily depends on the quality and consistency of the 2D translator. nerf, nerf editing, 3d scene editing, generative models, image-to-image translation
2404.02747 Report Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, Jürgen Schmidhuber This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after a few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computation complexity, but also maintains model performance. This yields a simple and training-free method called TGATE for efficient generation, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE. This paper presents TGATE, a training-free method that caches and reuses cross-attention outputs in text-to-image diffusion models, significantly reducing computational cost without sacrificing generation quality. Cross-attention in diffusion models, while crucial, is computationally expensive. This work identifies redundancy in cross-attention during later inference stages, enabling substantial efficiency improvements. The authors empirically analyze the convergence of cross-attention maps during inference. They propose TGATE, which caches these maps after convergence and reuses them, bypassing redundant computations. Cross-attention maps converge to a fixed point after a few inference steps, indicating diminishing influence in later stages. TGATE reduces the number of Multiply–Accumulate Operations (MACs) by up to 50% and parameters by 25%, resulting in up to 2x speedup on a commercial GPU. TGATE maintains or even slightly improves FID scores compared to base models, demonstrating effectiveness without sacrificing generation quality. While TGATE improves efficiency and FID scores, visual differences in generated images compared to baselines might be subtle. Future work could explore the optimal gate step for diverse models and prompts, potentially through adaptive mechanisms. diffusion models, text-to-image synthesis, cross-attention, inference efficiency, computational cost
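A minimal, illustrative wrapper showing the caching idea (not the released TGATE implementation; module and argument names are assumptions): cross-attention runs normally during the semantics-planning stage, then its output is cached at the gate step and reused for the remaining fidelity-improving steps:

```python
import torch

class CachedCrossAttention(torch.nn.Module):
    """Wrap a cross-attention module: run it normally before `gate_step`,
    then reuse the last cached output for all later denoising steps."""

    def __init__(self, cross_attn: torch.nn.Module, gate_step: int = 10):
        super().__init__()
        self.cross_attn = cross_attn
        self.gate_step = gate_step
        self.cache = None

    def forward(self, hidden_states, text_emb, step: int):
        if step < self.gate_step:
            out = self.cross_attn(hidden_states, text_emb)
            self.cache = out.detach()   # keep the most recent (converged) output
            return out
        return self.cache               # reuse it; text conditioning is skipped
```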
2404.02733 Report InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation Haofan Wang, Matteo Spinelli, Qixun Wang, Xu Bai, Zekui Qin, Anthony Chen Tuning-free diffusion-based models have demonstrated significant potential in the realm of image personalization and customization. However, despite this notable progress, current models continue to grapple with several complex challenges in producing style-consistent image generation. Firstly, the concept of style is inherently underdetermined, encompassing a multitude of elements such as color, material, atmosphere, design, and structure, among others. Secondly, inversion-based methods are prone to style degradation, often resulting in the loss of fine-grained details. Lastly, adapter-based approaches frequently require meticulous weight tuning for each reference image to achieve a balance between style intensity and text controllability. In this paper, we commence by examining several compelling yet frequently overlooked observations. We then proceed to introduce InstantStyle, a framework designed to address these issues through the implementation of two key strategies: 1) A straightforward mechanism that decouples style and content from reference images within the feature space, predicated on the assumption that features within the same space can be either added to or subtracted from one another. 2) The injection of reference image features exclusively into style-specific blocks, thereby preventing style leaks and eschewing the need for cumbersome weight tuning, which often characterizes more parameter-heavy designs. Our work demonstrates superior visual stylization outcomes, striking an optimal balance between the intensity of style and the controllability of textual elements. Our codes will be available at https://github.com/InstantStyle/InstantStyle. InstantStyle is a novel tuning-free framework for diffusion-based text-to-image models that disentangles style and content in reference images for superior style transfer, enhancing existing adapter-based methods. Existing tuning-free methods for style transfer struggle with style degradation during inversion, content leakage, and laborious weight tuning. This work aims to address these challenges by simplifying style and content decoupling. InstantStyle utilizes two key strategies: 1) Subtracting content text features from reference image features in CLIP space for explicit content removal. 2) Injecting image features solely into style-specific attention blocks within the diffusion model for implicit content-style disentanglement. InstantStyle achieves visually superior style transfer with reduced content leakage compared to state-of-the-art methods. The subtraction strategy effectively mitigates content leakage but may still require manual weight tuning. Injecting features only into style blocks yields the most elegant and effective style transfer, enhancing text controllability by reducing adapter parameters. While effective, content subtraction may still require manual weight tuning. The definition of 'style' can be subjective, necessitating further exploration for a more comprehensive representation. style transfer, text-to-image generation, diffusion models, content-style disentanglement, clip
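A minimal sketch of the feature-subtraction strategy described above, using Hugging Face CLIP as an assumed stand-in for the paper's image encoder; it only illustrates removing the content-text embedding from the reference-image embedding in the shared CLIP space, not the full adapter pipeline:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def style_embedding(reference: Image.Image, content_text: str) -> torch.Tensor:
    """Remove the content described by `content_text` from the reference
    image's CLIP embedding by simple subtraction in the shared space; the
    result would then be injected only into style-specific attention blocks."""
    img_inputs = processor(images=reference, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=[content_text], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)
    return img_emb - txt_emb
```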
2404.02686 Report Design2Cloth: 3D Cloth Generation from 2D Masks Jiali Zheng, Rolandos Alexandros Potamias, Stefanos Zafeiriou In recent years, there has been a significant shift in the field of digital avatar research, towards modeling, animating and reconstructing clothed human representations, as a key step towards creating realistic avatars. However, current 3D cloth generation methods are garment specific or trained completely on synthetic data, hence lacking fine details and realism. In this work, we make a step towards automatic realistic garment design and propose Design2Cloth, a high fidelity 3D generative model trained on a real world dataset from more than 2000 subject scans. To provide vital contribution to the fashion industry, we developed a user-friendly adversarial model capable of generating diverse and detailed clothes simply by drawing a 2D cloth mask. Under a series of both qualitative and quantitative experiments, we showcase that Design2Cloth outperforms current state-of-the-art cloth generative models by a large margin. In addition to the generative properties of our network, we showcase that the proposed method can be used to achieve high quality reconstructions from single in-the-wild images and 3D scans. Dataset, code and pre-trained model will become publicly available. This paper introduces Design2Cloth, a high-fidelity 3D garment generative model trained on a large-scale real-world dataset (DigitalMe) of over 2,000 garments from 2,010 subjects. Current 3D cloth generation methods lack realism due to being garment-specific or trained on synthetic data. Design2Cloth addresses this by using real-world data and a user-friendly approach to generate diverse and detailed clothes. The method leverages a mask encoder and a shape encoder to learn a compact latent space for cloth representation. It employs a triplane generator to decode latent codes into unsigned distance functions, generating 3D clothes. A dual-resolution discriminator enhances detail and realism. Design2Cloth outperforms state-of-the-art methods in generating realistic garments with high-frequency details. It enables smooth interpolation between diverse garment styles and shapes. The model allows 3D garment reconstruction from in-the-wild images and scans, outperforming baselines in accuracy and realism. The reliance on accurate SMPL pose and shape estimation for in-the-wild reconstruction. Potential limitations in capturing the full diversity of real-world garment designs and textures. 3d garment generation, implicit neural representation, real-world cloth dataset, user-friendly design, 3d garment reconstruction
2404.02634 Report 3DStyleGLIP: Part-Tailored Text-Guided 3D Neural Stylization SeungJeh Chung, JooHyun Park, Hyewon Kan, HyeongYeop Kang 3D stylization, which entails the application of specific styles to three-dimensional objects, holds significant commercial potential as it enables the creation of diverse 3D objects with distinct moods and styles, tailored to specific demands of different scenes. With recent advancements in text-driven methods and artificial intelligence, the stylization process is increasingly intuitive and automated, thereby diminishing the reliance on manual labor and expertise. However, existing methods have predominantly focused on holistic stylization, thereby leaving the application of styles to individual components of a 3D object unexplored. In response, we introduce 3DStyleGLIP, a novel framework specifically designed for text-driven, part-tailored 3D stylization. Given a 3D mesh and a text prompt, 3DStyleGLIP leverages the vision-language embedding space of the Grounded Language-Image Pre-training (GLIP) model to localize the individual parts of the 3D mesh and modify their colors and local geometries to align them with the desired styles specified in the text prompt. 3DStyleGLIP is effectively trained for 3D stylization tasks through a part-level style loss working in GLIP's embedding space, supplemented by two complementary learning techniques. Extensive experimental validation confirms that our method achieves significant part-wise stylization capabilities, demonstrating promising potential in advancing the field of 3D stylization. Introduces 3DStyleGLIP, a novel framework for text-driven, part-tailored 3D neural stylization, allowing users to apply distinct styles to different parts of a 3D mesh based on text prompts. Existing 3D stylization methods mainly focus on holistic stylization, limiting the ability to apply different styles to individual object components. 3DStyleGLIP addresses this limitation by enabling part-tailored stylization. Leverages the GLIP model's vision-language embedding space to localize individual mesh parts. Trains a Neural Style Field (NSF) to modify the mesh's colors and local geometries to match the style phrases in the text prompt. Achieves superior part-tailored stylization compared to existing 3D generation and editing methods. Demonstrates consistent and stable stylization outcomes across different random seeds. Outperforms baseline methods in user studies, showcasing better alignment with text descriptions and higher-quality stylization. Currently limited in synthesizing parts based on abstract concepts or emotions (e.g., "delicious hamburger"). Faces challenges with stylizing objects with more than five parts or highly detailed semantic parts. 3d stylization, part-tailored stylization, text-driven manipulation, vision-language model, glip
2404.02617 Report Neural Radiance Fields with Torch Units Bingnan Ni, Huanyu Wang, Dongfeng Bai, Minghe Weng, Dexin Qi, Weichao Qiu, Bingbing Liu Neural Radiance Fields (NeRF) give rise to learning-based 3D reconstruction methods widely used in industrial applications. Although prevalent methods achieve considerable improvements in small-scale scenes, accomplishing reconstruction in complex and large-scale scenes is still challenging. First, the background in complex scenes shows a large variance among different views. Second, the current inference pattern, $i.e.$, a pixel only relies on an individual camera ray, fails to capture contextual information. To solve these problems, we propose to enlarge the ray perception field and build up interactions among sample points. In this paper, we design a novel inference pattern that encourages a single camera ray to possess more contextual information, and models the relationship among sample points on each camera ray. To hold contextual information, a camera ray in our proposed method can render a patch of pixels simultaneously. Moreover, we replace the MLP in neural radiance field models with distance-aware convolutions to enhance the feature propagation among sample points from the same camera ray. To summarize, like a torchlight, a ray in our proposed method renders a patch of the image. Thus, we call the proposed method Torch-NeRF. Extensive experiments on KITTI-360 and LLFF show that the Torch-NeRF exhibits excellent performance. This paper proposes Torch-NeRF, a novel neural radiance field method that enhances contextual information aggregation and sample point interaction for improved 3D reconstruction in complex and large-scale scenes. Existing NeRF methods struggle to capture contextual information and handle background variance in complex scenes, particularly in autonomous driving scenarios where accurate 3D reconstruction is crucial. Torch-NeRF employs a novel inference pattern where each camera ray renders a patch of pixels, enlarging the ray perception field. It also introduces distance-aware convolutions along rays to model relationships between sample points and improve volume smoothness. Torch-NeRF outperforms previous methods on KITTI-360 and LLFF datasets in terms of PSNR and SSIM, demonstrating its effectiveness in complex scenes. The method effectively handles noisy colors and preserves object shapes at scene edges, as shown in qualitative comparisons. Ablation studies validate the contribution of each proposed module, including enlarged ray perception field, distance-aware convolutions, and structural similarity loss. The current implementation discards rendered pixels in a patch except for the center, impacting rendering time. Future work aims to improve the rendering quality of all patch pixels. Further research will focus on enhancing rendering efficiency while maintaining high visual quality. neural radiance fields, 3d reconstruction, autonomous driving, distance-aware convolutions, ray perception field
2404.02514 Report Freditor: High-Fidelity and Transferable NeRF Editing by Frequency Decomposition Yisheng He, Weihao Yuan, Siyu Zhu, Zilong Dong, Liefeng Bo, Qixing Huang This paper enables high-fidelity, transferable NeRF editing by frequency decomposition. Recent NeRF editing pipelines lift 2D stylization results to 3D scenes while suffering from blurry results, and fail to capture detailed structures caused by the inconsistency between 2D editings. Our critical insight is that low-frequency components of images are more multiview-consistent after editing compared with their high-frequency parts. Moreover, the appearance style is mainly exhibited on the low-frequency components, and the content details especially reside in high-frequency parts. This motivates us to perform editing on low-frequency components, which results in high-fidelity edited scenes. In addition, the editing is performed in the low-frequency feature space, enabling stable intensity control and novel scene transfer. Comprehensive experiments conducted on photorealistic datasets demonstrate the superior performance of high-fidelity and transferable NeRF editing. The project page is at \url{https://aigc3d.github.io/freditor}. This paper proposes Freditor, a novel approach for high-fidelity and transferable NeRF editing that leverages frequency decomposition. Existing NeRF editing methods often produce blurry results or lack transferability, limiting their practical applications. Freditor addresses these limitations by decomposing appearance into low and high-frequency components and performing editing in the feature space. Freditor uses a two-branch architecture: a high-frequency branch reconstructs detailed scenes with standard NeRF, while a low-frequency branch performs style editing in the feature space. The method utilizes low-pass filtering, feature-space stylization, and a shared decoder to combine edited low-frequency components with original high-frequency details. Freditor achieves high-fidelity editing by preserving details through frequency decomposition, surpassing previous methods in visual quality. The feature-space editing allows for controllable stylization intensity during inference, enabling dynamic adjustments without retraining. The trained stylization modules are transferable to new scenes without retraining, enabling efficient editing of diverse 3D content. The blending of high-frequency details may sometimes conflict with the target style, requiring more intelligent blending strategies. Further exploration of different low-frequency filter levels and their impact on editing effectiveness and artifact generation is warranted. nerf editing, frequency decomposition, style transfer, 3d scene manipulation, generative models
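A small 2D analogue, assuming a Gaussian low-pass and NumPy/SciPy, of the frequency split that motivates Freditor; in the paper the decomposition and editing happen in the NeRF feature space rather than directly on RGB images, and the `stylize` call below is a placeholder:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_frequencies(image: np.ndarray, sigma: float = 5.0):
    """Split a float HxWxC image into a low-frequency (style-bearing)
    component and a high-frequency (detail-bearing) residual."""
    low = gaussian_filter(image, sigma=(sigma, sigma, 0))
    high = image - low
    return low, high

# Edit only the low-frequency band, then restore the details:
#   low, high = split_frequencies(img)
#   edited = stylize(low) + high      # `stylize` is a placeholder 2D editor
```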
2404.02410 Report TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Surrounding Autonomous Driving Scenes Cheng Zhao, Su Sun, Ruoyu Wang, Yuliang Guo, Jun-Jun Wan, Zhou Huang, Xinyu Huang, Yingjie Victor Chen, Liu Ren Most 3D Gaussian Splatting (3D-GS) based methods for urban scenes initialize 3D Gaussians directly with 3D LiDAR points, which not only underutilizes LiDAR data capabilities but also overlooks the potential advantages of fusing LiDAR with camera data. In this paper, we design a novel tightly coupled LiDAR-Camera Gaussian Splatting (TCLC-GS) to fully leverage the combined strengths of both LiDAR and camera sensors, enabling rapid, high-quality 3D reconstruction and novel view RGB/depth synthesis. TCLC-GS designs a hybrid explicit (colorized 3D mesh) and implicit (hierarchical octree feature) 3D representation derived from LiDAR-camera data, to enrich the properties of 3D Gaussians for splatting. 3D Gaussian's properties are not only initialized in alignment with the 3D mesh which provides more completed 3D shape and color information, but are also endowed with broader contextual information through retrieved octree implicit features. During the Gaussian Splatting optimization process, the 3D mesh offers dense depth information as supervision, which enhances the training process by learning of a robust geometry. Comprehensive evaluations conducted on the Waymo Open Dataset and nuScenes Dataset validate our method's state-of-the-art (SOTA) performance. Utilizing a single NVIDIA RTX 3090 Ti, our method demonstrates fast training and achieves real-time RGB and depth rendering at 90 FPS in resolution of 1920x1280 (Waymo), and 120 FPS in resolution of 1600x900 (nuScenes) in urban scenarios. This paper presents TCLC-GS, a novel tightly coupled LiDAR-Camera Gaussian Splatting method for rapid and high-quality 3D reconstruction and novel view synthesis in autonomous driving scenes. Existing 3D Gaussian Splatting methods underutilize LiDAR data and the potential of LiDAR-camera fusion, limiting their accuracy and quality in complex urban environments. TCLC-GS leverages a hybrid 3D representation with explicit (colorized 3D mesh) and implicit (hierarchical octree feature) information derived from LiDAR-camera data to enhance the initialization and optimization of 3D Gaussians. TCLC-GS achieves state-of-the-art performance on the Waymo Open Dataset and nuScenes Dataset, surpassing baselines in image and depth synthesis quality. The method demonstrates fast training and enables real-time RGB and depth rendering at around 90 FPS (1920x1280) for Waymo and 120 FPS (1600x900) for nuScenes on a single NVIDIA RTX 3090 Ti. Ablation studies validate the effectiveness of the colorized 3D mesh, octree implicit representation, and dense depth supervision in improving performance. The depth synthesis performance depends on the density of LiDAR data, showing relatively lower accuracy on the sparser nuScenes dataset compared to the Waymo dataset. Future work could explore the integration of temporal information and dynamic object modeling within the TCLC-GS framework. lidar-camera fusion, gaussian splatting, 3d reconstruction, novel view synthesis, autonomous driving
2404.02241 Report Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Sergey Yekhanin, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang Diffusion Models (DM) and Consistency Models (CM) are two types of popular generative models with good generation quality on various tasks. When training DM and CM, intermediate weight checkpoints are not fully utilized and only the last converged checkpoint is used. In this work, we find that high-quality model weights often lie in a basin which cannot be reached by SGD but can be obtained by proper checkpoint averaging. Based on these observations, we propose LCSC, a simple but effective and efficient method to enhance the performance of DM and CM, by combining checkpoints along the training trajectory with coefficients deduced from evolutionary search. We demonstrate the value of LCSC through two use cases: $\textbf{(a) Reducing training cost.}$ With LCSC, we only need to train DM/CM with a smaller number of iterations and/or lower batch sizes to obtain comparable sample quality with the fully trained model. For example, LCSC achieves considerable training speedups for CM (23$\times$ on CIFAR-10 and 15$\times$ on ImageNet-64). $\textbf{(b) Enhancing pre-trained models.}$ Assuming full training is already done, LCSC can further improve the generation quality or speed of the final converged models. For example, LCSC achieves better performance using 1 function evaluation (NFE) than the base model with 2 NFE on consistency distillation, and decreases the NFE of DM from 15 to 9 while maintaining the generation quality on CIFAR-10. Our code is available at https://github.com/imagination-research/LCSC. This paper proposes LCSC, a method that enhances Diffusion Models (DM) and Consistency Models (CM) by linearly combining saved checkpoints along the training trajectory using coefficients determined by evolutionary search. DM and CM training often under-utilizes intermediate checkpoints. This paper shows high-quality models often lie in basins reachable not by SGD but by proper checkpoint averaging, which LCSC enables. Given saved checkpoints, LCSC employs an evolutionary algorithm to find optimal linear combination coefficients that minimize metrics like FID. LCSC reduces training cost, achieving similar sample quality with fewer iterations/smaller batch sizes (e.g., 23x speedup for CM on CIFAR-10). LCSC enhances pre-trained models, improving generation quality/speed (e.g., better CM performance with 1 NFE than baseline with 2 NFE). Analysis suggests the optimal combination often involves negative coefficients, highlighting the limitations of traditional averaging like EMA. Current search relies on evolutionary methods, limiting efficiency and potentially finding local optima. Exploring better optimization is needed. LCSC applies uniform coefficients across the model. Finer-grained partitioning (per-layer, per-timestep) might yield further gains. diffusion models, consistency models, weight averaging, evolutionary search, generative models
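A minimal sketch of the combination step itself, assuming the evolutionary search has already produced the coefficients; file names and coefficient values are hypothetical, and this is not the authors' implementation:

```python
import torch

def combine_checkpoints(ckpt_paths, coeffs):
    """Form w = sum_i c_i * w_i over saved checkpoints. Coefficients may be
    negative; LCSC searches them with an evolutionary algorithm that
    minimizes a generation metric such as FID on a small sample set."""
    assert len(ckpt_paths) == len(coeffs)
    combined = None
    for path, c in zip(ckpt_paths, coeffs):
        state = torch.load(path, map_location="cpu")
        if combined is None:
            combined = {k: c * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                combined[k] += c * v.float()
    return combined

# Hypothetical usage:
#   merged = combine_checkpoints(["ckpt_100k.pt", "ckpt_110k.pt"], [1.3, -0.3])
#   model.load_state_dict(merged)
```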
2404.02155 Report Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields Joshua Ahn, Haochen Wang, Raymond A. Yeh, Greg Shakhnarovich Scale-ambiguity in 3D scene dimensions leads to magnitude-ambiguity of volumetric densities in neural radiance fields, i.e., the densities double when scene size is halved, and vice versa. We call this property alpha invariance. For NeRFs to better maintain alpha invariance, we recommend 1) parameterizing both distance and volume densities in log space, and 2) a discretization-agnostic initialization strategy to guarantee high ray transmittance. We revisit a few popular radiance field models and find that these systems use various heuristics to deal with issues arising from scene scaling. We test their behaviors and show our recipe to be more robust. This paper investigates the issue of alpha invariance in neural radiance fields (NeRFs), where the scale ambiguity of 3D scenes leads to magnitude ambiguity of volumetric densities. A robust NeRF algorithm should perform consistently across different scene scales. This paper aims to address this challenge by proposing solutions for alpha invariance in NeRFs. The authors analyze and ablate several popular NeRF architectures, including Vanilla NeRF, TensoRF, DVGO, Plenoxels, and Nerfacto, to study their alpha invariance properties. They propose two key modifications: 1) parameterizing both distance and volume densities in log space using a GumbelCDF activation and 2) a discretization-agnostic initialization strategy to guarantee high ray transmittance. Empirically, volume density (σ) changes by a factor close to 1/k when scene size changes by k. Vanilla NeRF's MLPs with ReLU activation can produce large σ values but are prone to converging to poor local minima. Voxel variants (DVGO, Plenoxels, TensoRF) fail to converge without hardcoded heuristics to handle scene scaling. The assumption of i.i.d. sampled density values during initialization, while simplifying, is imperfect. Further investigation is needed to match the default Plenoxels performance with the proposed modifications. neural radiance fields, nerf, alpha invariance, volume rendering, scene scaling
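A tiny numerical illustration (not from the paper) of the alpha-invariance relation discussed above: the per-interval opacity alpha = 1 - exp(-sigma * delta) is unchanged when interval lengths scale by k and densities by 1/k:

```python
import math

def alpha(sigma: float, delta: float) -> float:
    """Opacity contributed by a ray interval of length delta with density sigma."""
    return 1.0 - math.exp(-sigma * delta)

k = 4.0                    # scene rescaling factor
sigma, delta = 2.5, 0.01   # arbitrary density and interval length
print(alpha(sigma, delta))            # original scene
print(alpha(sigma / k, delta * k))    # scene enlarged by k: identical alpha
```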
2404.02154 Report Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration Akshay Dudhane, Omkar Thawakar, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, Ming-Hsuan Yang All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation. The requirement to tackle multiple degradations using the same model can lead to high-complexity designs with fixed configuration that lack the adaptability to more efficient alternatives. We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks. Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment with a single round of training. This seamless switching is enabled by our weights-sharing mechanism, forming the core of our architecture and facilitating the reuse of initialized module weights. Further, to establish robust weights initialization, we introduce a dynamic pre-training strategy that trains variants of the proposed DyNet concurrently, thereby achieving a 50% reduction in GPU hours. To tackle the unavailability of large-scale dataset required in pre-training, we curate a high-quality, high-resolution image dataset named Million-IRD having 2M image samples. We validate our DyNet for image denoising, deraining, and dehazing in all-in-one setting, achieving state-of-the-art results with 31.34% reduction in GFlops and a 56.75% reduction in parameters compared to baseline models. The source codes and trained models are available at https://github.com/akshaydudhane16/DyNet. This paper presents DyNet, a dynamic network architecture for efficient all-in-one image restoration, incorporating a novel weight-sharing mechanism to reduce parameters and improve computational efficiency. Existing all-in-one image restoration methods have high computational costs and lack flexibility in model depth during training. DyNet addresses this by allowing for seamless switching between bulkier and lightweight variants while maintaining high accuracy. DyNet utilizes a weight-sharing mechanism in an encoder-decoder architecture. Module weights are shared across subsequent modules at each level, controlled by a reuse frequency. A dynamic pre-training strategy is introduced to train both bulky and lightweight variants concurrently, using a new million-scale dataset, Million-IRD. DyNet-L outperforms the baseline PromptIR by 0.82 dB on average across denoising, deraining, and dehazing tasks. DyNet-S, a lightweight variant, achieves a 0.59 dB average improvement over PromptIR with 31.34% fewer GFlops and 56.75% fewer parameters. The proposed dynamic pre-training strategy reduces training time by 50% compared to traditional methods. The paper explores the performance of DyNet on a limited set of image restoration tasks. Further investigation into the impact of varying module weight reuse frequencies on model performance is left for future work. image restoration, all-in-one restoration, dynamic network, weight sharing, large-scale pre-training
2404.02152 Report GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image Chong Bao, Yinda Zhang, Yuan Li, Xiyu Zhang, Bangbang Yang, Hujun Bao, Marc Pollefeys, Guofeng Zhang, Zhaopeng Cui Recently, we have witnessed the explosive growth of various volumetric representations in modeling animatable head avatars. However, due to the diversity of frameworks, there is no practical method to support high-level applications like 3D head avatar editing across different representations. In this paper, we propose a generic avatar editing approach that can be universally applied to various 3DMM-driven volumetric head avatars. To achieve this goal, we design a novel expression-aware modification generative model, which enables lifting 2D editing from a single image to a consistent 3D modification field. To ensure the effectiveness of the generative modification process, we develop several techniques, including an expression-dependent modification distillation scheme to draw knowledge from the large-scale head avatar model and 2D facial texture editing tools, implicit latent space guidance to enhance model convergence, and a segmentation-based loss reweight strategy for fine-grained texture inversion. Extensive experiments demonstrate that our method delivers high-quality and consistent results across multiple expressions and viewpoints. Project page: https://zju3dv.github.io/geneavatar/ GeneAvatar enables fine-grained 3D head avatar editing in various volumetric representations from a single-view image. Existing 3D avatar editing methods lack adaptability across representations, user-friendliness, or fidelity across expressions and viewpoints. The method utilizes an expression-aware modification generative model. It learns expression-dependent 3D modifications from a single edited image and applies them consistently across different expressions and viewpoints. The method generates consistent editing results across viewpoints and expressions. It is adaptable to various 3DMM-driven volumetric avatar representations. It supports both global and local editing using off-the-shelf 2D editing tools. The method currently cannot handle adding new objects or changing hairstyles. Improving editing speed for real-time applications is a future direction. 3d avatar editing, volumetric representation, neural radiance fields, 3dmm, single-view editing
2404.02148 Report Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Orthogonal Diffusion Models Zeyu Yang, Zijie Pan, Chun Gu, Li Zhang Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models which are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it is impractical to adapt this paradigm to 4D generation directly. Despite that, the available video and 3D data are adequate for training video and multi-view diffusion models separately that can provide satisfactory dynamic and geometric priors respectively. To take advantage of both, this paper presents Diffusion$^2$, a novel framework for dynamic 3D content creation that reconciles the knowledge about geometric consistency and temporal smoothness from these models to directly sample dense multi-view multi-frame images which can be employed to optimize continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composition of pretrained video and multi-view diffusion models based on the probability structure of the target image array. Owing to the high parallelism of the proposed image generation process and the efficiency of the modern 4D reconstruction pipeline, our framework can generate 4D content within a few minutes. Additionally, our method circumvents the reliance on 4D data, thereby having the potential to benefit from the scaling of the foundation video and multi-view diffusion models. Extensive experiments demonstrate the efficacy of our proposed framework and its ability to flexibly handle various types of prompts. This paper presents Diffusion$^2$, a novel framework for dynamic 3D content creation that combines pretrained video and multi-view diffusion models to directly sample dense multi-view multi-frame images for efficient 4D content generation. Existing 4D generation methods rely on scarce synchronized multi-view video data or suffer from slow optimization. This framework leverages vast available monocular video and static multi-view data to achieve efficient 4D generation. The method leverages the conditional independence between geometry and dynamics in multi-view video frames. By blending scores from pretrained video and multi-view diffusion models, it directly samples image arrays, which are then used for 4D reconstruction. Achieves comparable quality to state-of-the-art optimization-based methods in image-to-4D generation. Generates higher-fidelity and more consistent results than existing methods in video-to-4D generation. Successfully animates static 3D models with realistic and diverse dynamics. Performance is limited by the quality of foundation diffusion models, especially for challenging viewpoints and thin structures. The assumption of conditional independence may not hold in cases with extreme rotations, although the method still works well in practice. 4d generation, diffusion models, multi-view synthesis, video generation, 3d reconstruction
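A schematic sketch of the score-composition step: at each denoising step, a video model's prediction along the time axis and a multi-view model's prediction along the view axis are blended over the whole view-by-frame image array. The denoisers, the blending weight, and the array shape are assumptions for illustration; the actual sampler follows the probabilistic structure derived in the paper:

```python
import torch

def composed_epsilon(x, t, video_eps, mv_eps, s: float = 0.5):
    """Blend noise predictions for an image array x of shape (V, F, C, H, W):
    `video_eps` denoises each view's frame sequence (temporal prior) and
    `mv_eps` denoises each frame's view set (geometric prior)."""
    V, F = x.shape[:2]
    eps_time = torch.stack([video_eps(x[v], t) for v in range(V)])          # (V, F, ...)
    eps_view = torch.stack([mv_eps(x[:, f], t) for f in range(F)], dim=1)   # (V, F, ...)
    return s * eps_time + (1.0 - s) * eps_view
```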
2404.02145 Report Iterated Learning Improves Compositionality in Large Vision-Language Models Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, recent investigations find that most, if not all, of our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man in black" and "a girl in black facing a man in white". Moreover, prior work suggests that compositionality doesn't arise with scale: larger model sizes or training data don't help. This paper develops a new iterated training algorithm that incentivizes compositionality. We draw on decades of cognitive science research that identifies cultural transmission, the need to teach a new generation, as a necessary inductive prior that incentivizes humans to develop compositional languages. Specifically, we reframe vision-language contrastive learning as the Lewis Signaling Game between a vision agent and a language agent, and operationalize cultural transmission by iteratively resetting one of the agent's weights during training. After every iteration, this training paradigm induces representations that become "easier to learn", a property of compositional languages: e.g. our model trained on CC3M and CC12M improves standard CLIP by 4.7% and 4.0%, respectively, on the SugarCrepe benchmark. This paper proposes an iterated learning algorithm for vision-language models, inspired by cultural transmission in humans, to improve compositionality in representation learning. Current vision-language models struggle with compositionality, failing to generalize understandings from individual concepts to complex scenes, limiting their ability to understand novel compositions. The method reframes contrastive learning as a Lewis Signaling Game, incorporating a shared codebook as a communication bottleneck, and iteratively resetting the language agent to simulate cultural transmission. Iterated learning leads to significantly improved performance on compositionality benchmarks (CREPE, SugarCrepe, Cola, Winoground) compared to standard CLIP and other baselines. The learned representations are empirically shown to be "easier to learn" for new language agents, supporting the hypothesis drawn from cognitive science. Iterated learning maintains comparable performance to standard training on image recognition tasks, indicating no sacrifice in recognition ability for improved compositionality. The learning process can be unstable due to randomness introduced when resetting language agents. Future work could explore more stable training strategies and investigate the applicability of iterated learning to other domains beyond vision and language. compositionality, vision-language models, iterated learning, cultural transmission, contrastive learning
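The iterated-learning recipe (contrastive training with a periodically re-initialized language agent) can be illustrated with a toy CLIP-style loop. The agent constructors, data loader, and hyperparameters are hypothetical, and the shared-codebook bottleneck is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def iterated_contrastive_training(vision_agent, make_language_agent, loader,
                                  n_generations=5, epochs_per_generation=10, lr=1e-4):
    language_agent = make_language_agent()
    for gen in range(n_generations):
        if gen > 0:
            # "Cultural transmission": a fresh language agent must re-learn from the vision agent.
            language_agent = make_language_agent()
        opt = torch.optim.AdamW(list(vision_agent.parameters()) +
                                list(language_agent.parameters()), lr=lr)
        for _ in range(epochs_per_generation):
            for images, texts in loader:
                img_emb = F.normalize(vision_agent(images), dim=-1)
                txt_emb = F.normalize(language_agent(texts), dim=-1)
                logits = img_emb @ txt_emb.t() / 0.07               # temperature-scaled similarities
                labels = torch.arange(logits.shape[0])
                loss = (F.cross_entropy(logits, labels) +
                        F.cross_entropy(logits.t(), labels)) / 2    # symmetric InfoNCE loss
                opt.zero_grad()
                loss.backward()
                opt.step()
    return vision_agent, language_agent
```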
2404.02125 Report 3D Congealing: 3D-Aware Image Alignment in the Wild Yunzhi Zhang, Zizhang Li, Amit Raj, Andreas Engelhardt, Yuanzhen Li, Tingbo Hou, Jiajun Wu, Varun Jampani We propose 3D Congealing, a novel problem of 3D-aware alignment for 2D images capturing semantically similar objects. Given a collection of unlabeled Internet images, our goal is to associate the shared semantic parts from the inputs and aggregate the knowledge from 2D images to a shared 3D canonical space. We introduce a general framework that tackles the task without assuming shape templates, poses, or any camera parameters. At its core is a canonical 3D representation that encapsulates geometric and semantic information. The framework optimizes for the canonical representation together with the pose for each input image, and a per-image coordinate map that warps 2D pixel coordinates to the 3D canonical frame to account for the shape matching. The optimization procedure fuses prior knowledge from a pre-trained image generative model and semantic information from input images. The former provides strong knowledge guidance for this under-constrained task, while the latter provides the necessary information to mitigate the training data bias from the pre-trained model. Our framework can be used for various tasks such as correspondence matching, pose estimation, and image editing, achieving strong results on real-world image datasets under challenging illumination conditions and on in-the-wild online image collections. Introduces 3D Congealing, a novel method to align images of semantically similar objects in a shared 3D space, without relying on shape templates, poses, or camera parameters. Enables various downstream tasks like 6-DoF object pose estimation, pose-aware image filtering, and image editing by establishing 2D-3D correspondence between input images and a canonical 3D representation. Fuses prior 3D knowledge from a pre-trained text-to-image generative model with semantic information from input images using pre-trained semantic feature extractors (DINO). Optimizes for a canonical 3D shape, individual image poses, and dense 2D-3D correspondence maps. Achieves comparable pose estimation accuracy to state-of-the-art methods requiring pose priors on a challenging multi-illumination dataset. Successfully aligns diverse internet images of objects and landmarks, demonstrating robustness to variations in appearance, viewpoint, and illumination. Enables applications like image editing by establishing dense 2D-2D correspondences through the shared 3D space, outperforming direct feature matching. Performance depends on the accuracy of the initial shape generated by the pre-trained model. Feature ambiguity in objects with high symmetry can lead to incorrect pose estimations. 3d alignment, image congealing, pose estimation, generative models, semantic features
2404.02101 Report CameraCtrl: Enabling Camera Control for Text-to-Video Generation Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, Ceyuan Yang Controllability plays a crucial role in video generation since it allows users to create desired content. However, existing models largely overlooked the precise control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for text-to-video(T2V) models. After precisely parameterizing the camera trajectory, a plug-and-play camera module is then trained on a T2V model, leaving others untouched. Additionally, a comprehensive study on the effect of various datasets is also conducted, suggesting that videos with diverse camera distribution and similar appearances indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise and domain-adaptive camera control, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs. Our project website is at: https://hehao13.github.io/projects-CameraCtrl/. Introduces CameraCtrl, a plug-and-play camera control module for text-to-video (T2V) generation, enabling precise control over camera viewpoints. Existing T2V models lack precise control over camera viewpoints, crucial for realism and user engagement. Utilizes Plücker embeddings to represent camera parameters and incorporates a camera encoder trained on a dataset with diverse camera poses and similar appearance to the base T2V model. Achieves more precise camera control compared to AnimateDiff and MotionCtrl. Demonstrates generalizability by effectively controlling camera viewpoints in various video domains and integrating with other video control methods like SparseCtrl. A comprehensive study on training datasets reveals that data with similar appearance and diverse camera poses, like RealEstate10K, yields the best results. Generalization relies on the diversity of training data, future work could focus on collecting more diverse videos. Current work evaluates CameraCtrl primarily on U-Net based T2V models, future work could explore compatibility with transformer-based generators like Sora. camera control, text-to-video generation, diffusion models, plücker embeddings, controllable video generation
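The Plücker parameterization of camera rays used as conditioning can be computed per pixel as (o × d, d); this is a minimal sketch assuming standard pinhole intrinsics K and a camera-to-world pose, not code from the paper:

```python
import torch
import torch.nn.functional as F

def plucker_embedding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Per-pixel Plücker ray embedding (o x d, d). K: (3, 3) intrinsics,
    c2w: (4, 4) camera-to-world pose. Returns an (H, W, 6) tensor."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)  # homogeneous pixel coords
    dirs_cam = pix @ torch.linalg.inv(K).T                                # back-project to camera space
    dirs_world = F.normalize(dirs_cam @ c2w[:3, :3].T, dim=-1)            # rotate into the world frame
    origin = c2w[:3, 3].expand_as(dirs_world)                             # ray origin = camera center
    moment = torch.cross(origin, dirs_world, dim=-1)                      # o x d
    return torch.cat([moment, dirs_world], dim=-1)
```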
2404.01984 Report Fashion Style Editing with Generative Human Prior Chaerin Kong, Seungyong Lee, Soohyeok Im, Wonsuk Yang Image editing has been a long-standing challenge in the research community with its far-reaching impact on numerous applications. Recently, text-driven methods started to deliver promising results in domains like human faces, but their applications to more complex domains have been relatively limited. In this work, we explore the task of fashion style editing, where we aim to manipulate the fashion style of human imagery using text descriptions. Specifically, we leverage a generative human prior and achieve fashion style editing by navigating its learned latent space. We first verify that the existing text-driven editing methods fall short for our problem due to their overly simplified guidance signal, and propose two directions to reinforce the guidance: textual augmentation and visual referencing. Combined with our empirical findings on the latent space structure, our Fashion Style Editing framework (FaSE) successfully projects abstract fashion concepts onto human images and introduces exciting new applications to the field. This paper presents FaSE, a framework for fashion style editing of human images using text descriptions, addressing the limitations of existing methods in handling complex domains like fashion. Fashion style editing with text descriptions is a challenging task due to the complexity of human imagery and the subjective nature of fashion concepts. Existing text-driven methods fall short in providing sufficient guidance for this task. FaSE leverages a generative human prior (StyleGAN-Human) and enhances text guidance using two methods: 1) textual augmentation with a large language model and 2) visual referencing by retrieving similar images from a fashion database and guiding the model in the latent space. FaSE successfully edits human images according to fashion style prompts, outperforming baseline methods. The authors found that both textual augmentation and visual referencing significantly improve editing performance. Empirical analysis of the StyleGAN-Human latent space reveals a hierarchical structure where mid-level features control garment shape and fine-level features control texture. The reference database is limited in size and diversity. The retrieval mechanism for reference images could be further improved. image editing, fashion style editing, text-driven image manipulation, generative adversarial networks, vision-language models
2404.01843 Report Sketch3D: Style-Consistent Guidance for Sketch-to-3D Generation Wangguandong Zheng, Haifeng Xia, Rui Chen, Ming Shao, Siyu Xia, Zhengming Ding Recently, image-to-3D approaches have achieved significant results with a natural image as input. However, it is not always possible to access these enriched color input samples in practical applications, where only sketches are available. Existing sketch-to-3D approaches have limited applicability due to the lack of color information and multi-view content. To overcome these limitations, this paper proposes a novel generation paradigm, Sketch3D, to generate realistic 3D assets with shape aligned with the input sketch and color matching the textual description. Concretely, Sketch3D first instantiates the given sketch in the reference image through the shape-preserving generation process. Second, the reference image is leveraged to deduce a coarse 3D Gaussian prior, and multi-view style-consistent guidance images are generated based on the renderings of the 3D Gaussians. Finally, three strategies are designed to optimize 3D Gaussians, i.e., structural optimization via a distribution transfer mechanism, color optimization with a straightforward MSE loss, and sketch similarity optimization with a CLIP-based geometric similarity loss. Extensive visual comparisons and quantitative analysis illustrate the advantage of our Sketch3D in generating realistic 3D assets while preserving consistency with the input. Sketch3D, a novel framework for generating realistic 3D assets from sketches, aligning shape with the input and color with textual descriptions. Existing sketch-to-3D methods struggle with limited color information, single-category generation, and lack of realism. Sketch3D addresses these limitations by leveraging both sketch and text prompts for realistic and customizable 3D asset creation. 1. **Reference Image Generation:** Create a color image from the sketch and text prompt using ControlNet. 2. **3D Prior Initialization:** Generate a coarse 3D Gaussian representation from the reference image using a 3D diffusion model. 3. **Style-Consistent Optimization:** Generate multi-view guidance images with IP-Adapter and optimize the 3D Gaussian representation for structure, color, and sketch similarity using a distribution transfer mechanism, MSE loss, and CLIP-based geometric similarity loss, respectively. Sketch3D outperforms baselines in generating realistic 3D assets with consistent shapes and colors. Quantitative analysis using CLIP similarity and SSIM demonstrates Sketch3D's superior alignment with input sketches and text prompts. Ablation studies validate the effectiveness of the proposed distribution transfer mechanism, MSE loss, and CLIP geometric similarity loss. Generation quality is limited by the performance of ControlNet in generating the reference image. Achieving fine-grained control over details in complex sketches remains challenging. sketch-to-3d generation, 3d gaussian splatting, text-guided synthesis, style-consistent guidance, controllable image synthesis
2404.01810 Report Surface Reconstruction from Gaussian Splatting via Novel Stereo Views Yaniv Wolf, Amit Bracha, Ron Kimmel The Gaussian splatting for radiance field rendering method has recently emerged as an efficient approach for accurate scene representation. It optimizes the location, size, color, and shape of a cloud of 3D Gaussian elements to visually match, after projection, or splatting, a set of given images taken from various viewing directions. And yet, despite the proximity of Gaussian elements to the shape boundaries, direct surface reconstruction of objects in the scene is a challenge. We propose a novel approach for surface reconstruction from Gaussian splatting models. Rather than relying on the Gaussian elements' locations as a prior for surface reconstruction, we leverage the superior novel-view synthesis capabilities of 3DGS. To that end, we use the Gaussian splatting model to render pairs of stereo-calibrated novel views from which we extract depth profiles using a stereo matching method. We then combine the extracted RGB-D images into a geometrically consistent surface. The resulting reconstruction is more accurate and shows finer details when compared to other methods for surface reconstruction from Gaussian splatting models, while requiring significantly less compute time compared to other surface reconstruction methods. We performed extensive testing of the proposed method on in-the-wild scenes, taken by a smartphone, showcasing its superior reconstruction abilities. Additionally, we tested the proposed method on the Tanks and Temples benchmark, and it has surpassed the current leading method for surface reconstruction from Gaussian splatting models. Project page: https://gs2mesh.github.io/. This paper introduces a novel method for surface reconstruction from 3D Gaussian Splatting (3DGS) models by leveraging the generation of stereo-calibrated novel views and applying a stereo matching algorithm. Directly reconstructing surfaces from 3DGS models is challenging due to the misalignment between Gaussian element locations and the actual surface geometry. Existing methods either produce noisy results or require extensive computational time. The pipeline involves capturing a scene with 3DGS, rendering stereo-calibrated novel views, extracting depth maps using a stereo matching algorithm (DLNR), and fusing the depth data using the Truncated Signed Distance Function (TSDF) algorithm to generate a smooth and consistent mesh. Outperforms SuGaR, the current state-of-the-art method for surface reconstruction from 3DGS, on the Tanks and Temples benchmark. Achieves comparable visual quality to neural reconstruction methods like BakedSDF on the Mip-NeRF360 dataset while requiring significantly less processing time. Demonstrates superior performance in reconstructing accurate and noise-free meshes from in-the-wild scenes captured using smartphones. Reconstruction quality depends on the accuracy of the initial 3DGS scene capture. The stereo matching algorithm used is inherently susceptible to issues with transparent surfaces, potentially affecting reconstruction accuracy in those areas. surface reconstruction, gaussian splatting, 3dgs, stereo matching, novel view synthesis
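Two small pieces of the pipeline are easy to make concrete: constructing a stereo-calibrated second viewpoint by offsetting the camera along its own x-axis, and converting a predicted disparity map to depth with the pinhole relation depth = f·b/disparity. The baseline value and pose convention below are assumptions for illustration:

```python
import numpy as np

def stereo_right_pose(c2w_left: np.ndarray, baseline: float = 0.07) -> np.ndarray:
    """Offset a 4x4 camera-to-world pose along the camera's own x-axis to get
    the second camera of a stereo-calibrated pair of novel views."""
    c2w_right = c2w_left.copy()
    c2w_right[:3, 3] += baseline * c2w_left[:3, 0]   # columns of R are the camera axes in world coords
    return c2w_right

def disparity_to_depth(disparity: np.ndarray, focal_px: float,
                       baseline: float = 0.07, eps: float = 1e-6) -> np.ndarray:
    """Rectified pinhole stereo: depth = focal_length * baseline / disparity (pixels)."""
    return focal_px * baseline / np.maximum(disparity, eps)
```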
2404.01717 Report AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation Rui Xie, Ying Tai, Chen Zhao, Kai Zhang, Zhenyu Zhang, Jun Zhou, Xiaoqian Ye, Qian Wang, Jian Yang Blind super-resolution methods based on stable diffusion showcase formidable generative capabilities in reconstructing clear high-resolution images with intricate details from low-resolution inputs. However, their practical applicability is often hampered by poor efficiency, stemming from the requirement of hundreds or thousands of sampling steps. Inspired by the efficient adversarial diffusion distillation (ADD), we design AddSR to address this issue by incorporating the ideas of both distillation and ControlNet. Specifically, we first propose a prediction-based self-refinement strategy to provide high-frequency information in the student model output with marginal additional time cost. Furthermore, we refine the training process by employing HR images, rather than LR images, to regulate the teacher model, providing a more robust constraint for distillation. Second, we introduce a timestep-adaptive ADD to address the perception-distortion imbalance problem introduced by original ADD. Extensive experiments demonstrate that our AddSR generates better restoration results, while achieving faster speed than previous SD-based state-of-the-art models (e.g., $7\times$ faster than SeeSR). Proposes AddSR, an efficient and effective Stable Diffusion based model for blind super-resolution, achieving high perceptual quality within a few sampling steps by incorporating distillation and ControlNet. Existing blind super-resolution methods based on stable diffusion, while powerful, suffer from poor efficiency due to the need for hundreds or thousands of sampling steps, hindering their practical use. AddSR utilizes a teacher-student distillation framework with several key innovations: a prediction-based self-refinement (PSR) strategy to provide high-frequency details, training the teacher model on HR images for better guidance, and a timestep-adaptive adversarial diffusion distillation (TA-ADD) to balance perception and distortion. AddSR-4 achieves state-of-the-art results on perceptual quality metrics (MANIQA, MUSIQ, CLIPIQA) across various degradation levels and real-world images. AddSR significantly reduces inference steps compared to other SD-based methods, achieving comparable results to SeeSR in just 1-4 steps and being 7 times faster. The effectiveness of PSR and TA-ADD is validated through ablation studies, showing improvements in perceptual quality, fidelity, and reduced hallucinations. Despite speed improvements, AddSR's inference time still lags behind GAN-based methods due to the complexity of SD and ControlNet. Future work will focus on streamlining network architecture for greater efficiency. blind super-resolution, stable diffusion, knowledge distillation, controlnet, perception-distortion trade-off
2404.01709 Report Upsample Guidance: Scale Up Diffusion Models without Training Juno Hwang, Yong-Hyun Park, Junghyo Jo Diffusion models have demonstrated superior performance across various generative tasks including images, videos, and audio. However, they encounter difficulties in directly generating high-resolution samples. Previously proposed solutions to this issue involve modifying the architecture, further training, or partitioning the sampling process into multiple stages. These methods have the limitation of not being able to directly utilize pre-trained models as-is, requiring additional work. In this paper, we introduce upsample guidance, a technique that adapts pretrained diffusion model (e.g., $512^2$) to generate higher-resolution images (e.g., $1536^2$) by adding only a single term in the sampling process. Remarkably, this technique does not necessitate any additional training or relying on external models. We demonstrate that upsample guidance can be applied to various models, such as pixel-space, latent space, and video diffusion models. We also observed that the proper selection of guidance scale can improve image quality, fidelity, and prompt alignment. This paper introduces "upsample guidance (UG)", a novel technique to adapt pre-trained diffusion models to generate higher-resolution images without additional training or external models. Generating high-resolution images with diffusion models is challenging. Existing solutions require modifications to architecture, training from scratch, or using external models, leading to increased computational costs. UG adds a single term to the sampling process, derived from signal-to-noise ratio (SNR) matching, which guides the model towards consistency with the trained low-resolution component. UG successfully generates high-resolution images across various diffusion models, including pixel-space, latent-space, and video diffusion models. The method effectively resolves artifacts and improves image quality, fidelity, and prompt alignment by adjusting the guidance scale. UG incurs minimal computational overhead, especially with recent advancements in fast sampling techniques. The current implementation relies on a simple guidance scale design, which could be further improved. While spatial upsampling is well-explored, further research is needed for optimal temporal upsampling in video and audio models. diffusion models, high-resolution image generation, upsampling, signal-to-noise ratio matching, guidance
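One plausible (and deliberately simplified) rendering of the idea, nudging the high-resolution noise prediction so that its downsampled version agrees with the model's own prediction at the trained resolution, is sketched below; the paper's actual term is derived from SNR matching and differs from this toy version:

```python
import torch
import torch.nn.functional as F

def guided_eps(model, x_t: torch.Tensor, t, scale: float = 0.3, factor: int = 2):
    # x_t: (B, C, H, W) noisy sample at the *larger* target resolution.
    # `model` is a noise-prediction network trained at the smaller resolution.
    eps_hi = model(x_t, t)                                # prediction on the big canvas
    x_lo = F.avg_pool2d(x_t, factor)                      # low-resolution view of the same sample
    eps_lo = model(x_lo, t)                               # prediction at the trained resolution
    # Push the pooled high-res prediction toward the low-res one (toy consistency term;
    # the proper correction in the paper comes from SNR matching, which this omits).
    delta = eps_lo - F.avg_pool2d(eps_hi, factor)
    return eps_hi + scale * F.interpolate(delta, scale_factor=factor, mode="nearest")
```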
2404.01543 Report Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes Ziqian Bai, Feitong Tan, Sean Fanello, Rohit Pandey, Mingsong Dou, Shichen Liu, Ping Tan, Yinda Zhang 3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However, the computational cost of these methods remains a significant barrier to their widespread adoption, particularly in real-time applications such as virtual reality and teleconferencing. While attempts have been made to develop fast neural rendering approaches for static scenes, these methods cannot be simply employed to support realistic facial expressions, such as in the case of a dynamic facial performance. To address these challenges, we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Our key idea lies in the introduction of local hash table blendshapes, which are learned and attached to the vertices of an underlying face parametric model. These per-vertex hash-tables are linearly merged with weights predicted via a CNN, resulting in expression dependent embeddings. Our novel representation enables efficient density and color predictions using a lightweight MLP, which is further accelerated by a hierarchical nearest neighbor search method. Extensive experiments show that our approach runs in real-time while achieving comparable rendering quality to state-of-the-arts and decent results on challenging expressions. This paper introduces a novel 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Current state-of-the-art 3D head avatars, while photorealistic, are computationally expensive and impractical for real-time applications such as VR and teleconferencing. The paper introduces “local hash table blendshapes”, small hash tables attached to vertices of an underlying face parametric model. These are linearly merged with weights predicted by a CNN, resulting in expression-dependent embeddings for efficient density and color predictions using a lightweight MLP, further accelerated by a hierarchical nearest neighbor search. The model achieves real-time rendering (over 30 FPS at 512x512 resolution). It maintains comparable rendering quality to state-of-the-art methods like MonoAvatar. It produces significantly better results on challenging expressions compared to existing efficient avatars like NeRFBlendshape and INSTA. The model exhibits floaters under viewpoints and expressions far from the training distribution. Performance is less stable around the mouth interior due to tracking limitations. Future work involves exploring more expensive training strategies like adversarial loss or joint face fitting refinement to mitigate limitations and enhance quality. 3d head avatar, neural implicit representation, real-time rendering, hash encoding, facial expression
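The core "linear merge of per-vertex hash-table blendshapes" can be written as a single weighted sum; the tensor shapes and the absence of any weight normalization are assumptions for this sketch:

```python
import torch

def expression_embeddings(vertex_tables: torch.Tensor, blend_weights: torch.Tensor) -> torch.Tensor:
    # vertex_tables: (V, B, D) -- for each of V mesh vertices, B blendshape hash-table
    # embeddings of dimension D. blend_weights: (V, B) expression-dependent weights
    # predicted by a CNN. Output: (V, D) merged, expression-dependent embeddings.
    return torch.einsum("vb,vbd->vd", blend_weights, vertex_tables)
```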
2404.01424 Report DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery Yixuan Zhu, Ao Li, Yansong Tang, Wenliang Zhao, Jie Zhou, Jiwen Lu The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound diffusion prior about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent knowledge as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates well-designed guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets. The persuasive results underscore its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes. This paper proposes DPMesh, a novel framework for recovering 3D human mesh from images, especially under severe occlusion, by leveraging the structure and spatial relationship knowledge from pre-trained text-to-image diffusion models. Recovering occluded human mesh from images remains a significant challenge for existing methods due to the difficulty in extracting effective features under severe occlusion. Diffusion models offer a promising alternative with their rich prior knowledge of object structure and spatial relationships. DPMesh utilizes a pre-trained text-to-image diffusion model as the backbone for single-step feature extraction. It injects refined 2D keypoint information as conditions to guide the denoising U-Net. Moreover, a noisy key-point reasoning approach is introduced to enhance robustness against noisy 2D observations. DPMesh outperforms state-of-the-art methods on various occlusion benchmarks, including 3DPW-OC, 3DPW-PC, 3DOH, and 3DPW-Crowd. The diffusion-based backbone effectively captures occlusion-aware information, as visualized in the cross-attention maps. Ablation studies validate the contribution of the diffusion-based backbone, condition injection, and noisy key-point reasoning to the overall performance. The reliance on an off-the-shelf 2D key-point detector introduces sensitivity to the detector's performance. Future work could explore extending DPMesh to handle multi-view images or video sequences for enhanced accuracy and temporal consistency. human mesh recovery, occlusion handling, diffusion models, computer vision, pose estimation
2404.01367 Report Bigger is not Always Better: Scaling Properties of Latent Diffusion Models Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architecture and inference algorithms have been shown to effectively boost sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. Moreover, we extend our study to demonstrate the generalizability of these findings by applying various diffusion samplers, exploring diverse downstream tasks, evaluating post-distilled models, as well as comparing performance relative to training compute. These findings open up new pathways for the development of LDM scaling strategies which can be employed to enhance generative capabilities within limited inference budgets. This paper investigates the scaling properties of Latent Diffusion Models (LDMs) for image generation, focusing on the relationship between model size and sampling efficiency. LDMs are powerful but computationally expensive. Understanding how model size affects efficiency is crucial for optimizing their performance under real-world constraints. The authors trained a suite of LDMs ranging from 39 million to 5 billion parameters, evaluating their performance on text-to-image generation and downstream tasks like super-resolution and Dreambooth. Pretraining performance scales with training compute, but smaller models can be more efficient under limited sampling budgets. The efficiency trends hold across different diffusion samplers (DDIM, DDPM, DPM-Solver++) and are also observed in distilled LDMs. Larger models generally show better downstream performance after fine-tuning, highlighting the importance of pretraining quality. The evaluation relies on FID and CLIP scores, which might not perfectly correlate with human perception of visual quality. The study focuses on a specific LDM architecture. Further research is needed to generalize the findings to other LDM families, especially transformer-based ones. latent diffusion models, sampling efficiency, scaling laws, text-to-image generation, diffusion distillation
2404.01300 Report NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields Muhammad Zubair Irshad, Sergey Zakahrov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images. Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as pointclouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.6 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection. Introduces NeRF-MAE, the first self-supervised 3D pre-training method for Neural Radiance Fields using a masked autoencoder approach. Leverages the dense and regular structure of NeRF's radiance and density grid to learn effective 3D representations from readily available posed RGB images, overcoming limitations of sparse and irregular 3D representations like point clouds. Extracts an explicit 4D radiance and density grid from a trained NeRF model. Employs a masked autoencoder architecture with a 3D Swin Transformer encoder and a voxel decoder to reconstruct masked patches of the grid, learning semantic and spatial relationships within 3D scenes. Significantly outperforms state-of-the-art self-supervised 3D pre-training methods and NeRF-based scene understanding baselines on tasks like 3D object detection and semantic voxel labeling. Demonstrates strong generalization capabilities, achieving superior performance on cross-dataset transfer tasks. Showcases scalability, with performance improving as the amount and quality of pre-training data increase. Training efficiency can be further improved to handle larger and more diverse datasets. Exploring the integration of neural rendering and masking for enhanced representation learning. neural radiance fields, 3d representation learning, self-supervised learning, masked autoencoders, 3d vision transformers
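The masked-autoencoder pretext on the explicit radiance-and-density grid amounts to dropping random cubic patches before reconstruction; a minimal masking sketch (assuming a 4-channel grid whose sides are divisible by the patch size) is shown below:

```python
import torch

def mask_radiance_grid(grid: torch.Tensor, patch: int = 8, mask_ratio: float = 0.75):
    """Randomly mask cubic patches of an explicit radiance-and-density grid.
    grid: (C, D, H, W) with C = 4 (RGB + sigma); sides assumed divisible by `patch`.
    Returns the masked grid and the per-voxel keep mask."""
    C, D, H, W = grid.shape
    d, h, w = D // patch, H // patch, W // patch
    keep = (torch.rand(d, h, w) >= mask_ratio)                     # per-patch keep/mask decision
    voxel_mask = keep.repeat_interleave(patch, 0) \
                     .repeat_interleave(patch, 1) \
                     .repeat_interleave(patch, 2)                  # expand to per-voxel resolution
    return grid * voxel_mask.float(), voxel_mask
```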
2404.01297 Report Streaming Dense Video Captioning Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: First, we propose a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos as the memory is of a fixed size. Second, we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability, and significantly improves the state-of-the-art on three dense video captioning benchmarks: ActivityNet, YouCook2 and ViTT. Our code is released at https://github.com/google-research/scenic. The paper introduces a novel streaming model for dense video captioning, aiming to address the limitations of existing models in handling long videos and producing detailed descriptions. Existing dense video captioning models struggle with long videos due to computational constraints and often produce limited descriptions. This work proposes a streaming approach to overcome these limitations, enabling real-time processing and richer event descriptions. The proposed model employs two key components: 1) a memory module based on K-means clustering to efficiently process long video inputs with a fixed computational budget, and 2) a streaming decoding algorithm that predicts event captions sequentially at intermediate timestamps (decoding points) using the memory features and previously predicted captions. The streaming model significantly outperforms state-of-the-art methods on three dense video captioning benchmarks (ActivityNet, YouCook2, ViTT) by up to 11.0 CIDEr points. The clustering-based memory module proves effective in capturing diverse video information, outperforming alternative memory mechanisms like EMA and token merging. Increasing the number of decoding points during training enhances performance by providing more supervision and aligning memory features better with target captions. The model occasionally produces duplicate predictions, even with prefix context, suggesting a need for exploring non-maximal suppression techniques in future work. Future work could explore a dedicated benchmark for dense video captioning of long videos to evaluate the model's performance more comprehensively. dense video captioning, streaming models, memory modules, k-means clustering, decoding points
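The fixed-size clustering memory can be sketched as re-running K-means over the old centroids plus the incoming frame tokens; this simplified version ignores the per-centroid counts a weighted update would track:

```python
import torch

def update_memory(memory: torch.Tensor, new_tokens: torch.Tensor, iters: int = 5) -> torch.Tensor:
    """Fixed-size memory via K-means: cluster old centroids plus incoming tokens
    back down to K centroids. memory: (K, D), new_tokens: (N, D)."""
    K = memory.shape[0]
    points = torch.cat([memory, new_tokens], dim=0)
    centroids = memory.clone()                                     # warm-start from the previous memory
    for _ in range(iters):
        assign = torch.cdist(points, centroids).argmin(dim=1)      # nearest-centroid assignment
        for k in range(K):
            members = points[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(dim=0)
    return centroids
```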
2404.01296 Report MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space Armand Comas-Massagué, Di Qiu, Menglei Chai, Marcel Bühler, Amit Raj, Ruiqi Gao, Qiangeng Xu, Mark Matthews, Paulo Gotardo, Octavia Camps, Sergio Orts-Escolano, Thabo Beeler We introduce a novel framework for 3D human avatar generation and personalization, leveraging text prompts to enhance user engagement and customization. Central to our approach are key innovations aimed at overcoming the challenges in photo-realistic avatar synthesis. Firstly, we utilize a conditional Neural Radiance Fields (NeRF) model, trained on a large-scale unannotated multi-view dataset, to create a versatile initial solution space that accelerates and diversifies avatar generation. Secondly, we develop a geometric prior, leveraging the capabilities of Text-to-Image Diffusion Models, to ensure superior view invariance and enable direct optimization of avatar geometry. These foundational ideas are complemented by our optimization pipeline built on Variational Score Distillation (VSD), which mitigates texture loss and over-saturation issues. As supported by our extensive experiments, these strategies collectively enable the creation of custom avatars with unparalleled visual quality and better adherence to input text prompts. You can find more results and videos in our website: https://syntec-research.github.io/MagicMirror MagicMirror is a novel framework for fast, text-guided 3D avatar head generation and personalization, leveraging text-to-image diffusion models and conditional Neural Radiance Fields (NeRFs). Existing methods for text-guided 3D avatar generation struggle with photorealism, multi-view consistency, and limited customization options. MagicMirror addresses these limitations to achieve higher quality and faithfulness to text prompts. MagicMirror employs a conditional NeRF model trained on a diverse multi-view dataset to create a constrained solution space for efficient optimization. It utilizes text-to-image diffusion models as geometry and texture priors for high-quality stylization. A variational score distillation (VSD) objective guides the optimization, improving realism and detail. MagicMirror generates high-quality, personalized 3D avatars with detailed geometry and textures, outperforming existing methods in visual fidelity and text alignment. The framework allows for intuitive customization through text prompts, enabling modifications to facial features, expressions, accessories, and styles. MagicMirror effectively leverages personalized and generic diffusion priors, enabling a balance between identity preservation and creative exploration. Generating undefined shapes, like hair, remains challenging, particularly outside the facial region. Creating new, detached volumes from scratch, such as hands, is not always successful due to limitations in the initial model's training data. 3d avatar generation, text-guided synthesis, neural radiance fields (nerfs), text-to-image diffusion models, avatar personalization
2404.01294 Report CosmicMan: A Text-to-Image Foundation Model for Humans Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin, Wayne Wu We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean resolution of 1488x1255, and attached with precise text annotations deriving from 115 Million attributes in diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic -- easy to integrate into downstream tasks while effective in producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion models, and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem in a breeze. This paper introduces CosmicMan, a specialized text-to-image foundation model for generating high-fidelity human images with meticulous appearance, reasonable structure, and precise text-image alignment, addressing the limitations of general-purpose models in human-centric content generation. Current general-purpose text-to-image models struggle with generating realistic and diverse human images, particularly in capturing nuanced details of human anatomy and attire, hindering downstream human-centric content generation tasks. The authors propose Annotate Anyone, a human-AI cooperative data production paradigm, to build a large-scale, high-quality dataset called CosmicMan-HQ. They also introduce Daring, a training framework that decomposes text descriptions into groups aligned with human body structure, enforcing attention refocusing in the model to improve text-image alignment. CosmicMan outperforms state-of-the-art text-to-image models in generating high-fidelity human images, exhibiting superior performance in both quantitative metrics (FID, Semantic Acc) and human preference evaluations. Annotate Anyone proves effective in constructing a large-scale, high-quality human-centric dataset, CosmicMan-HQ, which contributes significantly to the model's performance. The Daring training framework, specifically the HOLA loss and data discretization, effectively enhances the model's ability to accurately generate images aligned with detailed descriptions, particularly for dense concepts related to human appearance. The authors acknowledge the need for continuous operation of Annotate Anyone to produce subsequent versions of CosmicMan-HQ, dynamically aligning with evolving real-world data. Future work includes providing up-to-date human-specialized foundation models trained on new versions of their dataset to support long-term research in human-centric content generation. text-to-image generation, foundation models, human-centric content generation, data production, text-image alignment
2404.01292 Report Measuring Style Similarity in Diffusion Models Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, Tom Goldstein Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence as their proliferation increases, it has become important to perform a database search to determine whether the properties of the image are attributable to specific training data, every time before a generated image is used for professional purposes. Existing tools for this purpose focus on retrieving images of similar semantic content. Meanwhile, many artists are concerned with style replication in text-to-image models. We present a framework for understanding and extracting style descriptors from images. Our framework comprises a new dataset curated using the insight that style is a subjective property of an image that captures complex yet meaningful interactions of factors including but not limited to colors, textures, shapes, etc. We also propose a method to extract style descriptors that can be used to attribute style of a generated image to the images used in the training dataset of a text-to-image model. We showcase promising results in various style retrieval tasks. We also quantitatively and qualitatively analyze style attribution and matching in the Stable Diffusion model. Code and artifacts are available at https://github.com/learn2phoenix/CSD. This paper presents a new method for extracting style descriptors from images, enabling style-based image retrieval and analysis of style replication in text-to-image models like Stable Diffusion. As generative models become increasingly used, it's crucial to understand how they replicate style from training data, both for copyright concerns and for understanding the model's capabilities. The authors curate a new dataset, LAION-Styles, from LAION-Aesthetics, and train a Vision Transformer model with a combination of self-supervised and multi-label contrastive learning objectives tailored for style representation. The proposed model, CSD, outperforms existing style attribution models and pre-trained feature extractors on style-based image retrieval tasks across DomainNet, WikiArt, and LAION-Styles datasets. Analysis of Stable Diffusion reveals a correlation between prompt complexity and the degree of style copying, with more complex prompts leading to increased style replication. The model can be used to identify which artists' styles are more likely to be replicated by Stable Diffusion, and to explore how styles generalize to out-of-distribution content. The LAION-Styles dataset, while curated, still contains noise in the form of missing or incorrect tags. The evaluation assumes strict adherence of the generative model to the prompts, which may not always hold true. style representation, style retrieval, text-to-image generation, stable diffusion, style copying
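In practice a style descriptor like this is used for nearest-neighbour attribution: embed the generated image and the training gallery, then rank by cosine similarity. A minimal retrieval sketch, assuming precomputed feature tensors:

```python
import torch
import torch.nn.functional as F

def style_retrieval(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 5):
    # query_feat: (D,) style descriptor of a generated image;
    # gallery_feats: (N, D) descriptors of candidate training images.
    sims = F.normalize(gallery_feats, dim=-1) @ F.normalize(query_feat, dim=-1)
    scores, idx = sims.topk(top_k)                                 # most style-similar training images
    return idx.tolist(), scores.tolist()
```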
2404.01291 Report Evaluating Text-to-Visual Generation with Image-to-Text Generation Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2. This paper introduces VQAScore, a simple yet effective metric for evaluating text-to-visual generation models that surpasses current metrics and doesn't rely on expensive human feedback or proprietary models. Comprehensive and reliable evaluation of text-to-visual generative AI remains challenging due to a lack of effective metrics and standardized benchmarks, particularly for complex prompts involving compositions. VQAScore leverages visual question answering (VQA) by calculating the probability of a "Yes" answer to a question like "Does this figure show {text}?". It also introduces a new bidirectional VQA model, CLIP-FlanT5, and a challenging benchmark, GenAI-Bench, featuring compositional prompts and human ratings. VQAScore outperforms prior art on challenging compositional image-text matching benchmarks (Winoground and EqBen). VQAScore achieves state-of-the-art correlation with human judgments on alignment benchmarks. VQAScore can be extended to evaluate text-to-video and text-to-3D models by averaging scores across sampled frames or rendered views. VQAScore currently does not evaluate aspects like toxicity, bias, aesthetics, video motion, and 3D physics. Future work could fine-tune VQAScore with relevant data to address these limitations. generative ai, text-to-visual generation, evaluation metrics, vqascore, genai-bench
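The metric itself is a one-liner once a VQA model exposes answer-token logits; the model and tokenizer interfaces below are hypothetical stand-ins, not a specific library API:

```python
import torch

def vqa_score(model, tokenizer, image, text: str) -> float:
    # Hypothetical interfaces: `model(image=..., question=...)` returns logits over the
    # answer vocabulary, and `tokenizer.token_id(...)` maps a string to a token id.
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    logits = model(image=image, question=question)                 # shape: (vocab_size,)
    probs = torch.softmax(logits, dim=-1)
    return probs[tokenizer.token_id("Yes")].item()                 # VQAScore = P("Yes")
```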
2404.01284 Report Large Motion Model for Unified Multi-Modal Motion Generation Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jingfang Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, Ziwei Liu Human motion generation, a cornerstone technique in animation and video production, has widespread applications in various tasks like text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored for each task without scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Architecture: We design an articulated attention mechanism ArtAttention that incorporates body part-aware modeling into Diffusion Transformer backbone. 3) Pre-Training: We propose a novel pre-training strategy for LMM, which employs variable frame rates and masking forms, to better exploit knowledge from diverse training data. Extensive experiments demonstrate that our generalist LMM achieves competitive performance across various standard motion generation tasks over state-of-the-art specialist models. Notably, LMM exhibits strong generalization capabilities and emerging properties across many unseen tasks. Additionally, our ablation studies reveal valuable insights about training and scaling up large motion models for future research. This paper introduces LMM (Large Motion Model), a generalist, multi-modal framework that unifies various motion generation tasks into a single model, leveraging a comprehensive dataset called MotionVerse. Existing motion generation models are often specialist models limited by data quantity and domain, resulting in poor generalization. LMM aims to overcome these limitations by leveraging diverse motion data for broader generalization. The authors consolidate 16 motion datasets into MotionVerse, addressing inconsistencies in pose representation, keypoints, and frame rates. LMM, built on a transformer-based diffusion model with a novel attention mechanism (ArtAttention), is pretrained with random frame rates and masking techniques before fine-tuning on specific tasks. LMM achieves state-of-the-art results on text-to-motion generation tasks, outperforming specialist models in accuracy and fidelity. In motion prediction, LMM demonstrates superior performance, particularly in long-distance prediction, attributed to its robust motion prior learned from large-scale data. LMM shows competitive performance in music-to-dance generation, with significant advantages in diversity metrics, highlighting its ability to leverage multi-modal data. The current intermediate representation cannot handle missing individual keypoints within a body part, limiting its flexibility. The use of motion translators introduces noise, decreasing motion quality. Future work will focus on more flexible motion representation and modeling. motion generation, unified model, multi-modality, diffusion model, large motion model
2404.01247 Report An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, Graham Neubig Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image, and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess for cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Best pipelines can only translate 5% of images for some countries in the easier concept dataset and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data is released here: https://github.com/simran-khanuja/image-transcreation. This paper introduces the task of "image transcreation", aiming to culturally adapt images using machine learning for diverse audiences. With the rise of multimedia content, translating visual elements like images for cultural relevance is crucial alongside text, yet remains unaddressed. The authors build three pipelines using generative models: 1) direct instruction-based editing, 2) caption-edit-image edit, and 3) caption-edit-image retrieval. They also create a two-part evaluation dataset ("concept" and "application") with images from 7 countries. Image-editing models struggle to grasp cultural context, but improve with LLMs and retrieval methods. The best pipeline achieves only 5% successful translation for certain countries in the simpler "concept" dataset. No successful translations are found for some countries in the harder "application" dataset, highlighting the task's difficulty. Cultural categorization solely based on country is a limitation acknowledged by the authors. Limited language and country coverage due to resource constraints. image transcreation, cultural adaptation, multimodal translation, generative models, human evaluation
2404.01241 Report StructLDM: Structured Latent Diffusion for 3D Human Generation Tao Hu, Fangzhou Hong, Ziwei Liu Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore more expressive and higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model, which is learned from 2D images. StructLDM solves the challenges imposed due to the high-dimensional growth of latent space with three key designs: 1) A semantic structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts parameterized by a set of conditional structured local NeRFs anchored to the body template, which embeds the properties learned from the 2D training data and can be decoded to render view-consistent humans under different poses and clothing styles. 3) A structured latent diffusion model for generative human appearance sampling. Extensive experiments validate StructLDM's state-of-the-art generation performance and illustrate the expressiveness of the structured latent space over the well-adopted 1D latent space. Notably, StructLDM enables different levels of controllable 3D human generation and editing, including pose/view/shape control, and high-level tasks including compositional generations, part-aware clothing editing, 3D virtual try-on, etc. Our project page is at: https://taohuumd.github.io/projects/StructLDM/. This paper presents StructLDM, a novel diffusion-based 3D human generative model that utilizes a structured 2D latent space representing the human body surface. Existing 3D human generative methods employ limited 1D latent spaces, hindering controllability and realism. StructLDM addresses these limitations by leveraging a higher-dimensional, semantically meaningful representation. StructLDM employs a two-stage approach: 1) training a structured auto-decoder to embed human subjects into a 2D latent space aligned with a human body mesh, and 2) training a latent diffusion model in this structured space to facilitate diverse and realistic human generation. Achieves state-of-the-art generation quality on three datasets, outperforming existing 3D-aware GANs in terms of FID and user-study evaluations. Enables controllable generation by manipulating pose, view, and shape, as well as editing capabilities like compositional generation and part-aware modifications (e.g., 3D virtual try-on). Demonstrates the superiority of the structured 2D latent space over traditional 1D representations for capturing fine details and enabling local editing. Limited diversity due to reliance on training from scratch and the lack of a large-scale, accurate 3D human dataset. Challenges in learning from single-view images, though promising results are shown on the DeepFashion dataset. 3d human generation, latent diffusion model, structured latent representation, controllable generation, 3d virtual try-on
2404.01203 Report Video Interpolation with Diffusion Models Siddhant Jain, Daniel Watson, Eric Tabellion, Aleksander Hołyński, Ben Poole, Janne Kontkanen We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation, and demonstrate how such works fail in most settings where the underlying motion is complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the super-resolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated, requires less than a billion parameters per diffusion model to produce compelling results, and still enjoys scalability and improved quality at larger parameter counts. VIDIM, a cascaded diffusion model for video interpolation, generates high-quality videos between two input frames, particularly excelling in scenarios with complex, nonlinear, or ambiguous motion. Existing video interpolation methods struggle with complex or ambiguous motion. VIDIM addresses this limitation by leveraging the generative capabilities of diffusion models to produce plausible interpolations even in challenging cases. VIDIM uses a two-stage diffusion model: a base model generates low-resolution interpolating frames, and a super-resolution model enhances their resolution conditioned on the original high-resolution input frames. Both models share parameters across frames and employ classifier-free guidance for enhanced quality. VIDIM outperforms state-of-the-art methods in generative metrics (FID, FVD) on challenging datasets with large and ambiguous motions. Human evaluations strongly favor VIDIM for generating more realistic videos compared to baselines. Ablation studies confirm the importance of explicit frame conditioning and classifier-free guidance in achieving high-quality results. VIDIM currently operates at a fixed resolution and aspect ratio, limiting its flexibility. Future work includes exploring techniques for arbitrary aspect ratio generation and further enhancing the super-resolution model's quality. video interpolation, diffusion models, generative models, classifier-free guidance, deep learning
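A minimal sketch of the classifier-free guidance on the start/end frames described above; dropping the frame conditioning by zeroing it and the toy denoiser are assumptions, not VIDIM's actual implementation.

    import torch

    def cfg_denoise(model, x_t, t, start_frame, end_frame, w=2.0):
        # One guided prediction: query the model with and without frame conditioning,
        # then push the output toward the conditional branch.
        cond = model(x_t, t, start_frame, end_frame)
        uncond = model(x_t, t, torch.zeros_like(start_frame), torch.zeros_like(end_frame))
        return uncond + w * (cond - uncond)

    # Toy stand-in denoiser over 9 low-resolution frames.
    toy_model = lambda x, t, s, e: 0.1 * x + 0.01 * (s.mean() + e.mean())
    frames = torch.randn(1, 9, 3, 32, 32)
    start, end = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
    print(cfg_denoise(toy_model, frames, torch.tensor([10]), start, end).shape)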
2404.01197 Report Getting it Right: Improving Spatial Consistency in Text-to-Image Models Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. Secondly, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area. This paper introduces SPRIGHT, a spatially focused vision-language dataset aimed at improving spatial consistency in text-to-image models. The authors also propose an efficient fine-tuning method that optimizes model performance on spatial relationships. Current text-to-image models struggle to accurately represent spatial relationships described in text prompts. This work addresses this limitation by providing a high-quality dataset and an effective training strategy. The authors create SPRIGHT by re-captioning 6 million images from existing datasets with a focus on spatial relationships. They fine-tune Stable Diffusion models on SPRIGHT using a novel approach that prioritizes images with a high density of objects. SPRIGHT significantly improves the representation of spatial relationships compared to existing datasets. Fine-tuning on SPRIGHT leads to significant performance gains on spatial reasoning benchmarks (VISOR, T2I-CompBench) while also improving image fidelity metrics (FID, CMMD). An efficient training methodology utilizing images with many objects achieves state-of-the-art performance on T2I-CompBench Spatial Score. SPRIGHT, being a derived dataset, inherits potential limitations from the original datasets used for captioning. The accuracy of synthetic captions, while high, can be further improved with advanced prompting techniques and models. text-to-image synthesis, spatial reasoning, vision-language models, dataset creation, stable diffusion
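A sketch of the object-density selection behind the efficient fine-tuning subset; object_counts is assumed to come from dataset annotations or an off-the-shelf detector, and the threshold and budget are illustrative.

    def select_dense_images(image_ids, object_counts, budget=500, min_objects=10):
        # Keep the most object-dense images, up to the fine-tuning budget.
        ranked = sorted(zip(image_ids, object_counts), key=lambda p: p[1], reverse=True)
        return [img for img, n in ranked if n >= min_objects][:budget]

    print(select_dense_images(["a", "b", "c"], [3, 18, 11], budget=2))  # ['b', 'c']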
2404.01143 Report Condition-Aware Neural Network for Controlled Image Generation Han Cai, Muyang Li, Zhuoyang Zhang, Qinsheng Zhang, Ming-Yu Liu, Song Han We present Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. In parallel to prior conditional control methods, CAN controls the image generation process by dynamically manipulating the weight of the neural network. This is achieved by introducing a condition-aware weight generation module that generates conditional weight for convolution/linear layers based on the input condition. We test CAN on class-conditional image generation on ImageNet and text-to-image generation on COCO. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT. In particular, CAN combined with EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 while requiring 52x fewer MACs per sampling step. Introduces Condition-Aware Neural Network (CAN), a method for controlling image generation by dynamically manipulating neural network weights based on input conditions. Improves controllability and efficiency of image generative models, enabling them to better follow user instructions and be deployed on resource-constrained devices. Introduces a condition-aware weight generation module that generates conditional weights for convolution/linear layers based on input conditions, which are then fused with static weights during training and inference. Significantly improves image quality and controllability over baseline models on ImageNet and COCO datasets. Outperforms prior conditional control methods like adaptive normalization and attention-based methods. Enables development of CaT, a new family of efficient diffusion transformers that achieve state-of-the-art results with significantly lower computational cost. Current implementation incurs 30-40% training overhead compared to static models due to reliance on grouped convolution. Large-scale text-to-image generation and video generation applications are left for future work. controlled image generation, diffusion models, dynamic neural networks, weight generation networks, efficient deep learning
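A minimal sketch of condition-aware weight generation for a linear layer: a small hyper-network maps the condition embedding to a per-sample weight that is fused with the static weight. The additive fusion and shapes are assumptions, not the paper's exact module.

    import torch
    import torch.nn as nn

    class ConditionAwareLinear(nn.Module):
        def __init__(self, in_dim, out_dim, cond_dim):
            super().__init__()
            self.static = nn.Linear(in_dim, out_dim)                 # shared static weight
            self.weight_gen = nn.Linear(cond_dim, in_dim * out_dim)  # conditional weight generator
            self.in_dim, self.out_dim = in_dim, out_dim

        def forward(self, x, cond):
            # x: (B, in_dim), cond: (B, cond_dim); generate a per-sample weight and fuse.
            dyn_w = self.weight_gen(cond).view(-1, self.out_dim, self.in_dim)
            return self.static(x) + torch.einsum("boi,bi->bo", dyn_w, x)

    layer = ConditionAwareLinear(64, 128, cond_dim=32)
    print(layer(torch.randn(4, 64), torch.randn(4, 32)).shape)  # torch.Size([4, 128])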
2404.01133 Report CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians Yang Liu, He Guan, Chuanchen Luo, Lue Fan, Junran Peng, Zhaoxiang Zhang The advancement of real-time 3D scene reconstruction and novel view synthesis has been significantly propelled by 3D Gaussian Splatting (3DGS). However, effectively training large-scale 3DGS and rendering it in real-time across various scales remains challenging. This paper introduces CityGaussian (CityGS), which employs a novel divide-and-conquer training approach and Level-of-Detail (LoD) strategy for efficient large-scale 3DGS training and rendering. Specifically, the global scene prior and adaptive training data selection enable efficient training and seamless fusion. Based on fused Gaussian primitives, we generate different detail levels through compression, and realize fast rendering across various scales through the proposed block-wise detail levels selection and aggregation strategy. Extensive experimental results on large-scale scenes demonstrate that our approach attains state-of-the-art rendering quality, enabling consistent real-time rendering of large-scale scenes across vastly different scales. Our project page is available at https://dekuliutesla.github.io/citygs/. This paper introduces CityGaussian (CityGS), a novel method for real-time, high-quality rendering of large-scale scenes using 3D Gaussian Splatting (3DGS). It employs a divide-and-conquer training approach with a global scene prior and Level-of-Detail (LoD) for efficient rendering across different scales. Effectively training large-scale 3DGS models and rendering them in real-time across various scales is challenging due to high memory and computational demands. This paper addresses these limitations. CityGS divides the scene into blocks, each trained in parallel with a global Gaussian prior for consistent fusion. It compresses Gaussians into different detail levels and uses a block-wise LoD strategy for efficient rendering. CityGS achieves state-of-the-art rendering quality on large-scale scenes, outperforming NeRF-based methods in SSIM, PSNR, and LPIPS. The proposed LoD strategy enables real-time rendering even under drastically different scales with minimal quality loss. CityGS allows for efficient scene manipulation due to its explicit representation of the scene. The assumption of a static scene limits the generalization ability of the current method. Future work includes exploring the application of CityGS in dynamic scenes and improving performance with drastically different training views (e.g., aerial and street views). 3d scene reconstruction, novel view synthesis, 3d gaussian splatting, level of detail, large-scale scene rendering
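An illustrative block-wise LoD selection rule: blocks farther from the camera use a coarser, more compressed set of Gaussians. The distance thresholds are placeholders; the paper's actual selection/aggregation strategy is more involved.

    import numpy as np

    def select_lod(block_centers, cam_pos, thresholds=(20.0, 60.0, 150.0)):
        # Returns one LoD index per block: 0 = finest, len(thresholds) = coarsest.
        dists = np.linalg.norm(block_centers - cam_pos, axis=1)
        return np.searchsorted(np.asarray(thresholds), dists)

    centers = np.array([[0.0, 0.0, 5.0], [0.0, 0.0, 40.0], [0.0, 0.0, 200.0]])
    print(select_lod(centers, np.zeros(3)))  # [0 1 3]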
2404.01089 Report Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, Xiangmin Xu Image-based virtual try-on is an increasingly important task for online shopping. It aims to synthesize images of a specific person wearing a specified garment. Diffusion model-based approaches have recently become popular, as they are excellent at image synthesis tasks. However, these approaches usually employ additional image encoders and rely on the cross-attention mechanism for texture transfer from the garment to the person image, which affects the try-on's efficiency and fidelity. To address these issues, we propose a Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the fidelity of the results and introduces no additional image encoders. Accordingly, we make contributions from two aspects. First, we propose to concatenate the masked person and reference garment images along the spatial dimension and utilize the resulting image as the input for the diffusion model's denoising UNet. This enables the original self-attention layers contained in the diffusion model to achieve efficient and accurate texture transfer. Second, we propose a novel diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images, further enhancing the reliability of the try-on results. In addition, we integrate mask prediction and image synthesis into a single compact model. The experimental results show that our approach can be applied to various try-on tasks, e.g., garment-to-person and person-to-person try-ons, and significantly outperforms state-of-the-art methods on the popular VITON and VITON-HD databases. This paper proposes Texture-Preserving Diffusion (TPD), a novel diffusion-based virtual try-on model that enhances fidelity without additional image encoders. Virtual try-on is important for online shopping, but existing methods struggle with fidelity, especially for garments with complex textures and challenging poses. TPD introduces two key components: (1) Self-Attention-based Texture Transfer (SATT) concatenates masked person and garment images spatially, leveraging inherent self-attention in diffusion models for efficient texture transfer. (2) Decoupled Mask Prediction (DMP) iteratively predicts a precise inpainting mask based on both person and garment images, preserving details. TPD generates high-quality try-on images with fewer artifacts, especially for complex textures. DMP effectively preserves body details, such as arms or tattoos, by minimizing the removal of irrelevant information. Quantitative evaluations show TPD consistently outperforms state-of-the-art methods on VITON and VITON-HD datasets. The model's performance on images with complex backgrounds, as opposed to single-color backgrounds prevalent in datasets, needs further exploration. Future work includes extending TPD to handle multi-garment try-on scenarios. virtual try-on, diffusion models, image synthesis, self-attention, inpainting
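A sketch of the spatial concatenation behind SATT: the masked person and garment inputs are stacked along one spatial axis so the denoising UNet's existing self-attention can transfer texture between the two halves. Stacking along height and operating on VAE latents are assumptions.

    import torch

    def build_satt_input(masked_person, garment):
        # masked_person, garment: (B, C, H, W) latents of the same resolution.
        assert masked_person.shape == garment.shape
        return torch.cat([masked_person, garment], dim=2)  # (B, C, 2H, W)

    person = torch.randn(1, 4, 64, 48)
    cloth = torch.randn(1, 4, 64, 48)
    print(build_satt_input(person, cloth).shape)  # torch.Size([1, 4, 128, 48])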
2404.00987 Report FlexiDreamer: Single Image-to-3D Generation with FlexiCubes Ruowen Zhao, Zhengyi Wang, Yikai Wang, Zihan Zhou, Jun Zhu 3D content generation from text prompts or single images has made remarkable progress in quality and speed recently. One of its dominant paradigms involves generating consistent multi-view images followed by a sparse-view reconstruction. However, due to the challenge of directly deforming the mesh representation to approach the target topology, most methodologies learn an implicit representation (such as NeRF) during the sparse-view reconstruction and acquire the target mesh by a post-processing extraction. Although the implicit representation can effectively model rich 3D information, its training typically entails a long convergence time. In addition, the post-extraction operation from the implicit field also leads to undesirable visual artifacts. In this paper, we propose FlexiDreamer, a novel single image-to-3d generation framework that reconstructs the target mesh in an end-to-end manner. By leveraging a flexible gradient-based extraction known as FlexiCubes, our method circumvents the defects brought by the post-processing and facilitates a direct acquisition of the target mesh. Furthermore, we incorporate a multi-resolution hash grid encoding scheme that progressively activates the encoding levels into the implicit field in FlexiCubes to help capture geometric details for per-step optimization. Notably, FlexiDreamer recovers a dense 3D structure from a single-view image in approximately 1 minute on a single NVIDIA A100 GPU, outperforming previous methodologies by a large margin. FlexiDreamer is a novel single image-to-3D generation framework that reconstructs the target mesh in an end-to-end manner by leveraging FlexiCubes for a direct acquisition of the target mesh, bypassing the need for post-processing steps common in NeRF-based methods. Existing methods for 3D content generation from single images often rely on implicit representations like NeRF, leading to long training times and potential artifacts during post-processing extraction of the mesh. FlexiDreamer addresses these limitations by directly generating the target mesh in an end-to-end fashion. FlexiDreamer uses a pre-trained diffusion model to generate multi-view RGB and normal images from a single input image. Then, it employs FlexiCubes, a flexible gradient-based surface extraction method, to extract an explicit mesh from a signed distance field encoded via a multi-resolution hash grid network. A texture neural field is also integrated to learn mesh surface texture. The entire framework is trained end-to-end using reconstruction losses from the rendered images. FlexiDreamer recovers dense 3D structures from single-view images in approximately 1 minute, significantly faster than previous methods. It generates high-quality textured meshes with sharper geometric details and more distinct textures compared to baselines. The end-to-end pipeline avoids artifacts often introduced during post-processing extraction in NeRF-based approaches. The quality of generated 3D assets depends heavily on the quality of multi-view images, which can be limited by the capabilities of current multi-view diffusion models. Limited perspectives of input images can hinder the accurate reconstruction of objects with complex geometries. 3d generation, diffusion models, flexicubes, single image-to-3d, sparse-view reconstruction
2404.00931 Report GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields Yunsong Wang, Hanlin Chen, Gim Hee Lee Recent advancements in vision-language foundation models have significantly enhanced open-vocabulary 3D scene understanding. However, the generalizability of existing methods is constrained due to their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. We aggregate the geometry-aware features using a cost volume, and propose a Multi-view Joint Fusion module to aggregate multi-view features through a cross-view attention mechanism, which effectively predicts view-specific blending weights for both colors and open-vocabulary features. Remarkably, our GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation, eliminating the need for ground truth semantic labels or depth priors, and effectively generalize across scenes and datasets without fine-tuning. Introduces GOV-NeSF, a novel generalizable open-vocabulary neural semantic field for 3D scenes, enabling open-vocabulary semantic segmentation in both 2D and 3D without requiring 3D data, depth priors, or explicit semantic labels during training. Addresses the limitations of existing open-vocabulary 3D scene understanding methods that suffer from constrained generalizability due to framework design and reliance on 3D data. Leverages a cost volume for geometry-aware feature extraction and proposes a Multi-view Joint Fusion module to blend colors and open-vocabulary features from multi-view images using cross-view attention, trained with supervision from novel views. Achieves state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation on ScanNet and Replica datasets. Demonstrates significant improvements over existing methods when ground truth depth maps are unavailable, effectively learning occlusion reasoning implicitly. Exhibits strong generalizability, successfully transferring to unseen scenes and datasets without fine-tuning. Rendering quality of color images can be blurry compared to methods using depth priors due to the focus on room-scale representation without depth information. Depth-guided masking, while improving 3D segmentation, can negatively impact 2D segmentation performance by creating empty holes in rendered images. open-vocabulary learning, semantic segmentation, neural radiance fields, 3d scene understanding, generalizable vision
2404.00891 Report Marrying NeRF with Feature Matching for One-step Pose Estimation Ronghan Chen, Yang Cong, Yu Ren Given the image collection of an object, we aim at building a real-time image-based pose estimation method, which requires neither its CAD model nor hours of object-specific training. Recent NeRF-based methods provide a promising solution by directly optimizing the pose from pixel loss between rendered and target images. However, during inference, they require a long convergence time, and suffer from local minima, making them impractical for real-time robot applications. We aim at solving this problem by marrying image matching with NeRF. With 2D matches and depth rendered by NeRF, we directly solve the pose in one step by building 2D-3D correspondences between target and initial view, thus allowing for real-time prediction. Moreover, to improve the accuracy of 2D-3D correspondences, we propose a 3D consistent point mining strategy, which effectively discards unfaithful points reconstructed by NeRF. Moreover, current NeRF-based methods naively optimizing pixel loss fail at occluded images. Thus, we further propose a 2D match-based sampling strategy to preclude the occluded area. Experimental results on representative datasets prove that our method outperforms state-of-the-art methods, and improves inference efficiency by 90x, achieving real-time prediction at 6 FPS. This paper introduces a novel NeRF-based pose estimation method that leverages image matching for real-time, CAD-model-free pose estimation of novel objects. Existing NeRF-based pose estimation techniques suffer from slow convergence and are prone to local minima, making them impractical for real-time applications. The method uses a pre-trained NeRF model to render depth information and combines it with 2D feature matches to create 2D-3D correspondences. This allows for direct pose solving using PnP in a single step. Additionally, a 3D consistent point mining strategy is employed to enhance the accuracy of the correspondences by filtering out unreliable points. A keypoint-guided sampling strategy is also introduced to address occlusion challenges during pose refinement. The proposed method achieves state-of-the-art pose estimation accuracy on both synthetic and real-world datasets. It significantly improves inference efficiency by 90x compared to previous NeRF-based methods, enabling real-time prediction at 6 FPS. The method exhibits strong robustness to occlusion, outperforming existing techniques. The method's performance relies on the accuracy of the employed image matcher. Future work could explore extending the approach to handle object scales and incorporate it into robot manipulation or neural field-based SLAM tasks. pose estimation, neural radiance fields (nerf), image matching, 3d consistent point mining, occlusion handling
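A sketch of the one-step pose solve: 2D matches in the initial view are lifted to 3D with NeRF-rendered depth and the target-view pose is recovered with PnP. OpenCV's solvePnPRansac is used here; the feature matcher, intrinsics K, and the paper's 3D consistent point mining are assumed to be provided or applied elsewhere.

    import numpy as np
    import cv2

    def solve_pose(matches_init, matches_tgt, depth_init, K):
        # matches_*: (N, 2) pixel coordinates; depth_init: (H, W) depth rendered by NeRF.
        u = matches_init[:, 0].astype(int)
        v = matches_init[:, 1].astype(int)
        z = depth_init[v, u]
        # Back-project initial-view keypoints to 3D camera coordinates.
        x = (u - K[0, 2]) * z / K[0, 0]
        y = (v - K[1, 2]) * z / K[1, 1]
        pts3d = np.stack([x, y, z], axis=1).astype(np.float64)
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts3d, matches_tgt.astype(np.float64), K.astype(np.float64), None)
        return rvec, tvec  # relative rotation/translation of the target view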
2404.00879 Report Model-Agnostic Human Preference Inversion in Diffusion Models Jeeyung Kim, Ze Wang, Qiang Qiu Efficient text-to-image generation remains a challenging task due to the high computational costs associated with the multi-step sampling in diffusion models. Although distillation of pre-trained diffusion models has been successful in reducing sampling steps, low-step image generation often falls short in terms of quality. In this study, we propose a novel sampling design to achieve high-quality one-step image generation aligning with human preferences, particularly focusing on exploring the impact of the prior noise distribution. Our approach, Prompt Adaptive Human Preference Inversion (PAHI), optimizes the noise distributions for each prompt based on human preferences without the need for fine-tuning diffusion models. Our experiments showcase that the tailored noise distributions significantly improve image quality with only a marginal increase in computational cost. Our findings underscore the importance of noise optimization and pave the way for efficient and high-quality text-to-image synthesis. Proposed PAHI, a novel sampling design that optimizes noise distributions for one-step text-to-image generation, aligning with human preferences without fine-tuning diffusion models. Efficient text-to-image generation is crucial, but low-step image generation often lacks quality. This work addresses the need for high-quality, efficient synthesis by exploring the impact of prior noise distribution in one-step generation. Leveraged a distilled diffusion model as the generator and a scoring model (PickScore) to assess image quality based on human preferences. Optimized the noise distribution parameters by minimizing an objective function that maximizes the scores, employing a lightweight noise-predicting model to tailor noise distributions for individual prompts. PAHI significantly outperforms standard Gaussian noise in one-step generation, achieving a win rate of 94.0% based on PickScore. The prompt-adaptive approach (PAHI) shows superior performance (94.0% win rate) compared to a single optimized noise distribution across all prompts (64.7% win rate). PAHI achieves higher quality images (based on PickScore and ImageReward) compared to one-step and two-step generation with standard Gaussian noise, while only adding a marginal increase in inference time. The study primarily focuses on one-step generation, and further investigation is needed for multi-step scenarios. Exploration of alternative noise distributions beyond Gaussian could be beneficial. text-to-image generation, diffusion models, one-step sampling, noise optimization, human preferences
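A minimal sketch of prompt-adaptive noise optimization: a per-prompt mean and scale of the prior noise are tuned so a frozen one-step generator scores higher under a preference model. The generator and scorer below are toy stand-ins for the distilled diffusion model and PickScore, and the Gaussian parameterization is an assumption.

    import torch

    def optimize_noise(generator, scorer, prompt, shape, steps=50, lr=0.01):
        mu = torch.zeros(shape, requires_grad=True)
        log_sigma = torch.zeros(shape, requires_grad=True)
        opt = torch.optim.Adam([mu, log_sigma], lr=lr)
        for _ in range(steps):
            z = mu + log_sigma.exp() * torch.randn(shape)        # reparameterized noise sample
            loss = -scorer(generator(z, prompt), prompt).mean()  # maximize preference score
            opt.zero_grad()
            loss.backward()
            opt.step()
        return mu.detach(), log_sigma.exp().detach()

    gen = lambda z, p: torch.tanh(z)                                # toy one-step generator
    score = lambda img, p: -(img - 0.5).pow(2).mean(dim=(1, 2, 3))  # toy preference scorer
    mu, sigma = optimize_noise(gen, score, "a photo of a cat", (1, 3, 8, 8))
    print(mu.shape, sigma.shape)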
2404.00878 Report TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the widespread applications. In this work, we propose an effective and efficient framework, termed TryOn-Adapter. Specifically, we first decouple clothing identity into fine-grained factors: style for color and category information, texture for high-frequency details, and structure for smooth spatial adaptive transformation. Our approach utilizes a pre-trained exemplar-based diffusion model as the fundamental network, whose parameters are frozen except for the attention layers. We then customize three lightweight modules (Style Preserving, Texture Highlighting, and Structure Adapting) incorporated with fine-tuning techniques to enable precise and efficient identity control. Meanwhile, we introduce the training-free T-RePaint strategy to further enhance clothing identity preservation while maintaining the realistic try-on effect during the inference. Our experiments demonstrate that our approach achieves state-of-the-art performance on two widely-used benchmarks. Additionally, compared with recent full-tuning diffusion-based methods, we only use about half of their tunable parameters during training. The code will be made publicly available at https://github.com/jiazheng-xing/TryOn-Adapter. This paper proposes TryOn-Adapter, an efficient framework for virtual try-on that decouples clothing identity into fine-grained factors for enhanced controllability and training efficiency. Existing diffusion-based virtual try-on methods struggle to maintain clothing identity and are computationally expensive to train. The paper uses a pre-trained diffusion model with frozen parameters, except attention layers. It then integrates three lightweight modules: Style Preserving, Texture Highlighting, and Structure Adapting. A training-free T-RePaint strategy further enhances identity preservation during inference. An Enhanced Latent Blending Module is used to enhance the visual quality of the generated image. Achieves state-of-the-art performance on VITON-HD and Dresscode datasets. Significantly reduces trainable parameters compared to full fine-tuning methods. Demonstrates superior preservation of garment style, texture, and structure. The method is limited by the existing datasets, which hinders widespread practical application. Lack of targeted quantitative evaluation metrics for virtual try-on tasks. virtual try-on, diffusion models, identity preservation, parameter efficient fine-tuning, generative adversarial networks
2404.00874 Report DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF Jie Long Lee, Chen Li, Gim Hee Lee We present DiSR-NeRF, a diffusion-guided framework for view-consistent super-resolution (SR) NeRF. Unlike prior works, we circumvent the requirement for high-resolution (HR) reference images by leveraging existing powerful 2D super-resolution models. Nonetheless, independent SR 2D images are often inconsistent across different views. We thus propose Iterative 3D Synchronization (I3DS) to mitigate the inconsistency problem via the inherent multi-view consistency property of NeRF. Specifically, our I3DS alternates between upscaling low-resolution (LR) rendered images with diffusion models, and updating the underlying 3D representation with standard NeRF training. We further introduce Renoised Score Distillation (RSD), a novel score-distillation objective for 2D image resolution. Our RSD combines features from ancestral sampling and Score Distillation Sampling (SDS) to generate sharp images that are also LR-consistent. Qualitative and quantitative results on both synthetic and real-world datasets demonstrate that our DiSR-NeRF can achieve better results on NeRF super-resolution compared with existing works. Code and video results available at the project website. This paper proposes DiSR-NeRF, a diffusion-guided framework for view-consistent super-resolution (SR) of Neural Radiance Fields (NeRFs) that enhances the resolution of NeRFs trained on low-resolution images without requiring high-resolution reference images. Super-resolution NeRFs have practical applications in scenarios where high-resolution multi-view images are unavailable (e.g., drones, CCTVs) but existing methods require high-resolution references or datasets, which are often costly or impractical to obtain. DiSR-NeRF leverages pre-trained 2D super-resolution diffusion models and introduces two key components: 1) Iterative 3D Synchronization (I3DS) to address cross-view inconsistency by alternating between upscaling rendered low-resolution images and refining the 3D representation. 2) Renoised Score Distillation (RSD) to generate sharp and consistent super-resolution images by optimizing denoised latents within an ancestral sampling trajectory. DiSR-NeRF generates sharper and more detailed super-resolution NeRFs compared to existing methods, as demonstrated on synthetic and real-world datasets. RSD effectively produces high-resolution details while maintaining consistency with the original low-resolution input, outperforming both ancestral sampling and Score Distillation Sampling (SDS). I3DS significantly improves view consistency in super-resolution NeRFs compared to using only SDS optimization. The upscaling factor is limited by the specific 2D super-resolution diffusion model used (4x in this case). Future work can explore cascaded diffusion models for higher upscaling factors. neural radiance fields, nerf, super-resolution, diffusion models, view synthesis
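A high-level control-flow sketch of Iterative 3D Synchronization: each round upscales low-resolution renders with the 2D SR diffusion model (via the RSD objective) and then fits the NeRF to the upscaled views. All callables are placeholders for the actual components.

    def i3ds(nerf, cameras, sr_upscale, fit_nerf, rounds=5, iters_per_round=1000):
        for _ in range(rounds):
            lr_renders = [nerf.render(cam) for cam in cameras]           # render current LR views
            hr_targets = [sr_upscale(img) for img in lr_renders]         # per-view 2D super-resolution
            fit_nerf(nerf, cameras, hr_targets, iters=iters_per_round)   # synchronize in 3D
        return nerf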
2404.00661 Report DeeDSR: Towards Real-World Image Super-Resolution via Degradation-Aware Stable Diffusion Chunyang Bi, Xin Luo, Sheng Shen, Mengxi Zhang, Huanjing Yue, Jingyu Yang Diffusion models, known for their powerful generative capabilities, play a crucial role in addressing real-world super-resolution challenges. However, these models often focus on improving local textures while neglecting the impacts of global degradation, which can significantly reduce semantic fidelity and lead to inaccurate reconstructions and suboptimal super-resolution performance. To address this issue, we introduce a novel two-stage, degradation-aware framework that enhances the diffusion model's ability to recognize content and degradation in low-resolution images. In the first stage, we employ unsupervised contrastive learning to obtain representations of image degradations. In the second stage, we integrate a degradation-aware module into a simplified ControlNet, enabling flexible adaptation to various degradations based on the learned representations. Furthermore, we decompose the degradation-aware features into global semantics and local details branches, which are then injected into the diffusion denoising module to modulate the target generation. Our method effectively recovers semantically precise and photorealistic details, particularly under significant degradation conditions, demonstrating state-of-the-art performance across various benchmarks. Codes will be released at https://github.com/bichunyang419/DeeDSR. Introduces DeeDSR, a novel two-stage degradation-aware framework for real-world image super-resolution that enhances the generative capabilities of pre-trained text-to-image diffusion models by leveraging image prompts to represent global degradation. Addresses limitations in existing diffusion-based super-resolution models that neglect the impact of global degradation, leading to inaccurate reconstructions and reduced semantic fidelity, especially under severe degradation conditions. Employs unsupervised contrastive learning in the first stage to learn representations of image degradations. Integrates a degradation-aware module into a simplified ControlNet in the second stage to adapt to various degradations based on learned representations. Decomposes degradation-aware features into global and local branches, injecting them into the diffusion denoising module for modulated target generation. DeeDSR effectively recovers semantically accurate details, particularly under significant degradation, outperforming existing methods on benchmark datasets. Quantitative evaluations show superior performance in perceptual metrics, including CLIPIQA and MANIQA, indicating high image generation quality and fidelity. Ablation studies confirm the effectiveness of the degradation learner, global and local representation branches, and the proposed noise guidance strategy for balancing realism and fidelity. The model exhibits slightly slower inference speed compared to some diffusion-based methods due to the additional stage for estimating degradations. Future work could explore incorporating additional priors or optimization techniques to further improve the efficiency of the proposed framework. image super-resolution, diffusion models, degradation awareness, contrastive learning, controlnet
2404.00648 Report SpiralMLP: A Lightweight Vision MLP Architecture Haojie Mu, Burhan Ul Tayyab, Nicholas Chua We present SpiralMLP, a novel architecture that introduces a Spiral FC layer as a replacement for the conventional Token Mixing approach. Differing from several existing MLP-based models that primarily emphasize axes, our Spiral FC layer is designed as a deformable convolution layer with spiral-like offsets. We further adapt Spiral FC into two variants: Self-Spiral FC and Cross-Spiral FC, which enable both local and global feature integration seamlessly, eliminating the need for additional processing steps. To thoroughly investigate the effectiveness of the spiral-like offsets and validate our design, we conduct ablation studies and explore optimal configurations. In empirical tests, SpiralMLP reaches state-of-the-art performance, similar to Transformers, CNNs, and other MLPs, benchmarking on ImageNet-1k, COCO and ADE20K. SpiralMLP still maintains linear computational complexity O(HW) and is compatible with varying input image resolutions. Our study reveals that targeting the full receptive field is not essential for achieving high performance, instead, adopting a refined approach offers better results. Proposes SpiralMLP, a lightweight vision architecture using a novel Spiral Fully-Connected (Spiral FC) layer to replace traditional Token Mixing in MLP-based models. Aims to address limitations of existing MLPs, such as quadratic computational complexity and fixed input size, while improving spatial information integration for better performance. Introduces Spiral FC, inspired by deformable convolution and spiral patterns observed in attention visualizations, using spiral-like offsets to capture local and global features with linear complexity. Achieves state-of-the-art accuracy on ImageNet-1k, surpassing comparable MLPs and remaining competitive with Transformers and CNNs. Demonstrates strong performance in object detection, instance segmentation (COCO), and semantic segmentation (ADE20K) tasks. Exhibits faster inference latency compared to other MLPs of similar model size. Discrete hyperparameter optimization leaves room for further exploration of optimal configurations. Future work includes investigating a dynamic version of Spiral FC for enhanced adaptability and efficiency. mlp, lightweight vision model, spiral fully-connected layer, deformable convolution, spatial information integration
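A sketch of spiral-like sampling offsets for a deformable layer, following the description of Spiral FC; the Archimedean parameterization, point count, and radius are illustrative rather than the paper's exact configuration. The resulting (dy, dx) pairs could be fed to a deformable convolution such as torchvision.ops.deform_conv2d.

    import math
    import torch

    def spiral_offsets(num_points=9, turns=1.5, max_radius=3.0):
        # K points along an Archimedean spiral around the kernel center, as (dy, dx) pairs.
        t = torch.linspace(0.0, 1.0, num_points)
        theta = 2.0 * math.pi * turns * t
        r = max_radius * t
        return torch.stack([r * torch.sin(theta), r * torch.cos(theta)], dim=1)  # (K, 2)

    print(spiral_offsets())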
2404.00485 Report DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans Akash Sengupta, Thiemo Alldieck, Nikos Kolotouros, Enric Corona, Andrei Zanfir, Cristian Sminchisescu We present DiffHuman, a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem, most methods are deterministic and output a single solution, often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions. In contrast, DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image, which allows us to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. During inference, we may sample 3D avatars by iteratively denoising 2D renders of the predicted 3D representation. Furthermore, we introduce a generator neural network that approximates rendering with considerably reduced runtime (55x speed up), resulting in a novel dual-branch diffusion framework. Our experiments show that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image, while remaining competitive with the state-of-the-art when reconstructing visible surfaces. Presents DiffHuman, a probabilistic method for photorealistic 3D human reconstruction from a single RGB image using a conditional diffusion model that predicts a distribution over 3D reconstructions. Addresses the limitations of deterministic methods that output a single solution, often lacking detail and blurriness in unseen regions, by predicting a probability distribution over plausible 3D human reconstructions. Implements a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation, and introduces a generator network to approximate rendering for faster inference. Produces diverse and detailed reconstructions for unseen or uncertain regions, such as the back of a person. Remains competitive with state-of-the-art methods in reconstructing visible surfaces. Offers a significant speed-up in inference time compared to diffusion-via-rendering approaches. Currently requires training data with known 3D geometry, limiting the amount of usable data. Future work aims to leverage data with partial 2D and 2.5D supervision to overcome training data limitations. 3d human reconstruction, diffusion models, probabilistic modeling, implicit surfaces, photorealistic rendering
2404.00409 Report 3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting Xiaoyang Lyu, Yang-Tian Sun, Yi-Hua Huang, Xiuzhe Wu, Ziyi Yang, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi In this paper, we present an implicit surface reconstruction method with 3D Gaussian Splatting (3DGS), namely 3DGSR, that allows for accurate 3D reconstruction with intricate details while inheriting the high efficiency and rendering quality of 3DGS. The key insight is incorporating an implicit signed distance field (SDF) within 3D Gaussians to enable them to be aligned and jointly optimized. First, we introduce a differentiable SDF-to-opacity transformation function that converts SDF values into corresponding Gaussians' opacities. This function connects the SDF and 3D Gaussians, allowing for unified optimization and enforcing surface constraints on the 3D Gaussians. During learning, optimizing the 3D Gaussians provides supervisory signals for SDF learning, enabling the reconstruction of intricate details. However, this only provides sparse supervisory signals to the SDF at locations occupied by Gaussians, which is insufficient for learning a continuous SDF. Then, to address this limitation, we incorporate volumetric rendering and align the rendered geometric attributes (depth, normal) with those derived from 3D Gaussians. This consistency regularization introduces supervisory signals to locations not covered by discrete 3D Gaussians, effectively eliminating redundant surfaces outside the Gaussian sampling range. Our extensive experimental results demonstrate that our 3DGSR method enables high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS. Besides, our method competes favorably with leading surface reconstruction techniques while offering a more efficient learning process and much better rendering qualities. The code will be available at https://github.com/CVMI-Lab/3DGSR. Presents 3DGSR, a novel implicit surface reconstruction method leveraging 3D Gaussian Splatting (3DGS) to achieve accurate 3D reconstructions with intricate details while retaining the high efficiency and rendering quality of 3DGS. Addresses the limitations of 3DGS in faithfully representing 3D surfaces due to its unstructured point-based geometry representation by incorporating a neural implicit signed distance field (SDF) within Gaussians for geometry modeling. Introduces a differentiable SDF-to-opacity transformation function to connect SDF and Gaussians, enabling joint optimization and enforcing surface constraints. Incorporates volumetric rendering and aligns rendered geometric attributes (depth, normal) with those derived from 3D Gaussians, providing regularization to locations not covered by Gaussians and eliminating redundant surfaces. Achieves high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS. Outperforms leading surface reconstruction techniques on various datasets in terms of rendering quality and reconstruction accuracy. Offers a more efficient learning process and superior rendering qualities compared to existing methods. Trade-off between rendering quality and surface smoothness: high-quality rendering may lead to compromised surface smoothness in cases with complex textures. Potential limitations in handling scenes with extreme view changes or severe occlusions, as the method relies on multi-view consistency. 
3d gaussian splatting, implicit surface reconstruction, signed distance function, volumetric rendering, novel view synthesis
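An illustrative differentiable SDF-to-opacity mapping in the spirit of the transformation described above: opacity peaks at the zero level set and decays with distance from the surface. The Laplace-style kernel and bandwidth are assumptions; the paper defines its own function.

    import torch

    def sdf_to_opacity(sdf, beta=0.05):
        # Opacity is ~1 where |SDF| is near zero (on the surface) and decays away from it.
        return torch.exp(-sdf.abs() / beta)

    sdf_vals = torch.tensor([-0.2, -0.01, 0.0, 0.01, 0.2])
    print(sdf_to_opacity(sdf_vals))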
2404.00384 Report TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures the unbiased image-text alignment of the CLIP-based models using only image-text pairs without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD. This paper identifies and addresses the "single tag bias" in CLIP-based models, where the models overly focus on a single tag in image-text relationships. Addressing this bias is crucial for improving the accuracy and reliability of CLIP-based models in downstream tasks like multi-tag classification and segmentation. The paper proposes Text-Tag Self-Distillation (TTD), a two-step fine-tuning approach: 1) selecting image-relevant tags from text based on pixel-tag similarity and 2) using these tags to guide the model towards a more holistic understanding of the image-text relationship. TTD effectively mitigates single tag bias, leading to improved performance in multi-tag selection compared to methods relying on external NLP models. Fine-tuning with TTD enhances text-level segmentation performance, as demonstrated by higher CaptionIoU scores and reduced false positive/negative rates. TTD boosts open-vocabulary semantic segmentation performance, achieving competitive results on benchmarks like Pascal VOC and COCO-Object. The performance difference with some methods on datasets with a large number of classes suggests potential improvements in incorporating richer tag information during fine-tuning. Future work could investigate the underlying causes of single tag bias in CLIP's training process. image-text alignment, clip, self-distillation, open-vocabulary segmentation, multi-tag classification
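A sketch of the tag-selection step: a tag is kept as image-relevant if its embedding is sufficiently close to at least one pixel embedding. CLIP-style features already projected into a shared space are assumed, and the threshold is arbitrary.

    import torch
    import torch.nn.functional as F

    def select_relevant_tags(pixel_feats, tag_feats, tags, thresh=0.25):
        # pixel_feats: (N_pix, D); tag_feats: (N_tag, D); keep tags matched by some pixel.
        p = F.normalize(pixel_feats, dim=-1)
        t = F.normalize(tag_feats, dim=-1)
        max_sim = (p @ t.T).max(dim=0).values   # best-matching pixel per tag
        return [tag for tag, s in zip(tags, max_sim) if s > thresh]

    pix, tg = torch.randn(100, 512), torch.randn(3, 512)
    print(select_relevant_tags(pix, tg, ["dog", "frisbee", "beach"]))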
2404.00358 Report Spread Your Wings: A Radial Strip Transformer for Image Deblurring Duosheng Chen, Shihao Zhou, Jinshan Pan, Jinglei Shi, Lishen Qu, Jufeng Yang Exploring motion information is important for the motion deblurring task. Recently, window-based transformer approaches have achieved decent performance in image deblurring. Note that the motion causing blurry results is usually composed of translation and rotation movements, and the window-shift operation in the Cartesian coordinate system used by window-based transformer approaches only directly explores translation motion in orthogonal directions. Thus, these methods are limited in modeling the rotation component. To alleviate this problem, we introduce a polar coordinate-based transformer, which uses angles and distances to explore rotation and translation information together. In this paper, we propose a Radial Strip Transformer (RST), which is a transformer-based architecture that restores blurred images in a polar coordinate system instead of a Cartesian one. RST contains a dynamic radial embedding module (DRE) to extract shallow features with a radial deformable convolution. We design a polar mask layer to generate the offsets for the deformable convolution, which can reshape the convolution kernel along the radius to better capture the rotation motion information. Furthermore, we propose a radial strip attention solver (RSAS) for deep feature extraction, where the relationship of windows is organized by azimuth and radius. This attention module contains radial strip windows to reweight image features in polar coordinates, preserving more useful rotation and translation information for recovering sharp images. Experimental results on six synthetic and real-world datasets show that our method performs favorably against other SOTA methods for the image deblurring task. This paper proposes Radial Strip Transformer (RST), an efficient polar coordinate-based transformer architecture for image deblurring, addressing the limitations of Cartesian coordinate systems in modeling rotation motion blur. Existing window-based transformer deblurring methods struggle to effectively model rotation motion blur due to their reliance on the Cartesian coordinate system. RST overcomes this limitation by operating in the polar coordinate system, enabling it to better capture both translation and rotation motion information for improved deblurring performance. RST employs a dynamic radial embedding (DRE) module for extracting shallow features using a polar mask and deformable convolution. This is followed by a radial strip attention solver (RSAS) with strip windows along the radius and angular relative position encoding for deep feature extraction. The architecture follows an asymmetric encoder-decoder design, with RSAS applied only in the decoder for efficiency. RST outperforms state-of-the-art methods on five synthetic and real-world datasets (GoPro, HIDE, RealBlur, REDS, RSBlur), demonstrating its superior deblurring capability. The proposed DRE and RSAS modules contribute significantly to RST's performance, highlighting their effectiveness in capturing motion information. RST achieves a favorable balance between computational efficiency and deblurring performance, exhibiting lower or comparable complexity compared to existing methods. Limited cross-window interactions due to the use of radial strip windows. Reduced deblurring capacity for heavy blur in complex real-world scenarios. 
image deblurring, transformer, motion information, polar coordinate system, deformable convolution
2404.00345 Report MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text Takayuki Hara, Tatsuya Harada The generation of 3D scenes from user-specified conditions offers a promising avenue for alleviating the production burden in 3D applications. Previous studies required significant effort to realize the desired scene, owing to limited control conditions. We propose a method for controlling and generating 3D scenes under multimodal conditions using partial images, layout information represented in the top view, and text prompts. Combining these conditions to generate a 3D scene involves the following significant difficulties: (1) the creation of large datasets, (2) reflection on the interaction of multimodal conditions, and (3) domain dependence of the layout conditions. We decompose the process of 3D scene generation into 2D image generation from the given conditions and 3D scene generation from 2D images. 2D image generation is achieved by fine-tuning a pretrained text-to-image model with a small artificial dataset of partial images and layouts, and 3D scene generation is achieved by layout-conditioned depth estimation and neural radiance fields (NeRF), thereby avoiding the creation of large datasets. The use of a common representation of spatial information using 360-degree images allows for the consideration of multimodal condition interactions and reduces the domain dependence of the layout control. The experimental results qualitatively and quantitatively demonstrated that the proposed method can generate 3D scenes in diverse domains, from indoor to outdoor, according to multimodal conditions. This paper proposes MaGRITTe, a method for controlling and generating 3D scenes from partial images, layout information (floor plans or terrain maps), and text prompts. Generating 3D scenes from user specifications is crucial for various applications, and existing methods struggle to integrate multiple control modalities effectively. MaGRITTe first converts partial images and layouts into a common equirectangular projection (ERP) format. Then, a fine-tuned text-to-image diffusion model generates a 360° RGB image, leveraging these inputs and text prompts. Finally, layout-conditioned depth estimation and NeRF training produce a navigable 3D scene. MaGRITTe generates consistent and controllable 3D scenes reflecting input conditions. Fine-tuning large text-to-image models with small, targeted datasets proves effective for this task. The method handles both indoor and outdoor scenes by adapting layout representations. MaGRITTe may struggle to separate overlapping objects specified in the layout. There are limitations in specifying areas where objects should not exist. Future work includes detecting and resolving inconsistencies between input conditions. 3d scene generation, 360-degree image generation, text-to-3d, layout-to-3d, image outpainting
2404.00269 Report IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images Yushuang Wu, Luyue Shi, Junhao Cai, Weihao Yuan, Lingteng Qiu, Zilong Dong, Liefeng Bo, Shuguang Cui, Xiaoguang Han Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion. This approach treats the query points for implicit field learning as a noisy point cloud for iterative denoising, allowing for their dynamic adaptation to the target object shape. Such adaptive query points harness diffusion learning's capability for coarse shape recovery and also enhances the implicit representation's ability to delineate finer details. Besides, an additional self-conditioning mechanism is designed to use implicit predictions as the guidance of diffusion learning, leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods. The generalizability of IPoD is also demonstrated on the MVImgNet dataset. Our project page is at https://yushuang-wu.github.io/IPoD. Proposes IPoD, a novel method integrating implicit field learning with point diffusion for generalizable 3D object reconstruction from single RGB-D images. Addresses limitations of pure implicit field learning methods, which require dense query-supervision and struggle with fine details, by leveraging diffusion models for adaptive query point positioning. Treats query points as a noisy point cloud, iteratively denoising them using a diffusion model while concurrently predicting implicit values (UDF) to refine the shape. Employs a self-conditioning mechanism using predicted UDF values to guide the denoising process. Achieves 7.8% improvement in F-score and 28.6% in Chamfer distance over previous state-of-the-art methods on CO3D-v2 dataset. Demonstrates superior reconstruction quality for both coarse shapes and fine details. Shows generalizability to unseen object categories in CO3D-v2 and MVImgNet datasets. Effectiveness on 3D human and scene reconstruction not yet validated. Future work includes exploring applications in human and scene reconstruction, addressing challenges like fine-grained details and severe occlusion. 3d reconstruction, diffusion models, implicit field learning, single-view reconstruction, rgb-d images
2404.00262 Report Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation Yuan Wang, Rui Sun, Naisong Luo, Yuwen Pan, Tianzhu Zhang Open-vocabulary semantic segmentation (OVS) aims to segment images of arbitrary categories specified by class labels or captions. However, most previous best-performing methods, whether pixel grouping methods or region recognition methods, suffer from false matches between image features and category labels. We attribute this to the natural gap between the textual features and visual features. In this work, we rethink how to mitigate false matches from the perspective of image-to-image matching and propose a novel relation-aware intra-modal matching (RIM) framework for OVS based on visual foundation models. RIM achieves robust region classification by firstly constructing diverse image-modal reference features and then matching them with region features based on relation-aware ranking distribution. The proposed RIM enjoys several merits. First, the intra-modal reference features are better aligned, circumventing potential ambiguities that may arise in cross-modal matching. Second, the ranking-based matching process harnesses the structure information implicit in the inter-class relationships, making it more robust than comparing individually. Extensive experiments on three benchmarks demonstrate that RIM outperforms previous state-of-the-art methods by large margins, obtaining a lead of more than 10% in mIoU on PASCAL VOC benchmark. This paper proposes RIM, a training-free open-vocabulary semantic segmentation framework that leverages the intra-modal matching between image features, outperforming previous state-of-the-art methods. Existing open-vocabulary segmentation methods struggle with false matches between image and category features due to the inherent gap between visual and textual representations. RIM utilizes Stable Diffusion and Segment Anything Model (SAM) to construct image-based category reference features. It then performs relation-aware matching based on ranking distribution in the DINOv2 feature space. RIM achieves a significant performance improvement over existing zero-shot OVS methods, particularly a 20.4% mIoU gain over SimSeg on COCO Object. The study validates the effectiveness of intra-modal matching over traditional cross-modal approaches for region classification. The proposed relation-aware matching strategy, incorporating inter-class relationships, further enhances segmentation accuracy by reducing misclassifications. The reliance on multiple foundation models introduces computational complexity. Future work could explore incorporating temporal information for video segmentation. open-vocabulary semantic segmentation, intra-modal matching, visual foundation models, stable diffusion, segment anything model
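The summary above describes classification by intra-modal, relation-aware ranking rather than raw cross-modal similarity. The sketch below is one plausible reading of that idea in plain NumPy: a region is assigned to the class whose inter-class relation profile best matches the region's own similarity distribution. The softmax temperature and the KL-based comparison are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def classify_region(region_feat, class_refs):
    """Hedged sketch of intra-modal, relation-aware matching (not the paper's exact math).

    region_feat: (d,) region feature in a visual embedding space (e.g. DINOv2).
    class_refs:  (C, d) image-modal reference features, one per candidate class.
    Instead of taking the raw argmax similarity, compare how the region relates
    to all classes against how each class reference relates to the other classes.
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    def softmax(x, t=0.07):
        e = np.exp((x - x.max()) / t)
        return e / e.sum()

    region = normalize(region_feat[None])           # (1, d)
    refs = normalize(class_refs)                    # (C, d)

    sim_region = (region @ refs.T).ravel()          # (C,) region vs. every class
    sim_refs = refs @ refs.T                        # (C, C) inter-class relations

    p = softmax(sim_region)                         # region's "ranking distribution"
    scores = [-np.sum(p * np.log(p / (softmax(sim_refs[c]) + 1e-12)))
              for c in range(len(refs))]            # negative KL to each class profile
    return int(np.argmax(scores))                   # class with the closest relation profile
```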
2404.00234 Report Grid Diffusion Models for Text-to-Video Generation Taegyeong Lee, Soyeong Kwon, Taehwan Kim Recent advances in the diffusion models have significantly improved text-to-image generation. However, generating videos from text is a more challenging task than generating images from text, due to the much larger dataset and higher computational cost required. Most existing video generation methods use either a 3D U-Net architecture that considers the temporal dimension or autoregressive generation. These methods require large datasets and are limited in terms of computational costs compared to text-to-image generation. To tackle these challenges, we propose a simple but effective novel grid diffusion for text-to-video generation without temporal dimension in architecture and a large text-video paired dataset. We can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames by representing the video as a grid image. Additionally, since our method reduces the dimensions of the video to the dimensions of the image, various image-based methods can be applied to videos, such as text-guided video manipulation from image manipulation. Our proposed method outperforms the existing methods in both quantitative and qualitative evaluations, demonstrating the suitability of our model for real-world video generation. This paper introduces a novel grid diffusion model for text-to-video generation, which represents videos as grid images to reduce computational cost and reliance on large text-video paired datasets. Generating videos from text is computationally expensive and often requires large, paired datasets, which this method aims to address. The method uses two stages: (1) key grid image generation by fine-tuning a pre-trained text-to-image diffusion model on a small dataset of grid images representing key video frames; (2) autoregressive grid image interpolation to generate intermediate frames while maintaining temporal consistency. The model outperforms existing text-to-video generation models on standard benchmarks (MSR-VTT, UCF-101) in terms of CLIP similarity, FVD, and Inception Score, even with less training data. It generates higher-quality videos with better text alignment according to human evaluation. The approach maintains a fixed GPU memory footprint regardless of the number of frames generated, showcasing its efficiency. The model's reliance on a pre-trained text-to-image model might limit its ability to generate novel or highly complex visual content. Future work could explore applying this method to other generative tasks involving different modalities, such as sound. text-to-video generation, diffusion models, grid images, temporal consistency, computational efficiency
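The central representational trick — packing several frames into one grid image so that an ordinary text-to-image diffusion model can operate on a video without any temporal layers — reduces to simple tiling and untiling, sketched below with NumPy. The 2x2 grid size is an arbitrary choice for illustration.

```python
import numpy as np

def frames_to_grid(frames, rows=2, cols=2):
    """Tile rows*cols video frames (N, H, W, 3) into one grid image.

    Once frames live in a single image, an image diffusion model can generate
    or edit them with a fixed memory footprint regardless of frame count.
    """
    n, h, w, c = frames.shape
    assert n == rows * cols
    grid = frames.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)
    return grid

def grid_to_frames(grid, rows=2, cols=2):
    """Invert frames_to_grid, recovering the individual frames."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return frames.reshape(rows * cols, h, w, c)
```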
2404.00230 Report Latent Watermark: Inject and Detect Watermarks in Latent Diffusion Space Zheling Meng, Bo Peng, Jing Dong Watermarking is a tool for actively identifying and attributing the images generated by latent diffusion models. Existing methods face the dilemma of watermark robustness and image quality. The reason for this dilemma is that watermark detection is performed in pixel space, implying an intrinsic link between image quality and watermark robustness. In this paper, we highlight that an effective solution to the problem is to both inject and detect watermarks in latent space, and propose Latent Watermark (LW) with a progressive training strategy. Experiments show that compared to the recently proposed methods such as StegaStamp, StableSignature, RoSteALS and TreeRing, LW not only surpasses them in terms of robustness but also offers superior image quality. When we inject 64-bit messages, LW can achieve an identification performance close to 100% and an attribution performance above 97% under 9 single-attack scenarios and one all-attack scenario. Our code will be available on GitHub. This paper proposes Latent Watermark (LW), a method for watermarking images generated by latent diffusion models, that injects and detects watermarks directly in the latent space. Addressing the critical need for identifying and attributing images generated by AI models, especially given the potential for misuse like spreading misinformation. LW uses a message encoder/decoder, coupler, and decoupler, all trained with a three-step progressive strategy. This strategy ensures minimal impact on image quality while enabling robust watermark embedding. LW demonstrates superior image quality compared to existing methods, showing minimal differences from non-watermarked images across metrics like FID, SSIM, NIQE, and PIQE. It exhibits significantly stronger robustness against various attacks, including destructive, constructive, and reconstructive attacks, achieving high Bit Accuracy and TPR@0.01FPR. The method is environmentally friendly, with a training process that results in significantly lower CO2 emissions compared to training the generative model itself. The current work focuses on image watermarking; further investigation is needed to extend its applicability to other generative frameworks like GANs. Exploring different latent space manipulation techniques within LW could lead to even more robust and imperceptible watermarking. latent diffusion model, watermarking, image attribution, information security, aigc
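A minimal sketch of what "inject and detect in latent space" means in practice is given below: the watermark is added to, and recovered from, the diffusion latent rather than the decoded pixels. The tiny MLPs, the perturbation scale, and the single-module design are stand-ins for the paper's message encoder/coupler/decoupler and progressive training, i.e. assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class LatentWatermarker(nn.Module):
    """Hedged sketch of latent-space watermark injection and decoding."""

    def __init__(self, latent_dim=4 * 64 * 64, msg_bits=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(msg_bits, 512), nn.ReLU(),
                                   nn.Linear(512, latent_dim))
        self.decode = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                    nn.Linear(512, msg_bits))

    def inject(self, latent, message):
        # latent: (B, 4, 64, 64) diffusion latent; message: (B, 64) bits in {0, 1}
        delta = self.embed(message.float()).view_as(latent)
        return latent + 0.05 * delta          # small perturbation; scale is an assumption

    def extract(self, latent):
        logits = self.decode(latent.flatten(1))
        return (logits > 0).long()            # recovered bits
```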
2403.20312 Report Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, Aparna Bharati Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. An important aspect of reasoning in logic and language is negations. This paper highlights the limitations of popular VLMs such as CLIP, at understanding the implications of negations, i.e., the effect of the word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts with negations, we present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions. Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework, has an improved understanding of negations. This training paradigm improves CoN-CLIP's ability to encode semantics reliably, resulting in 3.85% average gain in top-1 accuracy for zero-shot image classification across 8 datasets. Further, CoN-CLIP outperforms CLIP on challenging compositionality benchmarks such as SugarCREPE by 4.4%, showcasing emergent compositional understanding of objects, relations, and attributes in text. Overall, our work addresses a crucial limitation of VLMs by introducing a dataset and framework that strengthens semantic associations between images and text, demonstrating improved large-scale foundation models with significantly reduced computational cost, promoting efficiency and accessibility. This paper exposes the weakness of current vision-language models (VLMs) in understanding negations in text descriptions, which limits their ability for accurate image-text matching and reasoning. To address this, the authors introduce a new dataset, CC-Neg, and a novel training framework, CoN-CLIP. Understanding negations is crucial for VLMs as it enables finer-grained control over semantic matching, leading to improvements in various tasks like image-text retrieval, text-to-image generation, and zero-shot image classification. The authors create CC-Neg, a large-scale dataset with image-caption pairs and their corresponding negated captions. They then propose CoN-CLIP, which fine-tunes CLIP's text encoder using a modified contrastive loss incorporating negated captions and distractor images. CoN-CLIP significantly outperforms existing VLMs on CC-Neg, demonstrating a strong grasp of negation in textual descriptions. CoN-CLIP exhibits enhanced zero-shot image classification accuracy across 8 different datasets, indicating improved semantic understanding. CoN-CLIP shows superior performance on the SugarCREPE benchmark, demonstrating emergent compositional understanding of objects, attributes, and relations. The generation of negated captions relies heavily on the capabilities and potential biases of the chosen large language model. Future work can investigate the generalization of CoN-CLIP to more nuanced forms of negation and explore its application in other multimodal domains. vision-language models, compositionality, multimodal learning, contrastive learning, negation understanding
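One way to picture a negation-aware contrastive objective of the kind described above is sketched below: the standard CLIP term is kept, each image is pushed to prefer its true caption over the negated version, and negated captions are pulled toward distractor images that actually satisfy them. The specific terms and their equal weighting are assumptions, not the published CoN-CLIP loss.

```python
import torch
import torch.nn.functional as F

def negation_aware_contrastive_loss(img, txt, txt_neg, img_distractor, temperature=0.07):
    """Hedged sketch of a negation-aware contrastive objective.

    img:            (B, d) image embeddings
    txt:            (B, d) embeddings of the true captions
    txt_neg:        (B, d) embeddings of the negated captions
    img_distractor: (B, d) embeddings of distractor images matching txt_neg
    All embeddings are assumed L2-normalized.
    """
    def logits(a, b):
        return a @ b.t() / temperature

    labels = torch.arange(img.size(0), device=img.device)

    # standard CLIP term: images match their true captions
    l_clip = (F.cross_entropy(logits(img, txt), labels) +
              F.cross_entropy(logits(txt, img), labels)) / 2

    # each image should prefer its true caption over its own negated caption
    pair_logits = torch.stack([(img * txt).sum(-1),
                               (img * txt_neg).sum(-1)], dim=1) / temperature
    l_neg = F.cross_entropy(pair_logits, torch.zeros_like(labels))

    # negated captions should instead attach to the distractor images
    l_distract = F.cross_entropy(logits(txt_neg, img_distractor), labels)

    return l_clip + l_neg + l_distract        # equal weights: an assumption
```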
2403.20309 Report InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, Yue Wang While novel view synthesis (NVS) has made substantial progress in 3D computer vision, it typically requires an initial estimation of camera intrinsics and extrinsics from dense viewpoints. This pre-processing is usually conducted via a Structure-from-Motion (SfM) pipeline, a procedure that can be slow and unreliable, particularly in sparse-view scenarios with insufficient matched features for accurate reconstruction. In this work, we integrate the strengths of point-based representations (e.g., 3D Gaussian Splatting, 3D-GS) with end-to-end dense stereo models (DUSt3R) to tackle the complex yet unresolved issues in NVS under unconstrained settings, which encompasses pose-free and sparse view challenges. Our framework, InstantSplat, unifies dense stereo priors with 3D-GS to build 3D Gaussians of large-scale scenes from sparse-view & pose-free images in less than 1 minute. Specifically, InstantSplat comprises a Coarse Geometric Initialization (CGI) module that swiftly establishes a preliminary scene structure and camera parameters across all training views, utilizing globally-aligned 3D point maps derived from a pre-trained dense stereo pipeline. This is followed by the Fast 3D-Gaussian Optimization (F-3DGO) module, which jointly optimizes the 3D Gaussian attributes and the initialized poses with pose regularization. Experiments conducted on the large-scale outdoor Tanks & Temples datasets demonstrate that InstantSplat significantly improves SSIM (by 32%) while concurrently reducing Absolute Trajectory Error (ATE) by 80%. These establish InstantSplat as a viable solution for scenarios involving pose-free and sparse-view conditions. Project page: instantsplat.github.io. Introduced InstantSplat, an efficient framework for simultaneous pose estimation and novel view synthesis from sparse, unposed images, utilizing 3D priors from a dense stereo model. Addresses the limitations of traditional NVS methods that require pre-computed camera parameters and dense views, enabling casual capture scenarios. Employs a two-stage approach: 1) Coarse Geometric Initialization using DUSt3R for preliminary scene structure and camera parameters. 2) Fast 3D-Gaussian Optimization to refine scene attributes and camera extrinsics. Achieves high rendering quality, outperforming baselines in SSIM and LPIPS on Tanks & Temples and MVImgNet datasets. Demonstrates accurate pose estimation, with lower ATE and RPE compared to pose-free methods. Significantly faster than existing techniques, reconstructing scenes in under a minute. Assumes a single-camera setup, limiting its applicability to multi-view stereo scenarios. Relies on the accuracy of the pre-trained dense stereo model, which can impact overall performance. Future work can explore online refinement of both the 3D prior and Gaussian attributes. novel view synthesis, pose estimation, 3d gaussian splatting, dense stereo, sparse view
2403.20275 Report Snap-it, Tap-it, Splat-it: Tactile-Informed 3D Gaussian Splatting for Reconstructing Challenging Surfaces Mauro Comi, Alessio Tonioni, Max Yang, Jonathan Tremblay, Valts Blukis, Yijiong Lin, Nathan F. Lepora, Laurence Aitchison Touch and vision go hand in hand, mutually enhancing our ability to understand the world. From a research perspective, the problem of mixing touch and vision is underexplored and presents interesting challenges. To this end, we propose Tactile-Informed 3DGS, a novel approach that incorporates touch data (local depth maps) with multi-view vision data to achieve surface reconstruction and novel view synthesis. Our method optimises 3D Gaussian primitives to accurately model the object's geometry at points of contact. By creating a framework that decreases the transmittance at touch locations, we achieve a refined surface reconstruction, ensuring a uniformly smooth depth map. Touch is particularly useful when considering non-Lambertian objects (e.g. shiny or reflective surfaces) since contemporary methods tend to fail to reconstruct with fidelity specular highlights. By combining vision and tactile sensing, we achieve more accurate geometry reconstructions with fewer images than prior methods. We conduct evaluation on objects with glossy and reflective surfaces and demonstrate the effectiveness of our approach, offering significant improvements in reconstruction quality. Introduces Tactile-Informed 3DGS, a novel approach that integrates tactile sensing (local depth maps) with multi-view RGB data for enhanced 3D object reconstruction and novel view synthesis, particularly effective for challenging surfaces like glossy and reflective objects. Addresses limitations of vision-only methods that struggle with non-Lambertian surfaces and limited viewpoints, leveraging tactile sensing's robustness to lighting variations and sparse yet accurate geometric information. Optimizes 3D Gaussian primitives within a 3D Gaussian Splatting framework, guided by: (1) Photometric loss from multi-view images, (2) 3D transmittance loss minimized at touch locations, (3) Unsupervised edge-aware smoothness loss with proximity-based masking to refine reconstruction beyond contact areas. Achieves state-of-the-art geometry reconstruction on glossy/reflective surfaces, outperforming NeRF-based methods in speed (1 hour vs. 25 hours). Significantly improves reconstruction quality and novel view synthesis with minimal views (5 views) compared to 3DGS and NeRO. Demonstrates consistent improvement with increasing touch interactions, validating the effectiveness of tactile data integration. Current random touch sampling could be improved with an adaptive strategy to complement visual data more effectively. Future work could explore the application of multimodal interaction for reconstructing transparent objects and integrating surface modeling techniques. 3d reconstruction, novel view synthesis, tactile sensing, 3d gaussian splatting, non-lambertian surfaces
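The edge-aware smoothness term with proximity-based masking mentioned above is a fairly standard construction; a hedged PyTorch sketch is given below. The exact form of the proximity mask and the loss weighting used in the paper are assumptions here, and the photometric and transmittance terms are omitted.

```python
import torch

def edge_aware_smoothness(depth, image, proximity_mask):
    """Hedged sketch of an unsupervised edge-aware depth smoothness term.

    depth:          (1, 1, H, W) rendered depth
    image:          (1, 3, H, W) rendered RGB, used to down-weight smoothing at edges
    proximity_mask: (1, 1, H, W) close to 1 near touch locations, 0 elsewhere (assumed form)
    """
    d_dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    d_dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    i_dx = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    # relax smoothing across strong image edges, focus it near touch contacts
    loss_x = d_dx * torch.exp(-i_dx) * proximity_mask[..., :, 1:]
    loss_y = d_dy * torch.exp(-i_dy) * proximity_mask[..., 1:, :]
    return loss_x.mean() + loss_y.mean()
```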
2403.20271 Report Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM for various visual prompts (points, bounding boxes, and free-form shape) and language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data features a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, including natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions. Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed pixel-level description and question-answering abilities. This paper introduces SPHINX-V, a novel multimodal large language model (MLLM) designed for enhanced pixel-level image understanding through visual prompting, supporting various prompt types like points, boxes, and free-form shapes. Current MLLMs primarily focus on comprehending entire images, limiting their ability to address user queries about specific regions or details within an image. SPHINX-V aims to overcome this limitation and enable more precise, pixel-level understanding. SPHINX-V uses a visual prompt encoder and a two-stage training strategy: 1) pre-training for image-visual prompt-text alignment and 2) supervised fine-tuning on a multi-domain dataset (MDVP-Data) with instructions for various tasks like captioning, relationship analysis, and reasoning. SPHINX-V demonstrates state-of-the-art performance on referring object classification tasks, surpassing previous methods on LVIS and PACO datasets. It excels in regional optical character recognition (OCR), significantly outperforming baseline models on the COCO-Text dataset. SPHINX-V achieves high scores on region-level captioning tasks, as well as comprehensive assessments using LLaVA-Bench, Ferret-Bench, and the proposed MDVP-Bench. The model's performance on image-level understanding tasks could be further improved by incorporating more open-source image-level VQA data during training. Future work could focus on enhancing the visual prompt encoder to better distinguish and model different types of visual prompts. multimodal large language model, visual prompting, pixel-level understanding, region-level captioning, optical character recognition
2403.20249 Report Relation Rectification in Diffusion Model Yinwei Wu, Xingyi Yang, Xinchao Wang Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often struggle with accurately depicting visual relationships between objects. This issue, as we uncover through careful analysis, arises from a misaligned text encoder that struggles to interpret specific relationships and differentiate the logical order of associated objects. To resolve this, we introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this, we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). It models the directional relationships between relation terms and corresponding objects within the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space. Crucially, our method retains the parameters of the text encoder and diffusion model, preserving the model's robust performance on unrelated descriptions. We validated our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative enhancements in generating images with precise visual relations. Project page: https://wuyinwei-hah.github.io/rrnet.github.io/. Introduces Relation Rectification, a novel task to improve the accuracy of directional relationships depicted in images generated by T2I diffusion models, and proposes RRNet, a HGCN-based framework to address it. Large T2I diffusion models often struggle to accurately depict visual relationships between objects due to limitations in interpreting directional or relational terms in text prompts, treating them as 'bags of words'. RRNet models object-swapped prompts (OSPs) as heterogeneous graphs to capture directional relationships. It leverages HGCN to generate adjustment vectors that refine the text embeddings, particularly the [EOT] token embedding, to guide the diffusion model towards generating images with correct relationship directions. The model is trained using a combination of positive (denoising) and negative losses to ensure accurate relationship representation and disentanglement of object features from relationships. RRNet significantly improves the accuracy of relationship generation in SD by up to 25%, as evidenced by evaluation using vision-language chatbots. The approach enhances the interpretability of generated images, allowing for clear depiction of directional transitions in relationships. RRNet demonstrates robust generalization capabilities, effectively handling even objects unseen during training. RRNet's performance is limited by the diffusion model's pre-existing knowledge, struggling with relationships involving unseen concepts. Extending RRNet to handle more complex, multi-relational scenarios requires further investigation, particularly in managing multiple adjustment vectors without introducing semantic confusion. text-to-image synthesis, diffusion models, relation rectification, heterogeneous graph convolutional network, vision-language models
2403.20236 Report Long-Tailed Anomaly Detection with Learnable Class Names Chih-Hui Ho, Kuan-Chuan Peng, Nuno Vasconcelos Anomaly detection (AD) aims to identify defective images and localize their defects (if any). Ideally, AD models should be able to detect defects over many image classes; without relying on hard-coded class names that can be uninformative or inconsistent across datasets; learn without anomaly supervision; and be robust to the long-tailed distributions of real-world applications. To address these challenges, we formulate the problem of long-tailed AD by introducing several datasets with different levels of class imbalance and metrics for performance evaluation. We then propose a novel method, LTAD, to detect defects from multiple and long-tailed classes, without relying on dataset class names. LTAD combines AD by reconstruction and semantic AD modules. AD by reconstruction is implemented with a transformer-based reconstruction module. Semantic AD is implemented with a binary classifier, which relies on learned pseudo class names and a pretrained foundation model. These modules are learned over two phases. Phase 1 learns the pseudo-class names and a variational autoencoder (VAE) for feature synthesis that augments the training data to combat long-tails. Phase 2 then learns the parameters of the reconstruction and classification modules of LTAD. Extensive experiments using the proposed long-tailed datasets show that LTAD substantially outperforms the state-of-the-art methods for most forms of dataset imbalance. The long-tailed dataset split is available at https://zenodo.org/records/10854201 . This paper introduces the task of long-tailed anomaly detection (LTAD) where training datasets exhibit class imbalance. Prior anomaly detection methods, designed for balanced datasets, struggle in real-world scenarios with skewed class distributions common in manufacturing. The paper proposes LTAD, a new method combining reconstruction-based anomaly detection with semantic anomaly detection. It uses a data augmentation strategy based on a class-sensitive VAE and learns pseudo class names to overcome ambiguity of real class names. LTAD consistently outperforms state-of-the-art anomaly detection methods on long-tailed versions of MVTec, VisA, and DAGM datasets. Both reconstruction and semantic anomaly detection modules contribute to LTAD's superior performance. Learned pseudo class names prove more effective than real class names, highlighting the ability to handle class ambiguity. The paper relies on a single pretrained foundational model (ALIGN) and doesn't explore the effect of other models. Future work includes investigating alternative data augmentation strategies beyond VAE. anomaly detection, long-tailed learning, data augmentation, computer vision, semantic anomaly detection
2403.20231 Report U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation You Wu, Kean Liu, Xiaoyue Mi, Fan Tang, Juan Cao, Jintao Li Concept personalization methods enable large text-to-image models to learn specific subjects (e.g., objects/poses/3D models) and synthesize renditions in new contexts. Given that the image references are highly biased towards visual attributes, state-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space. In this study, we proposed a more challenging setting, namely fine-grained visual appearance personalization. Different from existing methods, we allow users to provide a sentence describing the desired attributes. A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes. These augmented data allow for refining the model's understanding of the target attribute while mitigating the impact of unrelated attributes. At the inference stage, adjustments are conducted on semantic space through the learned target and non-target embeddings to further enhance the disentanglement of target attributes. Extensive experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts, thus improving the controllability and flexibility of personalization. This paper introduces U-VAP, a novel method for user-specified visual appearance personalization in text-to-image generation that allows control over fine-grained attributes (e.g., color, pattern, structure) from reference images. Existing personalization methods struggle to disentangle fine-grained visual attributes within a concept, limiting controllability in combining specific appearances with new concepts. U-VAP employs a decoupled self-augmentation strategy. After an initial personalization, it uses an LLM to generate target- and non-target-specific text prompts. These prompts generate augmented image sets, further fine-tuning the model to learn and disentangle the desired attributes. Semantic adjustment during inference enhances disentanglement. U-VAP enables controlled and accurate personalization of specific visual attributes, as demonstrated through quantitative and qualitative comparisons with state-of-the-art methods. The method exhibits flexibility in applying learned attributes to various novel concepts. User studies confirm U-VAP's superiority in generating personalized images with high fidelity to both the specified attribute and the new concept. U-VAP's performance depends on the capability of the base personalization method used in pre-learning, potentially limiting disentanglement effectiveness. Strong prior information associated with certain words in the inference prompt might sometimes overshadow the learned target attributes. text-to-image generation, personalization, attribute disentanglement, diffusion models, self-augmentation
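The inference-time "semantic adjustment" can be read as simple vector arithmetic between the learned target and non-target embeddings; a hedged sketch follows. The weights alpha and beta and the final rescaling step are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def semantic_adjustment(prompt_emb: torch.Tensor,
                        target_emb: torch.Tensor,
                        nontarget_emb: torch.Tensor,
                        alpha: float = 1.0,
                        beta: float = 0.5) -> torch.Tensor:
    """Hedged sketch of inference-time adjustment in semantic (embedding) space.

    The learned target-attribute embedding is amplified and the learned
    non-target embedding is suppressed, disentangling the desired visual
    appearance from unrelated attributes of the reference images.
    """
    adjusted = prompt_emb + alpha * target_emb - beta * nontarget_emb
    # keep the result on roughly the same scale as the original prompt (assumption)
    return adjusted * (prompt_emb.norm() / (adjusted.norm() + 1e-8))
```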
2403.20193 Report Motion Inversion for Video Customization Luozhou Wang, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Yingcong Chen In this research, we present a novel approach to motion customization in video generation, addressing the widespread gap in the thorough exploration of motion representation within video generative models. Recognizing the unique challenges posed by video's spatiotemporal nature, our method introduces Motion Embeddings, a set of explicit, temporally coherent one-dimensional embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Our approach offers a compact and efficient solution to motion representation and enables complex manipulations of motion characteristics through vector arithmetic in the embedding space. Furthermore, we identify the Temporal Discrepancy in video generative models, which refers to variations in how different motion modules process temporal relationships between frames. We leverage this understanding to optimize the integration of our motion embeddings. Our contributions include the introduction of a tailored motion embedding for customization tasks, insights into the temporal processing differences in video models, and a demonstration of the practical advantages and effectiveness of our method through extensive experiments. This work introduces motion embeddings for video diffusion models, enabling the isolation and manipulation of motion from a source video, facilitating motion transfer to different text-guided generations. Directly manipulating motion in text-guided video generation is challenging. This work offers a way to isolate and transfer motion, enhancing control and creative possibilities in video generation. The authors integrate motion embeddings into a video diffusion model's UNet architecture. They explore different training objectives and noise initialization strategies to optimize motion transfer for various scenarios, including camera, object, and hybrid motion. Motion embeddings successfully isolate motion from source videos, allowing for transfer to novel text-guided generations. Different training objectives prove beneficial for specific motion types. For instance, appearance-debiased temporal loss excels in camera motion transfer. The method allows for flexible motion manipulation, including using partial motion embeddings and interpolating across frames for longer sequences. The effectiveness of motion transfer can vary depending on the complexity of the motion and the quality of the source video. The work primarily focuses on motion representation and transfer, with potential for future exploration in combining it with advanced appearance editing techniques. video generation, motion transfer, diffusion models, motion embeddings, text-guided synthesis
2403.20159 Report HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes Ke Wu, Kaizhao Zhang, Zhiwei Zhang, Shanshuai Yuan, Muer Tie, Julong Wei, Zijun Xu, Jieru Zhao, Zhongxue Gan, Wenchao Ding Online dense mapping of urban scenes forms a fundamental cornerstone for scene understanding and navigation of autonomous vehicles. Recent advancements in mapping methods are mainly based on NeRF, whose rendering speed is too slow to meet online requirements. 3D Gaussian Splatting (3DGS), with its rendering speed hundreds of times faster than NeRF, holds greater potential in online dense mapping. However, integrating 3DGS into a street-view dense mapping framework still faces two challenges, including incomplete reconstruction due to the absence of geometric information beyond the LiDAR coverage area and extensive computation for reconstruction in large urban scenes. To this end, we propose HGS-Mapping, an online dense mapping framework in unbounded large-scale scenes. To attain complete construction, our framework introduces Hybrid Gaussian Representation, which models different parts of the entire scene using Gaussians with distinct properties. Furthermore, we employ a hybrid Gaussian initialization mechanism and an adaptive update method to achieve high-fidelity and rapid reconstruction. To the best of our knowledge, we are the first to integrate Gaussian representation into online dense mapping of urban scenes. Our approach achieves SOTA reconstruction accuracy while only employing 66% number of Gaussians, leading to 20% faster reconstruction speed. This paper proposes HGS-Mapping, the first online dense mapping framework for urban scenes using a novel 3D Gaussian Splatting-based representation. Current NeRF-based mapping methods lack rendering speed for online applications, while existing 3DGS methods struggle with complete reconstruction and computational efficiency in large-scale urban environments. The HGS-Mapping framework leverages a Hybrid Gaussian Representation (Sphere Gaussian for sky, 2D Gaussian Plane for roads, and 3D Gaussian for scenery). It employs a hybrid Gaussian initialization mechanism (combining LiDAR and feature matching) and an adaptive update method (silhouette filtering, densify control, and importance pruning) for efficient and accurate reconstruction. HGS-Mapping achieves state-of-the-art reconstruction accuracy in urban environments, outperforming NeRF and Gaussian-based baselines in rendering quality. The method demonstrates significant speed improvements, achieving 20% faster reconstruction than the current SOTA online method (SplaTAM) while using only 66% of the Gaussians. The proposed Hybrid Gaussian Representation effectively addresses sky and road modeling challenges, leading to more efficient and accurate urban scene reconstruction. The RANSAC-based road surface extraction can be limited in scenarios with complex road geometry. Future work could explore extending the framework to handle arbitrary outdoor scenes and incorporating dynamic object representation. gaussian splatting, dense mapping, autonomous driving, 3d reconstruction, urban scenes
2403.20153 Report Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior Jaehoon Ko, Kyusun Cho, Joungbin Lee, Heeji Yoon, Sangmin Lee, Sangjun Ahn, Seungryong Kim Recent methods for audio-driven talking head synthesis often optimize neural radiance fields (NeRF) on a monocular talking portrait video, leveraging its capability to render high-fidelity and 3D-consistent novel-view frames. However, they often struggle to reconstruct complete face geometry due to the absence of comprehensive 3D information in the input monocular videos. In this paper, we introduce a novel audio-driven talking head synthesis framework, called Talk3D, that can faithfully reconstruct its plausible facial geometries by effectively adopting the pre-trained 3D-aware generative prior. Given the personalized 3D generative model, we present a novel audio-guided attention U-Net architecture that predicts the dynamic face variations in the NeRF space driven by audio. Furthermore, our model is further modulated by audio-unrelated conditioning tokens which effectively disentangle variations unrelated to audio features. Compared to existing methods, our method excels in generating realistic facial geometries even under extreme head poses. We also conduct extensive experiments showing our approach surpasses state-of-the-art benchmarks in terms of both quantitative and qualitative evaluations. Talk3D, a novel framework for high-fidelity 3D talking head synthesis, leverages a 3D-aware GAN prior and region-aware motion prediction. Existing audio-driven talking head synthesis methods struggle to reconstruct complete face geometry and lack multi-view consistency, particularly from unseen viewpoints. Talk3D uses a personalized 3D generator fine-tuned with VIVE3D and an audio-guided attention U-Net architecture to predict triplane offsets (deltaplanes) that capture audio-driven facial dynamics. Talk3D achieves state-of-the-art results in quantitative and qualitative evaluations, outperforming previous methods in terms of image fidelity, lip synchronization accuracy, and robustness to novel viewpoints. The method successfully disentangles local variations like eye blinks, torso movements, and background motion, ensuring accurate lip-sync and realistic facial animations. Talk3D allows for facial attribute manipulation (e.g., age, hair length) by leveraging the latent space of the 3D-aware GAN. Talk3D, relying on GAN inversion, currently exhibits limited generalizability beyond photorealistic human faces. The reliance on GAN inversion introduces data preparation complexities, requiring precise frame alignment and cropping. talking head synthesis, neural radiance fields (nerf), 3d-aware gans, audio-driven animation, deep learning
2403.20126 Report ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning Beomyoung Kim, Joonsang Yu, Sung Ju Hwang Panoptic segmentation, combining semantic and instance segmentation, stands as a cutting-edge computer vision task. Despite recent progress with deep learning models, the dynamic nature of real-world applications necessitates continual learning, where models adapt to new classes (plasticity) over time without forgetting old ones (catastrophic forgetting). Current continual segmentation methods often rely on distillation strategies like knowledge distillation and pseudo-labeling, which are effective but result in increased training complexity and computational overhead. In this paper, we introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning, dubbed ECLIPSE. Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings, addressing both catastrophic forgetting and plasticity and significantly reducing the trainable parameters. To mitigate inherent challenges such as error propagation and semantic drift in continual segmentation, we propose logit manipulation to effectively leverage common knowledge across the classes. Experiments on ADE20K continual panoptic segmentation benchmark demonstrate the superiority of ECLIPSE, notably its robustness against catastrophic forgetting and its reasonable plasticity, achieving a new state-of-the-art. The code is available at https://github.com/clovaai/ECLIPSE. This paper introduces ECLIPSE, a novel, efficient method for continual panoptic segmentation based on Visual Prompt Tuning. It freezes base model parameters and fine-tunes only prompt embeddings to learn new classes, mitigating catastrophic forgetting while enhancing plasticity. Continual learning in panoptic segmentation is crucial for real-world applications that require adapting to new classes over time without forgetting old ones. Existing methods rely on distillation strategies, leading to increased complexity and overhead. ECLIPSE freezes the base model and introduces new prompt embeddings for each set of new classes. It leverages logit manipulation, a novel strategy that leverages inter-class knowledge to address error propagation and semantic drift. ECLIPSE achieves state-of-the-art results on ADE20K continual panoptic segmentation benchmark with only 1.3% of trainable parameters. It demonstrates superior robustness against catastrophic forgetting, especially as the number of continual steps increases. The method also effectively learns new classes, even with limited base knowledge. The computational complexity increases with expanding prompt sets as the number of classes grows. Future work may explore optimizing the computational complexity for scenarios with a massive number of classes. continual learning, panoptic segmentation, visual prompt tuning, logit manipulation, catastrophic forgetting
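The core efficiency argument — freeze the base panoptic model and learn only a small pool of prompt embeddings per continual step — is easy to make concrete; a hedged PyTorch sketch is below. The embedding dimensionality and the way prompts are pooled are assumptions, and the logit-manipulation component is omitted.

```python
import torch
import torch.nn as nn

class ContinualPromptPool(nn.Module):
    """Hedged sketch of prompt tuning for continual panoptic segmentation.

    The pretrained segmenter is frozen; each continual step only adds a small
    learnable embedding table for its new classes.
    """
    def __init__(self, embed_dim=256):
        super().__init__()
        self.steps = nn.ModuleList()          # one small embedding table per continual step
        self.embed_dim = embed_dim

    def add_step(self, num_new_classes, prompts_per_class=1):
        step = nn.Embedding(num_new_classes * prompts_per_class, self.embed_dim)
        self.steps.append(step)
        return step                           # only these parameters receive gradients

    def all_prompts(self):
        # concatenated prompts from every step seen so far, fed to the frozen decoder
        return torch.cat([s.weight for s in self.steps], dim=0)

def freeze_base(model: nn.Module):
    """Freeze every parameter of the pretrained base segmenter."""
    for p in model.parameters():
        p.requires_grad_(False)
```

Training at step t would then optimize only the embeddings returned by `add_step`, which is what keeps the trainable-parameter fraction small.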
2403.20105 Report FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models Barbara Toniella Corradini, Mustafa Shukor, Paul Couairon, Guillaume Couairon, Franco Scarselli, Matthieu Cord Foundation models have exhibited unprecedented capabilities in tackling many domains and tasks. Models such as CLIP are currently widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leading models in terms of realistic image generation. Image generative models are trained on massive datasets that provide them with powerful internal spatial representations. In this work, we explore the potential benefits of such representations, beyond image generation, in particular, for dense visual prediction tasks. We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets, with pixel-level annotations. To avoid the annotation cost or training large diffusion models, we constrain our setup to be zero-shot and training-free. In a nutshell, our pipeline leverages different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation. The pipeline is as follows: the image is passed to both a captioner model (i.e. BLIP) and a diffusion model (i.e., Stable Diffusion Model) to generate a text description and visual representation, respectively. The features are clustered and binarized to obtain class-agnostic masks for each object. These masks are then mapped to a textual class, using the CLIP model to support open-vocabulary. Finally, we add a refinement step that allows us to obtain a more precise segmentation mask. Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets. In addition, we show very competitive results compared to the recent weakly-supervised segmentation approaches. We provide comprehensive experiments showing the superiority of diffusion model features compared to other pretrained models. Project page: https://bcorrad.github.io/freesegdiff/ This paper introduces FreeSeg-Diff, a zero-shot, training-free approach for open-vocabulary image segmentation leveraging pre-trained diffusion models. This approach eliminates the need for expensive pixel-level annotations and the training of large diffusion models, potentially making image segmentation more accessible and scalable. The method uses a pre-trained diffusion model to extract image features, clusters these features to generate class-agnostic masks, and then employs CLIP to map these masks to textual classes extracted from image captions. FreeSeg-Diff outperforms several training-based and weakly supervised approaches on Pascal VOC and COCO datasets. The study highlights the superior semantic localization capabilities of diffusion models compared to other pre-trained models like CLIP, DINOv2, and ViT. The approach demonstrates competitive performance against recent state-of-the-art weakly supervised segmentation methods. The performance of FreeSeg-Diff still lags behind state-of-the-art supervised segmentation approaches. The reliance on multiple models, including a large diffusion model, introduces a slight computational overhead compared to traditional segmentation models. image segmentation, diffusion models, zero-shot learning, open-vocabulary segmentation, weakly supervised learning
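Two steps of the training-free pipeline — clustering diffusion features into class-agnostic masks and mapping each mask to a caption-derived class with CLIP — are sketched below. The cluster count, feature shapes, and the plain k-means/argmax choices are assumptions; the refinement step is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def features_to_masks(feat_map, n_clusters=8):
    """Hedged sketch: spatial diffusion features -> class-agnostic masks.

    feat_map: (H, W, d) features extracted from a diffusion model's U-Net.
    Returns one boolean mask per cluster; the cluster count is an assumption.
    """
    h, w, d = feat_map.shape
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feat_map.reshape(-1, d))
    labels = labels.reshape(h, w)
    return [labels == k for k in range(n_clusters)]

def assign_classes(mask_feats, text_feats):
    """Map each masked-region feature to the most similar caption-derived class via CLIP.

    mask_feats: (M, d) CLIP image embeddings of the masked crops
    text_feats: (C, d) CLIP text embeddings of the candidate class names
    """
    mask_feats = mask_feats / np.linalg.norm(mask_feats, axis=1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return (mask_feats @ text_feats.T).argmax(axis=1)
```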
2403.20079 Report SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior Zhongrui Yu, Haoran Wang, Jinze Yang, Hanzhang Wang, Zeke Xie, Yunfeng Cai, Jiale Cao, Zhong Ji, Mingming Sun Novel View Synthesis (NVS) for street scenes plays a critical role in autonomous driving simulation. The current mainstream technique to achieve it is neural rendering, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Although thrilling progress has been made, when handling street scenes, current methods struggle to maintain rendering quality at viewpoints that deviate significantly from the training viewpoints. This issue stems from the sparse training views captured by a fixed camera on a moving vehicle. To tackle this problem, we propose a novel approach that enhances the capacity of 3DGS by leveraging prior from a Diffusion Model along with complementary multi-modal data. Specifically, we first fine-tune a Diffusion Model by adding images from adjacent frames as a condition, while exploiting depth data from LiDAR point clouds to supply additional spatial information. Then we apply the Diffusion Model to regularize the 3DGS at unseen views during training. Experimental results validate the effectiveness of our method compared with current state-of-the-art models, and demonstrate its advance in rendering images from broader views. This paper proposes a novel method, SGD, that leverages a fine-tuned Diffusion Model to enhance the free-view rendering capabilities of 3D Gaussian Splatting for street view synthesis. Current neural rendering methods for street view synthesis struggle to maintain quality at viewpoints far from training views due to the limited perspective of vehicle-captured data. This limits their use in autonomous driving simulations which require high-quality rendering from diverse perspectives. The method fine-tunes a Stable Diffusion Model on driving scenes using adjacent frames as context and LiDAR data for spatial guidance. This fine-tuned model then regularizes the 3DGS training by providing priors for unseen views. SGD outperforms state-of-the-art methods in sparse-view settings on KITTI and KITTI-360 datasets. The method significantly improves rendering quality at novel viewpoints distant from training views. SGD preserves the real-time inference speed of 3DGS, making it suitable for driving simulations. The integration of the Diffusion Model increases training time due to the denoising process. Future work includes exploring more efficient training strategies. novel view synthesis, 3d gaussian splatting, diffusion models, autonomous driving simulation, sparse-view reconstruction
2403.20034 Report NeSLAM: Neural Implicit Mapping and Self-Supervised Feature Tracking With Depth Completion and Denoising Tianchen Deng, Yanbo Wang, Hongle Xie, Hesheng Wang, Jingchuan Wang, Danwei Wang, Weidong Chen In recent years, there have been significant advancements in 3D reconstruction and dense RGB-D SLAM systems. One notable development is the application of Neural Radiance Fields (NeRF) in these systems, which utilizes implicit neural representation to encode 3D scenes. This extension of NeRF to SLAM has shown promising results. However, the depth images obtained from consumer-grade RGB-D sensors are often sparse and noisy, which poses significant challenges for 3D reconstruction and affects the accuracy of the representation of the scene geometry. Moreover, the original hierarchical feature grid with occupancy value is inaccurate for scene geometry representation. Furthermore, the existing methods select random pixels for camera tracking, which leads to inaccurate localization and is not robust in real-world indoor environments. To this end, we present NeSLAM, an advanced framework that achieves accurate and dense depth estimation, robust camera tracking, and realistic synthesis of novel views. First, a depth completion and denoising network is designed to provide dense geometry prior and guide the neural implicit representation optimization. Second, the occupancy scene representation is replaced with Signed Distance Field (SDF) hierarchical scene representation for high-quality reconstruction and view synthesis. Furthermore, we also propose a NeRF-based self-supervised feature tracking algorithm for robust real-time tracking. Experiments on various indoor datasets demonstrate the effectiveness and accuracy of the system in reconstruction, tracking quality, and novel view synthesis. NeSLAM, a dense RGB-D SLAM system for accurate and robust 3D reconstruction and novel view synthesis using neural implicit mapping and self-supervised feature tracking. Existing dense SLAM systems struggle with sparse, noisy depth images from consumer-grade sensors and inaccurate camera tracking in complex indoor environments. This work aims to address these limitations. The system features a depth completion and denoising network for improved geometry prior, utilizes Signed Distance Field (SDF) for enhanced scene representation, and incorporates a NeRF-based self-supervised feature tracking algorithm for robust pose estimation. Achieves more accurate and complete 3D reconstructions compared to existing NeRF-based SLAM methods like iMAP and NICE-SLAM. Demonstrates superior camera tracking accuracy, outperforming other NeRF-based SLAM systems and achieving competitive results compared to traditional methods like ORB-SLAM2. Generates higher-fidelity novel views with better clarity and completeness, as evidenced by qualitative and quantitative (PSNR) evaluation on various datasets. The system is currently limited to static environments and does not handle dynamic objects. Future work will explore extending the approach to dynamic scenes and improving computational efficiency. slam, nerf, depth completion, feature tracking, 3d reconstruction
2403.20032 Report HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes Zhuopeng Li, Yilin Zhang, Chenming Wu, Jianke Zhu, Liangjun Zhang The rapid growth of 3D Gaussian Splatting (3DGS) has revolutionized neural rendering, enabling real-time production of high-quality renderings. However, the previous 3DGS-based methods have limitations in urban scenes due to reliance on initial Structure-from-Motion (SfM) points and difficulties in rendering distant, sky and low-texture areas. To overcome these challenges, we propose a hybrid optimization method named HO-Gaussian, which combines a grid-based volume with the 3DGS pipeline. HO-Gaussian eliminates the dependency on SfM point initialization, allowing for rendering of urban scenes, and incorporates Point Densification to enhance rendering quality in problematic regions during training. Furthermore, we introduce Gaussian Direction Encoding as an alternative to spherical harmonics in the rendering pipeline, which enables view-dependent color representation. To account for multi-camera systems, we introduce neural warping to enhance object consistency across different cameras. Experimental results on widely used autonomous driving datasets demonstrate that HO-Gaussian achieves photo-realistic rendering in real-time on multi-camera urban datasets. This paper presents HO-Gaussian, a hybrid optimization method for novel view rendering of multi-camera urban scenes that combines a grid-based volume with a 3D Gaussian Splatting pipeline. Existing 3D Gaussian Splatting (3DGS) methods struggle in urban scenes due to reliance on sparse SfM point initialization and difficulties in rendering distant, sky, and low-texture areas. This limits their effectiveness in large-scale urban environments. HO-Gaussian uses a grid-based volume to learn Gaussian positions and optimize geometric information, enabling point densification in challenging areas. It introduces Gaussian directional encoding (replacing spherical harmonics) for view-dependent color representation and neural warping to enhance object consistency across multiple cameras. HO-Gaussian achieves real-time rendering while maintaining photo-realistic texture details in urban scenes. The method reduces disk space usage compared to traditional 3DGS by employing efficient encoding techniques. Extensive evaluations on Waymo and Argoverse datasets demonstrate superior performance compared to state-of-the-art NeRF-based and 3DGS-based methods. The current implementation relies on a predefined bounding sphere, potentially limiting scalability to even larger scenes. Future work could explore incorporating temporal information and dynamic elements for more comprehensive urban scene rendering. novel view synthesis, urban scenes, gaussian splatting, neural rendering, hybrid optimization
2403.20018 Report SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image Yunhao Li, Xiaodong Wang, Ping Wang, Xin Yuan, Peidong Liu In this paper, we explore the potential of Snapshot Compressive Imaging (SCI) technique for recovering the underlying 3D scene representation from a single temporal compressed image. SCI is a cost-effective method that enables the recording of high-dimensional data, such as hyperspectral or temporal information, into a single image using low-cost 2D imaging sensors. To achieve this, a series of specially designed 2D masks are usually employed, which not only reduces storage requirements but also offers potential privacy protection. Inspired by this, to take one step further, our approach builds upon the powerful 3D scene representation capabilities of neural radiance fields (NeRF). Specifically, we formulate the physical imaging process of SCI as part of the training of NeRF, allowing us to exploit its impressive performance in capturing complex scene structures. To assess the effectiveness of our method, we conduct extensive evaluations using both synthetic data and real data captured by our SCI system. Extensive experimental results demonstrate that our proposed approach surpasses the state-of-the-art methods in terms of image reconstruction and novel view image synthesis. Moreover, our method also exhibits the ability to restore high frame-rate multi-view consistent images by leveraging SCI and the rendering capabilities of NeRF. The code is available at https://github.com/WU-CVGL/SCINeRF. This paper introduces SCINeRF, a novel method to recover 3D scene representations and multi-view images from a single snapshot compressed image. This method addresses limitations of existing SCI image reconstruction techniques that do not consider 3D scene structure and multi-view consistency. SCINeRF leverages NeRF to represent the scene and jointly optimizes NeRF parameters and camera poses by minimizing the difference between a synthesized compressed image and the actual measurement. SCINeRF achieves superior performance over state-of-the-art SCI image restoration methods on both synthetic and real datasets. The method shows robustness to high compression ratios, maintaining high image quality even with increased compression. Experimental results demonstrate the importance of considering 3D scene structure for accurate and consistent multi-view image recovery from SCI data. The rendering process in SCINeRF may introduce a marginal loss of image information compared to direct recovery methods. Future work will focus on improving the capturing and reconstruction speed and exploring applications in dynamic scene capture. neural radiance fields, nerf, snapshot compressive imaging, sci, 3d scene representation
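The key modeling step is to put the SCI forward model inside the training objective: NeRF renders the frames at the sub-exposure camera poses, the known modulation masks weight and sum them, and the synthesized measurement is compared against the single captured snapshot. A minimal sketch, assuming a plain MSE objective, is below; the paper additionally optimizes camera poses jointly.

```python
import torch

def sci_measurement(frames, masks):
    """Forward model of snapshot compressive imaging.

    frames: (T, H, W) frames rendered by NeRF at the sub-exposure poses
    masks:  (T, H, W) binary modulation masks of the SCI system
    The single compressed measurement is the mask-modulated sum over time.
    """
    return (frames * masks).sum(dim=0)

def sci_loss(rendered_frames, masks, measured):
    """Compare the synthesized measurement with the captured snapshot (MSE form is an assumption)."""
    return torch.mean((sci_measurement(rendered_frames, masks) - measured) ** 2)
```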
2403.20002 Report Grounding and Enhancing Grid-based Models for Neural Fields Zelin Zhao, Fenglei Fan, Wenlong Liao, Junchi Yan Many contemporary studies utilize grid-based models for neural field representation, but a systematic analysis of grid-based models is still missing, hindering the improvement of those models. Therefore, this paper introduces a theoretical framework for grid-based models. This framework points out that these models' approximation and generalization behaviors are determined by grid tangent kernels (GTK), which are intrinsic properties of grid-based models. The proposed framework facilitates a consistent and systematic analysis of diverse grid-based models. Furthermore, the introduced framework motivates the development of a novel grid-based model named the Multiplicative Fourier Adaptive Grid (MulFAGrid). The numerical analysis demonstrates that MulFAGrid exhibits a lower generalization bound than its predecessors, indicating its robust generalization performance. Empirical studies reveal that MulFAGrid achieves state-of-the-art performance in various tasks, including 2D image fitting, 3D signed distance field (SDF) reconstruction, and novel view synthesis, demonstrating superior representation ability. The project website is available at https://sites.google.com/view/cvpr24-2034-submission/home. This paper introduces a theoretical framework for grid-based neural field models based on grid tangent kernels (GTKs), and proposes a novel model named Multiplicative Fourier Adaptive Grid (MulFAGrid). A systematic analysis of grid-based models, which are computationally efficient for neural field representation, has been missing, hindering their improvement. The paper introduces the concept of GTKs to analyze the training and generalization behaviors of grid-based models. It then proposes MulFAGrid, which leverages multiplicative filters and Fourier features for effective representation learning. MulFAGrid exhibits a wider GTK spectrum in the high-frequency domain, indicating better learning efficiency for high-frequency components. Numerical studies show MulFAGrid has a tighter generalization bound than existing grid-based models. Empirical evaluations demonstrate MulFAGrid achieves state-of-the-art performance in 2D image fitting, 3D SDF reconstruction, and novel view synthesis. The rendering speed of MulFAGrid is lower than some baselines like 3DGS. Further research on improving rendering speed and exploring other applications of the GTK theory is warranted. neural fields, grid-based models, grid tangent kernel, multiplicative filters, fourier features
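The grid tangent kernel is presumably defined by analogy with the neural tangent kernel, i.e. as the inner product of parameter gradients of the grid-based field. A plausible formalization (notation mine, not copied from the paper):

```latex
% f_\theta is the grid-based field; \theta collects the learnable grid values.
K_{\mathrm{GTK}}(x, x') \;=\; \big\langle \nabla_\theta f_\theta(x),\; \nabla_\theta f_\theta(x') \big\rangle
```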
2403.19985 Report Stable Surface Regularization for Fast Few-Shot NeRF Byeongin Joung, Byeong-Uk Lee, Jaesung Choe, Ukcheol Shin, Minjun Kang, Taeyeop Lee, In So Kweon, Kuk-Jin Yoon This paper proposes an algorithm for synthesizing novel views under a few-shot setup. The main concept is to develop a stable surface regularization technique called Annealing Signed Distance Function (ASDF), which anneals the surface in a coarse-to-fine manner to accelerate convergence speed. We observe that the Eikonal loss - which is a widely known geometric regularization - requires a dense training signal to shape different level-sets of the SDF, leading to low-fidelity results under few-shot training. In contrast, the proposed surface regularization successfully reconstructs scenes and produces high-fidelity geometry with stable training. Our method is further accelerated by utilizing a grid representation and monocular geometric priors. Finally, the proposed approach is up to 45 times faster than existing few-shot novel view synthesis methods, and it produces comparable results on the ScanNet and NeRF-Real datasets. This paper introduces a novel surface regularization technique called Annealing Signed Distance Function (ASDF) for fast few-shot novel view synthesis. Existing methods struggle with few-shot novel view synthesis due to the difficulty of extracting reliable geometry information from sparse input views, leading to unstable optimization and low-fidelity results. The ASDF loss enforces adaptive geometric smoothing in a coarse-to-fine manner by gradually reducing the smoothing area during training. This allows the network to first learn the overall structure and then progressively recover detailed geometry. The method utilizes multi-level voxel grids and monocular geometric priors, and combines the ASDF loss with rendering losses for color, depth, and surface normal. The ASDF loss leads to more stable optimization compared to the conventional Eikonal loss in few-shot scenarios. The proposed method achieves comparable performance to state-of-the-art few-shot NeRF methods while being up to 45 times faster. The approach demonstrates robustness in reconstructing and synthesizing novel views, particularly in homogeneous regions and scenes with limited viewing directions. The Annealing Signed Distance Function (ASDF) loss requires hyperparameter tuning depending on scene geometry and SfM results. Future work could explore adaptive methods for hyperparameter selection and integrate recent advancements like hash encoding for further optimization speed improvements. novel view synthesis, neural radiance fields (nerf), few-shot learning, surface regularization, geometric priors
2403.19975 Report Context-Aware Integration of Language and Visual References for Natural Language Tracking Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, Jiming Chen Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methodologies perform language-based and template-based matching for target reasoning separately and merge the matching results from the two sources, which suffer from tracking drift when the language and visual templates misalign with the dynamic target state, as well as from ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking framework with 1) a prompt modulation module to leverage the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues, and 2) a unified target decoding module to integrate the multi-modal reference cues and execute the integrated queries on the search image, directly predicting the target location in an end-to-end manner. This design ensures spatio-temporal consistency by leveraging historical visual information and introduces an integrated solution, generating predictions in a single step. Extensive experiments conducted on TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed approach. The results demonstrate competitive performance against state-of-the-art methods for both tracking and grounding. Proposes QueryNLT, a novel multi-modal tracking framework for tracking by natural language specification (TNL), which leverages the complementarity between visual and language features to improve target localization accuracy. Existing TNL methods suffer from tracking drift due to separate language and template matching, leading to misalignment with the dynamic target state and ambiguity in merging results. 1) Prompt Modulation Module: filters inconsistent descriptions from language and visual references to generate precise, context-aware cues. 2) Unified Target Decoding Module: integrates multi-modal prompts and performs target retrieval from the search image in an end-to-end manner. Achieves competitive performance against state-of-the-art trackers on TNL2K, OTB-Lang, and LaSOT benchmarks. Shows significant improvements over methods relying on separate language and template matching, highlighting the importance of multi-modal integration. Demonstrates robust performance in handling challenging factors such as appearance variations, background clutter, and similar distractors. Limited exploration of more sophisticated language models for richer semantic understanding. Further investigation into incorporating temporal reasoning mechanisms for enhanced long-term tracking. natural language tracking, visual tracking, multi-modal learning, prompt modulation, target decoding
2403.19967 Report Rewrite the Stars Xu Ma, Xiyang Dai, Yue Bai, Yizhou Wang, Yun Fu Recent studies have drawn attention to the untapped potential of the "star operation" (element-wise multiplication) in network design. While intuitive explanations abound, the foundational rationale behind its application remains largely unexplored. Our study attempts to reveal the star operation's ability to map inputs into high-dimensional, non-linear feature spaces -- akin to kernel tricks -- without widening the network. We further introduce StarNet, a simple yet powerful prototype, demonstrating impressive performance and low latency under compact network structure and efficient budget. Like stars in the sky, the star operation appears unremarkable but holds a vast universe of potential. Our work encourages further exploration across tasks, with codes available at https://github.com/ma-xu/Rewrite-the-Stars. This paper investigates the "star" operation (element-wise multiplication) in neural networks, showing it implicitly maps inputs to high-dimensional, non-linear feature spaces similar to kernel methods. Understanding the star operation's power can lead to more efficient and compact network designs. The authors analyze the star operation mathematically, rewrite it to reveal its dimensionality expansion, and compare it to summation in various experiments with a simple network (DemoNet). They also introduce StarNet, a proof-of-concept efficient architecture based on these insights. Star operation consistently outperforms summation in image classification, especially with narrower networks. Visualizations of decision boundaries show the star operation allows for more complex representations, similar to polynomial kernels in SVMs. StarNet achieves competitive performance on ImageNet while being significantly faster than other efficient models with similar complexity. The study primarily focuses on image classification, leaving its generalization to other tasks for future work. While the paper demonstrates the potential of activation-free networks with star operations, further research is needed to fully realize this. element-wise multiplication, star operation, kernel methods, efficient networks, high-dimensional feature spaces
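The "star" operation itself is easy to reproduce: multiply two linear projections of the same input element-wise, which implicitly introduces pairwise cross terms akin to a polynomial kernel. A minimal PyTorch sketch (module and layer names are mine, not the paper's DemoNet/StarNet code):

```python
import torch
import torch.nn as nn

class StarBlock(nn.Module):
    """y = out((W1 x) * (W2 x)): element-wise multiplication of two linear branches."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.f1 = nn.Linear(dim, hidden)
        self.f2 = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The product of two linear maps of x contains pairwise cross terms of the
        # input features, i.e. an implicit (roughly quadratic) feature expansion
        # without widening the network.
        return self.out(self.f1(x) * self.f2(x))

x = torch.randn(8, 64)
print(StarBlock(64, 128)(x).shape)  # torch.Size([8, 64])
```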
2403.19964 Report FairRAG: Fair Human Generation via Fair Retrieval Augmentation Robik Shrestha, Yang Zou, Qiuyu Chen, Zhiheng Li, Yusheng Xie, Siqi Deng Existing text-to-image generative models reflect or even amplify societal biases ingrained in their training data. This is especially concerning for human image generation where models are biased against certain demographic groups. Existing attempts to rectify this issue are hindered by the inherent limitations of the pre-trained models and fail to substantially improve demographic diversity. In this work, we introduce Fair Retrieval Augmented Generation (FairRAG), a novel framework that conditions pre-trained generative models on reference images retrieved from an external image database to improve fairness in human generation. FairRAG enables conditioning through a lightweight linear module that projects reference images into the textual space. To enhance fairness, FairRAG applies simple-yet-effective debiasing strategies, providing images from diverse demographic groups during the generative process. Extensive experiments demonstrate that FairRAG outperforms existing methods in terms of demographic diversity, image-text alignment, and image fidelity while incurring minimal computational overhead during inference. Introduces Fair Retrieval Augmented Generation (FairRAG), a framework that uses retrieved reference images to improve demographic diversity in human image generation, addressing biases in pre-trained text-to-image models. Existing text-to-image models perpetuate societal biases, particularly against certain demographic groups, necessitating fairer generation methods. FairRAG trains a linear layer to project reference images into the textual space of a frozen pre-trained model. It employs debiasing techniques like debiased queries and balanced sampling for fair retrieval, and uses a transfer instruction to guide attribute transfer during generation. FairRAG outperforms baselines in demographic diversity across various professions, improving from 0.341 to 0.438 compared to the best non-RAG method. It also shows improvement in image-text alignment and maintains competitive image fidelity. The framework incurs minimal computational overhead, adding just 0.2 seconds to generate an image compared to the baseline Stable Diffusion model. The current implementation uses a one-to-one image mapping; exploring multiple reference images for conditioning could further enhance diversity. Generated images can still exhibit disfigurements, suggesting the need for incorporating human anatomy knowledge into the models. fairness, text-to-image generation, retrieval augmented generation, demographic diversity, bias mitigation
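The balanced-sampling idea can be illustrated with a toy round-robin sampler that cycles over demographic groups when picking reference images; the grouping function and data structures below are illustrative assumptions, not FairRAG's implementation:

```python
import random
from collections import defaultdict

def balanced_sample(candidates, group_of, k):
    """Pick k reference images while cycling over demographic groups."""
    buckets = defaultdict(list)
    for item in candidates:               # candidates: retrieved reference images
        buckets[group_of(item)].append(item)
    for bucket in buckets.values():
        random.shuffle(bucket)
    picked, groups = [], list(buckets)
    while len(picked) < k and any(buckets[g] for g in groups):
        for g in groups:                  # round-robin over groups -> roughly uniform coverage
            if buckets[g] and len(picked) < k:
                picked.append(buckets[g].pop())
    return picked
```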
2403.19963 Report Efficient Modulation for Vision Networks Xu Ma, Xiyang Dai, Jianwei Yang, Bin Xiao, Yinpeng Chen, Yun Fu, Lu Yuan In this work, we present efficient modulation, a novel design for efficient vision networks. We revisit the modulation mechanism, which operates input through convolutional context modeling and feature projection layers, and fuses features via element-wise multiplication and an MLP block. We demonstrate that the modulation mechanism is particularly well suited for efficient networks and further tailor the modulation design by proposing the efficient modulation (EfficientMod) block, which is considered the essential building block for our networks. Benefiting from the prominent representational ability of modulation mechanism and the proposed efficient design, our network can accomplish better trade-offs between accuracy and efficiency and set new state-of-the-art performance in the zoo of efficient networks. When integrating EfficientMod with the vanilla self-attention block, we obtain the hybrid architecture which further improves the performance without loss of efficiency. We carry out comprehensive experiments to verify EfficientMod's performance. With fewer parameters, our EfficientMod-s performs 0.6 top-1 accuracy better than EfficientFormerV2-s2 and is 25% faster on GPU, and 2.9 better than MobileViTv2-1.0 at the same GPU latency. Additionally, our method presents a notable improvement in downstream tasks, outperforming EfficientFormerV2-s by 3.6 mIoU on the ADE20K benchmark. Code and checkpoints are available at https://github.com/ma-xu/EfficientMod. This paper proposes Efficient Modulation (EfficientMod), a novel convolutional block designed for efficient vision networks. EfficientMod leverages a modulation mechanism with tailored context modeling and feature projection for enhanced efficiency. Existing efficient networks with attention mechanisms or convolutional alternatives often suffer from high computational costs. This work addresses this by introducing an efficient modulation block that balances performance and efficiency. The authors revisit the modulation mechanism used in FocalNet and VAN, simplifying the context modeling branch and streamlining the overall design to reduce computational overhead while retaining desirable properties like dynamics and large receptive fields. EfficientMod achieves state-of-the-art performance on ImageNet-1K, outperforming EfficientFormerV2-S2 by 0.6% top-1 accuracy while being 25% faster on GPU. The proposed method demonstrates significant improvements in downstream tasks, surpassing EfficientFormerV2 by 3.6 mIoU on ADE20K semantic segmentation. Comprehensive ablation studies validate the contribution of each component in EfficientMod and its superiority over alternative designs like MBConv. Further investigation is needed to explore the scalability of EfficientMod and address the latency gap observed with increasing model size. Exploring more efficient ways to expand receptive fields beyond large kernels and attention mechanisms is crucial for future work. efficient networks, convolutional neural networks, modulation mechanism, computer vision, image classification
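The modulation mechanism described above (convolutional context modeling fused with a projected feature by element-wise multiplication, followed by an MLP) can be sketched roughly as follows; the specific layer choices are mine and only approximate the EfficientMod block:

```python
import torch
import torch.nn as nn

class ModulationBlock(nn.Module):
    """ctx(x) * v(x), then a pointwise MLP: a simplified modulation block."""
    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        self.ctx = nn.Sequential(                       # context modeling branch
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),
        )
        self.v = nn.Conv2d(dim, dim, 1)                 # value projection
        self.proj = nn.Conv2d(dim, dim, 1)
        self.mlp = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(), nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.proj(self.ctx(x) * self.v(x))      # modulation: element-wise fusion
        return x + self.mlp(x)

print(ModulationBlock(32)(torch.randn(1, 32, 14, 14)).shape)  # torch.Size([1, 32, 14, 14])
```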
2403.19926 Report Video-Based Human Pose Regression via Decoupled Space-Time Aggregation Jijie He, Wenwu Yang By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations, such as occlusion, motion blur, and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence. In light of this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint, thereby avoiding the conflation of spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism, our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods, DSTA significantly enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore, our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page: https://github.com/zgspose/DSTA. This paper presents DSTA, a novel regression-based framework for multi-person pose estimation in video sequences, which efficiently leverages temporal dependencies while reducing computational overhead common in heatmap-based methods. Existing multi-frame pose estimation methods rely heavily on heatmaps, leading to high computation and storage costs that limit their application in real-time video scenarios, especially on edge devices. This work explores a more efficient and flexible regression-based approach for this task. The proposed DSTA method decouples the modeling of spatial and temporal dependencies in human pose estimation. It first extracts joint-specific feature tokens from backbone features using a Joint-centric Feature Decoder (JFD). Then, a Space-Time Decoupling (STD) module with a joint-wise local-awareness attention mechanism separately captures spatial dependencies between adjacent joints and temporal dependencies of each joint across frames. Finally, aggregated spatiotemporal features are used to directly regress joint coordinates. DSTA significantly outperforms previous image-based regression methods, demonstrating the importance of incorporating temporal information. DSTA achieves comparable or superior performance to state-of-the-art heatmap-based methods on challenging benchmarks like PoseTrack, while being significantly more computationally efficient. DSTA exhibits strong robustness to low-resolution inputs, making it particularly suitable for resource-constrained scenarios. The performance improvement from capturing spatial context is limited as the extracted joint tokens already contain some spatial information. Future work could explore more sophisticated JFD modules to further enhance the model's representational capacity. human pose estimation, video understanding, regression-based methods, spatiotemporal modeling, efficient deep learning
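A rough sketch of the decoupled space-time idea: keep one feature token per joint per frame, attend across joints within a frame (space) and across frames per joint (time), then regress coordinates. This uses plain multi-head attention in place of the paper's joint-wise local-awareness attention, so it is only an approximation:

```python
import torch
import torch.nn as nn

class DecoupledSpaceTime(nn.Module):
    """Separate attention over joints (space) and over frames per joint (time)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)                    # regress (x, y) per joint

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, J, C) -- one feature token per joint per frame
        B, T, J, C = tokens.shape
        s = tokens.reshape(B * T, J, C)                  # attend across joints within a frame
        s = s + self.spatial(s, s, s)[0]
        t = s.reshape(B, T, J, C).permute(0, 2, 1, 3).reshape(B * J, T, C)
        t = t + self.temporal(t, t, t)[0]                # attend across frames for each joint
        out = t[:, -1]                                   # token of the current (last) frame
        return self.head(out).reshape(B, J, 2)

print(DecoupledSpaceTime(64)(torch.randn(2, 5, 17, 64)).shape)  # torch.Size([2, 17, 2])
```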
2403.19924 Report SceneTracker: Long-term Scene Flow Estimation Network Bo Wang, Jian Li, Yang Yu, Li Liu, Zhenping Sun, Dewen Hu Considering the complementarity of scene flow estimation in the spatial domain's focusing capability and 3D object tracking in the temporal domain's coherence, this study aims to address a comprehensive new task that can simultaneously capture fine-grained and long-term 3D motion in an online manner: long-term scene flow estimation (LSFE). We introduce SceneTracker, a novel learning-based LSFE network that adopts an iterative approach to approximate the optimal trajectory. Besides, it dynamically indexes and constructs appearance and depth correlation features simultaneously and employs the Transformer to explore and utilize long-range connections within and between trajectories. With detailed experiments, SceneTracker shows superior capabilities in handling 3D spatial occlusion and depth noise interference, highly tailored to the LSFE task's needs. Finally, we build the first real-world evaluation dataset, LSFDriving, further substantiating SceneTracker's commendable generalization capacity. The code and data for SceneTracker is available at https://github.com/wwsource/SceneTracker. This paper introduces the novel task of Long-Term Scene Flow Estimation (LSFE) and proposes SceneTracker, a learning-based network to estimate the 3D trajectory of a target point over a video sequence. LSFE bridges the gap between Scene Flow Estimation, focusing on instantaneous motion, and 3D Object Tracking, limited to bounding boxes, by enabling fine-grained long-term 3D motion capture for comprehensive scene understanding. SceneTracker employs an iterative approach with a sliding window mechanism, dynamically constructing appearance and depth correlation features, and leveraging Transformer to model long-range dependencies within and across trajectories. SceneTracker significantly outperforms scene flow and tracking-based baselines on the synthetic LSFOdyssey dataset, demonstrating robustness against occlusion and depth noise. The paper introduces the first real-world LSFE dataset, LSFDriving, featuring annotated 3D trajectories for static backgrounds, moving vehicles, and non-rigid pedestrians. Evaluation on LSFDriving showcases SceneTracker's generalization ability from synthetic to real-world data, achieving promising results even for challenging non-rigid motions. The reliance on dense depth maps, obtained through completion methods for real-world data, introduces potential limitations. Future work could explore event cameras or multi-view settings to enhance robustness and accuracy, particularly for non-rigid motion estimation. scene flow estimation, 3d object tracking, long-term scene flow estimation, transformer, autonomous driving
2403.19919 Report Diff-Reg v1: Diffusion Matching Model for Registration Problem Qianliang Wu, Haobo Jiang, Lei Luo, Jun Li, Yaqing Ding, Jin Xie, Jian Yang Establishing reliable correspondences is essential for registration tasks such as 3D and 2D3D registration. Existing methods commonly leverage geometric or semantic point features to generate potential correspondences. However, these features may face challenges such as large deformation, scale inconsistency, and ambiguous matching problems (e.g., symmetry). Additionally, many previous methods, which rely on single-pass prediction, may struggle with local minima in complex scenarios. To mitigate these challenges, we introduce a diffusion matching model for robust correspondence construction. Our approach treats correspondence estimation as a denoising diffusion process within the doubly stochastic matrix space, which gradually denoises (refines) a doubly stochastic matching matrix to the ground-truth one for high-quality correspondence estimation. It involves a forward diffusion process that gradually introduces Gaussian noise into the ground truth matching matrix and a reverse denoising process that iteratively refines the noisy matching matrix. In particular, the feature extraction from the backbone occurs only once during the inference phase. Our lightweight denoising module utilizes the same feature at each reverse sampling step. Evaluation of our method on both 3D and 2D3D registration tasks confirms its effectiveness. Introduces Diff-Reg, a novel diffusion matching model for robust correspondence construction in 3D and 2D3D registration tasks. Addresses challenges of existing methods in handling large deformation, scale inconsistency, and ambiguous matching in registration tasks by treating correspondence estimation as a denoising diffusion process. Utilizes a diffusion model within the doubly stochastic matrix space, iteratively refining a noisy matching matrix to the ground truth for optimal correspondence estimation. Employs a lightweight denoising module with Sinkhorn Projection, Weighted SVD, Warping Function, Denoising Transformer, and Matching function. Achieves state-of-the-art performance on 4DMatch and 4DLoMatch benchmarks for non-rigid registration, demonstrating improved handling of large deformation and low overlap. Outperforms single-pass baselines on 3DMatch benchmark for rigid registration, highlighting the effectiveness of iterative refinement through reverse denoising sampling. Shows promising results on the challenging RGB-D Scenes V2 benchmark for 2D3D registration, effectively addressing scale ambiguity issues. Limited performance on 3DLoMatch due to the absence of specialized geometric embedding in the feature backbone. Generic transformer design in the denoising module might benefit from incorporating task-specific priors for further improvements, especially for challenging local non-rigid motions. 3d registration, 2d3d registration, diffusion model, correspondence estimation, doubly stochastic matrix
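The Sinkhorn projection used in the denoising module pushes a score matrix toward the doubly stochastic set by alternating row and column normalization. A standard log-space sketch (the iteration count is an arbitrary choice, not the paper's setting):

```python
import torch

def sinkhorn(log_scores: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternately normalize rows and columns in log space so the result
    approaches a doubly stochastic matrix (rows and columns summing to 1)."""
    log_p = log_scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # columns
    return log_p.exp()

m = sinkhorn(torch.randn(128, 128))
print(m.sum(dim=0)[:3], m.sum(dim=1)[:3])  # both close to 1 everywhere
```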
2403.19898 Report Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting Haipeng Liu, Yang Wang, Biao Qian, Meng Wang, Yong Rui Denoising diffusion probabilistic models for image inpainting aim to add noise to the image texture during the forward process and recover the masked regions from the unmasked texture via the reverse denoising process. Despite generating meaningful semantics, existing methods suffer from a semantic discrepancy between masked and unmasked regions, since the semantically dense unmasked texture fails to be completely degraded while the masked regions turn into pure noise during the diffusion process, leading to a large discrepancy between them. In this paper, we aim to answer how unmasked semantics guide the texture denoising process, together with how to tackle the semantic discrepancy, to facilitate consistent and meaningful semantics generation. To this end, we propose a novel structure-guided diffusion model named StrDiffusion to reformulate the conventional texture denoising process under structure guidance and derive a simplified denoising objective for image inpainting, while revealing: 1) the semantically sparse structure is beneficial to tackle the semantic discrepancy in the early stage, while dense texture generates reasonable semantics in the late stage; 2) the semantics from unmasked regions essentially offer time-dependent structure guidance for the texture denoising process, benefiting from the time-dependent sparsity of the structure semantics. For the denoising process, a structure-guided neural network is trained to estimate the simplified denoising objective by exploiting the consistency of the denoised structure between masked and unmasked regions. Besides, we devise an adaptive resampling strategy as a formal criterion for whether the structure is competent to guide the texture denoising process, while regulating their semantic correlations. Extensive experiments validate the merits of StrDiffusion over the state-of-the-art. Our code is available at https://github.com/htyjers/StrDiffusion. This paper proposes StrDiffusion, a novel structure-guided diffusion model for image inpainting that leverages structure guidance to improve semantic consistency between masked and unmasked regions during the denoising process. Existing diffusion-based inpainting methods often produce semantically meaningful results but struggle to maintain consistency between the restored and original image regions, especially with dense textures. The authors reformulate the traditional texture denoising process by incorporating guidance from a progressively sparser structure representation. This structure guides a time-dependent noise network to estimate a simplified denoising objective, balancing semantic consistency and meaningful generation. StrDiffusion demonstrates superior performance over state-of-the-art methods in terms of PSNR, SSIM, and FID scores. The proposed method effectively mitigates semantic discrepancy issues between masked and unmasked regions. An adaptive resampling strategy further enhances performance by regulating the semantic correlation between denoised texture and structure. The computational cost of StrDiffusion is higher than some competing methods due to the use of both structure and texture diffusion processes. Future work could explore extending StrDiffusion to other image restoration tasks beyond inpainting. image inpainting, diffusion models, structure guidance, semantic consistency, adaptive resampling
2403.19888 Report MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection Ali Behrouz, Michele Santacatterina, Ramin Zabih Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space in input size, limiting their scalability for long-sequence modeling. State Space Models (SSMs), and more specifically Selective SSMs (S6), with efficient hardware-aware implementation, have shown promising potential for long causal sequence modeling. They, however, use separate blocks for each channel and fail to filter irrelevant channels and capture inter-channel dependencies. A natural attempt to mix information across channels using MLP, attention, or SSMs results in further instability in the training of SSMs for large networks and/or nearly doubles the number of parameters. We present the MambaMixer block, a new SSM-based architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called the Selective Token and Channel Mixer. To mitigate doubling the number of parameters, we present a new non-causal heuristic of the S6 block using quasi-separable kernels with a hardware-friendly implementation. We further present an efficient variant of MambaMixer, called QSMixer, that mixes information along both sequence and embedding dimensions. As a proof of concept, we design Vision MambaMixer (ViM2) and Vision QSMixer (ViQS) architectures. To enhance their ability to capture spatial information in images, we present Switch of Scans (SoS) that dynamically uses a set of useful image scans to traverse image patches. We evaluate the performance of our methods in image classification, segmentation, and object detection. Our results underline the importance of selectively mixing across both tokens and channels and show the competitive (resp. superior) performance of our methods with well-established vision models (resp. SSM-based models). The paper introduces MambaMixer and QSMixer, two novel sequence modeling architectures based on selective state space models (SSMs) with dual selection mechanisms across both channels and tokens, enabling efficient and effective information mixing and filtering. Existing SSM-based models lack channel mixing, limiting their performance and stability in multi-dimensional data like images and videos. MambaMixer and QSMixer address this by selectively mixing information across both channels and tokens, improving performance and efficiency in vision tasks. The authors leverage quasi-separable matrices as a heuristic for non-causal selective channel mixing, leading to a hardware-friendly linear-time training. For vision tasks, they design ViM2 and ViQS models based on MambaMixer and QSMixer, incorporating a Switch of Scans (SoS) module for dynamic scan selection and a gating mechanism with multi-resolution convolutions to enhance receptive fields. MambaMixer and QSMixer outperform existing SSM-based models in image classification on ImageNet and sCIFAR datasets, highlighting the importance of selective channel mixing. ViM2 and ViQS achieve competitive performance compared to well-established vision models in image classification, object detection, and semantic segmentation tasks, with superior efficiency in terms of FLOPs and memory usage. The quasi-separable formulation of channel mixing significantly improves throughput compared to traditional scan-based implementations. The study primarily focuses on vision tasks, leaving the evaluation of selective channel mixing on NLP tasks for future work. Further exploration of techniques to enhance the efficiency of ViM2 and ViQS, beyond the current simple architecture, is a potential direction for future research. sequence modeling, state space models, vision transformers, channel mixing, quasi-separable matrices
2403.19866 Report Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization Yuhang Li, Xin Dong, Chen Chen, Jingtao Li, Yuxin Wen, Michael Spranger, Lingjuan Lyu Synthetic image data generation represents a promising avenue for training deep learning models, particularly in the realm of transfer learning, where obtaining real images within a specific domain can be prohibitively expensive due to privacy and intellectual property considerations. This work delves into the generation and utilization of synthetic images derived from text-to-image generative models in facilitating transfer learning paradigms. Despite the high visual fidelity of the generated images, we observe that their naive incorporation into existing real-image datasets does not consistently enhance model performance due to the inherent distribution gap between synthetic and real images. To address this issue, we introduce a novel two-stage framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability and subsequently uses real data for rapid adaptation. Alongside, we propose a dataset style inversion strategy to improve the stylistic alignment between synthetic and real images. Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements, with up to a 30% accuracy increase on classification tasks. Intriguingly, we note that the enhancements were not yet saturated, indicating that the benefits may further increase with an expanded volume of synthetic data. This paper explores using synthetic image data generated by text-to-image models to enhance transfer learning performance in computer vision. Transfer learning relies on large datasets, which are often expensive or difficult to acquire for specific domains. Synthetic data offers a solution. The authors introduce a two-stage 'bridged transfer' framework. First, an ImageNet-pretrained model is fine-tuned on synthetic data. Second, the model is further fine-tuned on the target domain's real data. They also propose a 'Dataset Style Inversion' technique to align synthetic images' style with the target domain. Simply mixing real and synthetic data hurts performance due to distribution mismatch. Bridged transfer improves model transferability and achieves faster convergence on real data. Dataset Style Inversion further improves accuracy by aligning synthetic and real image styles. The study primarily focuses on image classification tasks. Future work can investigate extending these techniques to other computer vision tasks. transfer learning, synthetic data, text-to-image generation, dataset style inversion, computer vision
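The bridged-transfer recipe is essentially two sequential fine-tuning stages. A schematic PyTorch sketch, with dataloaders, epochs, and learning rates as placeholders rather than the paper's settings:

```python
import torch
import torch.nn as nn

def finetune(model: nn.Module, loader, epochs: int, lr: float, device: str = "cuda"):
    """One plain supervised fine-tuning stage."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train().to(device)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()
    return model

# Stage 1: bridge on synthetic images generated for the target classes.
# Stage 2: rapidly adapt on the (smaller) real target dataset.
# model = finetune(model, synthetic_loader, epochs=10, lr=0.01)
# model = finetune(model, real_loader, epochs=5, lr=0.001)
```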
2403.19838 Report Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi Vision-Language Models (VLMs) and Multi-Modal Language models (MMLMs) have become prominent in autonomous driving research, as these models can provide interpretable textual reasoning and responses for end-to-end autonomous driving safety tasks using traffic scene images and other data modalities. However, current approaches to these systems use expensive large language model (LLM) backbones and image encoders, making such systems unsuitable for real-time autonomous driving systems where tight memory constraints exist and fast inference time is necessary. To address these previous issues, we develop EM-VLM4AD, an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving. In comparison to previous approaches, EM-VLM4AD requires at least 10 times less memory and floating point operations, while also achieving higher CIDEr and ROUGE-L scores than the existing baseline on the DriveLM dataset. EM-VLM4AD also exhibits the ability to extract relevant information from traffic views related to prompts and can answer questions for various autonomous driving subtasks. We release our code to train and evaluate our model at https://github.com/akshaygopalkr/EM-VLM4AD. This paper introduces EM-VLM4AD, an efficient multi-frame vision language model for Visual Question Answering (VQA) in autonomous driving, designed to be lightweight and computationally less demanding than current models. Current VLM and MMLM models for autonomous driving rely on large, computationally expensive backbones, making them unsuitable for real-time applications in vehicles with limited resources. This work addresses this by proposing a smaller and more efficient model. EM-VLM4AD uses a pretrained ViT model for image encoding and T5 (Base or quantized Large) as the LM backbone. It employs a two-stage training process: 1) align multi-view image embeddings with LM embeddings and 2) finetune the LM. The model is trained and evaluated on the DriveLM dataset. EM-VLM4AD requires at least 10 times less memory and FLOPs compared to existing AD-VLMs. Despite being smaller, EM-VLM4AD achieves higher CIDEr and ROUGE scores than the DriveLM baseline. The model demonstrates the ability to process information from multiple camera views and answer diverse questions related to autonomous driving tasks. EM-VLM4AD struggles with questions related to predicting ego-vehicle behavior, possibly due to the lack of temporal context. Future work includes extending the model to process video inputs for better handling of temporal information and incorporating multimodal retrieval augmented generation for improved context awareness. vision language models, multimodal learning, autonomous driving, visual question answering, efficient ai
2403.19811 Report X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos has been largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves generalization of the pre-trained VLMs by disentangling learnable temporal modeling and frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for fine-grained cross-dataset action generalization, demonstrating the effectiveness of our method. Code is available at https://github.com/annusha/xmic. This paper proposes X-MIC, a cross-modal adaptation framework for vision-language models, to improve egocentric video classification through aligning frozen text embeddings to videos. Egocentric action recognition suffers from domain gaps between web and egocentric data, making zero-shot generalization challenging. This paper aims to address this for real-world applications. The method uses a video adapter to learn aligned text embeddings for each egocentric video directly in the shared embedding space, disentangling temporal modeling and the visual encoder. It introduces a novel egocentric spatial-temporal attention module to enhance hand-object interaction information. X-MIC outperforms state-of-the-art VL adaptation methods in both within-dataset and cross-dataset evaluations on Ego4D, Epic-Kitchens, and EGTEA. Using a separate visual encoder like DINO further enhances performance. The ego-spatial-temporal attention module effectively captures hand-object interactions, improving recognition. The method is currently limited to video classification and doesn't cover text-vision tasks like text-to-video retrieval. The impact of different pre-training strategies on verb and noun recognition needs further investigation. egocentric action recognition, vision-language models, cross-modal adaptation, zero-shot learning, attention mechanisms
2403.19797 Report Efficient 3D Instance Mapping and Localization with Neural Fields George Tang, Krishna Murthy Jatavallabhula, Antonio Torralba We tackle the problem of learning an implicit scene representation for 3D instance segmentation from a sequence of posed RGB images. Towards this, we introduce 3DIML, a novel framework that efficiently learns a label field that may be rendered from novel viewpoints to produce view-consistent instance segmentation masks. 3DIML significantly improves upon training and inference runtimes of existing implicit scene representation based methods. Opposed to prior art that optimizes a neural field in a self-supervised manner, requiring complicated training procedures and loss function design, 3DIML leverages a two-phase process. The first phase, InstanceMap, takes as input 2D segmentation masks of the image sequence generated by a frontend instance segmentation model, and associates corresponding masks across images to 3D labels. These almost view-consistent pseudolabel masks are then used in the second phase, InstanceLift, to supervise the training of a neural label field, which interpolates regions missed by InstanceMap and resolves ambiguities. Additionally, we introduce InstanceLoc, which enables near realtime localization of instance masks given a trained label field and an off-the-shelf image segmentation model by fusing outputs from both. We evaluate 3DIML on sequences from the Replica and ScanNet datasets and demonstrate 3DIML's effectiveness under mild assumptions for the image sequences. We achieve a large practical speedup over existing implicit scene representation methods with comparable quality, showcasing its potential to facilitate faster and more effective 3D scene understanding. This paper introduces 3DIML, an efficient two-phase framework for 3D instance segmentation from posed RGB images using a neural label field. Existing neural field-based methods for 3D instance segmentation are computationally expensive and complex to train. 3DIML offers a faster and simpler alternative. 3DIML uses InstanceMap to associate 2D instance masks across images and generate pseudo-labels. Then, InstanceLift, a neural label field, refines these labels for 3D consistency. Finally, InstanceLoc enables fast instance localization in novel views. 3DIML achieves comparable accuracy to state-of-the-art methods like Panoptic Lifting but with significantly faster runtime (14-24x). InstanceLift effectively refines noisy pseudo-labels generated by InstanceMap, improving the overall 3D instance segmentation. InstanceLoc, leveraging a fast 2D instance segmentation model and the trained label field, enables real-time instance localization in novel views. Extreme viewpoint changes in the input sequence can lead to discontinuous 3D instance labels. Future work can focus on improving label consistency in challenging scenarios and exploring alternative neural field architectures for faster inference. 3d instance segmentation, neural fields, instance segmentation, novel view synthesis, scene understanding
2403.19776 Report CLoRA: A Contrastive Approach to Compose Multiple LoRA Models Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag Low-Rank Adaptations (LoRAs) have emerged as a powerful and popular technique in the field of image generation, offering a highly effective way to adapt and refine pre-trained deep learning models for specific tasks without the need for comprehensive retraining. By employing pre-trained LoRA models, such as those representing a specific cat and a particular dog, the objective is to generate an image that faithfully embodies both animals as defined by the LoRAs. However, the task of seamlessly blending multiple concept LoRAs to capture a variety of concepts in one image proves to be a significant challenge. Common approaches often fall short, primarily because the attention mechanisms within different LoRA models overlap, leading to scenarios where one concept may be completely ignored (e.g., omitting the dog) or where concepts are incorrectly combined (e.g., producing an image of two cats instead of one cat and one dog). To overcome these issues, CLoRA addresses them by updating the attention maps of multiple LoRA models and leveraging them to create semantic masks that facilitate the fusion of latent representations. Our method enables the creation of composite images that truly reflect the characteristics of each LoRA, successfully merging multiple concepts or styles. Our comprehensive evaluations, both qualitative and quantitative, demonstrate that our approach outperforms existing methodologies, marking a significant advancement in the field of image generation with LoRAs. Furthermore, we share our source code, benchmark dataset, and trained LoRA models to promote further research on this topic. This paper introduces CLoRA, a novel training-free method that addresses the challenges of composing multiple concept and style LoRAs (Low-Rank Adaptations) simultaneously during test time for image generation. The ability to combine LoRAs is crucial for leveraging compositionality in image generation. It enables users to create personalized and diverse images by combining various concepts and styles encoded in pre-trained LoRAs. CLoRA utilizes contrastive learning and attention map manipulation during test time. It generates multiple prompts with and without LoRA applications, groups attention maps by concept, and uses contrastive loss to guide latent representation updates. This resolves attention overlap and attribute binding issues, ensuring each LoRA contributes correctly to the final image. CLoRA successfully integrates multiple content and style LoRAs, generating images that faithfully reflect the characteristics of each LoRA model. Qualitative comparisons demonstrate CLoRA's superiority over existing methods, showcasing its ability to maintain individual LoRA identities and avoid attribute blending. Quantitative analysis using DINO-based metrics further confirms CLoRA's effectiveness in merging LoRA content, surpassing baselines in fidelity and accuracy. The effectiveness of CLoRA depends on the quality of the input LoRA models. Computational complexity, especially with numerous LoRAs, might impact processing time and resource requirements. image generation, lora, contrastive learning, attention mechanism, compositionality
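The test-time objective can be pictured as a contrastive loss over cross-attention maps grouped by concept: maps belonging to the same LoRA's concept attract, maps of different concepts repel, and the resulting gradient updates the latents. The grouping, temperature, and flattening below are my assumptions, not the released code:

```python
import torch
import torch.nn.functional as F

def grouped_contrastive_loss(attn_maps: torch.Tensor, group_ids: torch.Tensor, tau: float = 0.5):
    """InfoNCE-style loss over attention maps: same-concept maps attract, others repel.
    attn_maps: (N, H, W) cross-attention maps; group_ids: (N,) concept index per map."""
    feats = F.normalize(attn_maps.flatten(1), dim=-1)         # (N, H*W) unit vectors
    sim = feats @ feats.t() / tau                              # pairwise similarities
    eye = torch.eye(len(feats), dtype=torch.bool)
    pos = (group_ids[:, None] == group_ids[None, :]) & ~eye    # same-concept pairs
    loss = 0.0
    for i in range(len(feats)):
        if pos[i].any():
            logits = sim[i][~eye[i]]
            labels = pos[i][~eye[i]].float()
            loss = loss - (labels * F.log_softmax(logits, dim=0)).sum() / labels.sum()
    return loss / len(feats)
```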
2403.19738 Report MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models Hidir Yesiltepe, Kiymet Akdemir, Pinar Yanardag Diffusion-based text-to-image models have rapidly gained popularity for their ability to generate detailed and realistic images from textual descriptions. However, these models often reflect the biases present in their training data, especially impacting marginalized groups. While prior efforts to debias language models have focused on addressing specific biases, such as racial or gender biases, efforts to tackle intersectional bias have been limited. Intersectional bias refers to the unique form of bias experienced by individuals at the intersection of multiple social identities. Addressing intersectional bias is crucial because it amplifies the negative effects of discrimination based on race, gender, and other identities. In this paper, we introduce a method that addresses intersectional bias in diffusion-based text-to-image models by modifying cross-attention maps in a disentangled manner. Our approach utilizes a pre-trained Stable Diffusion model, eliminates the need for an additional set of reference images, and preserves the original quality for unaltered concepts. Comprehensive experiments demonstrate that our method surpasses existing approaches in mitigating both single and intersectional biases across various attributes. We make our source code and debiased models for various attributes available to encourage fairness in generative models and to support further research. This paper introduces MIST, a novel method for mitigating intersectional bias in text-to-image diffusion models by disentangled fine-tuning of cross-attention maps. Addressing intersectional bias in text-to-image models is crucial for ensuring fairness and preventing the amplification of discrimination against individuals at the intersection of multiple marginalized identities. MIST leverages the observation that the token in text embeddings can control image generation in a disentangled way. It optimizes the cross-attention projection matrices by minimizing the difference between the token embeddings of a source prompt and a guidance prompt, thus aligning the model's representation towards the desired, unbiased output. MIST effectively mitigates both single and intersectional biases across various attributes like gender, race, age, and eyeglasses, as demonstrated qualitatively and quantitatively. Compared to existing debiasing methods, MIST achieves superior performance in reducing bias while preserving the fidelity of unrelated concepts, as evidenced by lower biasedness scores and average pixel shifts. Unlike previous methods, MIST doesn't require additional reference images or manually curated preservation lists, making it more practical and scalable. The debiasing capabilities of MIST are limited by the biases present in the pre-trained Stable Diffusion model and the CLIP language model used for evaluation. Future work includes exploring alternative evaluation metrics and addressing potential biases in the evaluation process itself. intersectional bias, text-to-image synthesis, diffusion models, debiasing, fairness
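One way to picture the fine-tuning step: keep frozen copies of the cross-attention key/value projections as targets, and train new copies so that the source prompt's token is mapped to where the guidance prompt's token lands under the original weights. This is only a loose sketch of the idea; the exact loss and the set of updated projections in MIST may differ from this illustration:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttnDebias(nn.Module):
    """Trainable copies of the cross-attention key/value projections; the frozen
    originals provide the target representation of the guidance prompt."""
    def __init__(self, to_k: nn.Linear, to_v: nn.Linear):
        super().__init__()
        self.to_k, self.to_v = copy.deepcopy(to_k), copy.deepcopy(to_v)           # trainable
        self.frozen_k, self.frozen_v = to_k.requires_grad_(False), to_v.requires_grad_(False)

    def loss(self, src_tok: torch.Tensor, guide_tok: torch.Tensor) -> torch.Tensor:
        # Map the source prompt token (e.g. "a doctor") under the trainable projections
        # onto the guidance prompt token (e.g. "a female doctor") under the frozen ones.
        return (F.mse_loss(self.to_k(src_tok), self.frozen_k(guide_tok))
                + F.mse_loss(self.to_v(src_tok), self.frozen_v(guide_tok)))

k, v = nn.Linear(768, 320, bias=False), nn.Linear(768, 320, bias=False)
layer = CrossAttnDebias(k, v)
opt = torch.optim.Adam(list(layer.to_k.parameters()) + list(layer.to_v.parameters()), lr=1e-4)
layer.loss(torch.randn(1, 768), torch.randn(1, 768)).backward()
opt.step()
```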
2403.19716 Report Capability-aware Prompt Reformulation Learning for Text-to-Image Generation Jingtao Zhan, Qingyao Ai, Yiqun Liu, Jia Chen, Shaoping Ma Text-to-image generation systems have emerged as revolutionary tools in the realm of artistic creation, offering unprecedented ease in transforming textual prompts into visual art. However, the efficacy of these systems is intricately linked to the quality of user-provided prompts, which often poses a challenge to users unfamiliar with prompt crafting. This paper addresses this challenge by leveraging user reformulation data from interaction logs to develop an automatic prompt reformulation model. Our in-depth analysis of these logs reveals that user prompt reformulation is heavily dependent on the individual user's capability, resulting in significant variance in the quality of reformulation pairs. To effectively use this data for training, we introduce the Capability-aware Prompt Reformulation (CAPR) framework. CAPR innovatively integrates user capability into the reformulation process through two key components: the Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). CRM reformulates prompts according to a specified user capability, as represented by CCF. The CCF, in turn, offers the flexibility to tune and guide the CRM's behavior. This enables CAPR to effectively learn diverse reformulation strategies across various user capacities and to simulate high-capability user reformulation during inference. Extensive experiments on standard text-to-image generation benchmarks showcase CAPR's superior performance over existing baselines and its remarkable robustness on unseen systems. Furthermore, comprehensive analyses validate the effectiveness of different components. CAPR can facilitate user-friendly interaction with text-to-image systems and make advanced artistic creation more achievable for a broader range of users. This paper presents CAPR, a novel capability-aware prompt reformulation framework for text-to-image generation, trained on user interaction logs to address the challenge of poor prompts from users unfamiliar with prompt crafting. Crafting effective prompts for text-to-image generation systems is difficult for most users, and existing query reformulation techniques don't translate well due to the lack of system feedback and dependence on user capability in this domain. CAPR decomposes the reformulation model into a Conditional Reformulation Model (CRM), trained on prompt pairs and user capability conditions derived from prompt quality metrics, and Configurable Capability Features (CCF) to represent and tune capability levels during inference. CAPR significantly outperforms baselines like GPT-4 and existing reformulation models in improving generation quality. The framework effectively transfers to unseen, more advanced text-to-image generation systems, demonstrating robustness. Analysis shows CRM can be effectively controlled by CCF conditions, even extrapolating beyond training data limitations. The study primarily focuses on overall user satisfaction, requiring further exploration for users with specific needs. Future work can explore incorporating visual feedback to improve reformulation effectiveness. text-to-image generation, prompt reformulation, log analysis, user capability, conditional generation
2403.19653 Report Detecting Image Attribution for Text-to-Image Diffusion Models in RGB and Beyond Katherine Xu, Lingzhi Zhang, Jianbo Shi Modern text-to-image (T2I) diffusion models can generate images with remarkable realism and creativity. These advancements have sparked research in fake image detection and attribution, yet prior studies have not fully explored the practical and scientific dimensions of this task. In addition to attributing images to 12 state-of-the-art T2I generators, we provide extensive analyses on what inference stage hyperparameters and image modifications are discernible. Our experiments reveal that initialization seeds are highly detectable, along with other subtle variations in the image generation process to some extent. We further investigate what visual traces are leveraged in image attribution by perturbing high-frequency details and employing mid-level representations of image style and structure. Notably, altering high-frequency information causes only slight reductions in accuracy, and training an attributor on style representations outperforms training on RGB images. Our analyses underscore that fake images are detectable and attributable at various levels of visual granularity than previously explored. This paper presents an in-depth analysis of detecting and attributing images generated by 12 state-of-the-art text-to-image diffusion models, going beyond RGB analysis by exploring detectable traces in high-frequency perturbations and mid-level representations. This research is crucial for advancing image forensics, copyright protection, and ensuring the integrity of digital content in the age of increasingly sophisticated AI-generated images. The authors generated a dataset of nearly half a million AI-generated images using diverse prompts and hyperparameters. They trained various image attributors (classifiers) and rigorously analyzed their performance under different conditions like hyperparameter variations, post-editing modifications, and varying levels of visual detail. Achieved over 90% accuracy in attributing images to their source generators, significantly outperforming random chance. Demonstrated that even subtle variations in inference-stage hyperparameters, especially initialization seeds, can be detected with high accuracy. Discovered that stylistic representations of images, captured using Gram matrices, are more effective than RGB data for image attribution, indicating unique stylistic fingerprints of generators. Limited exploration of dataset expansion due to budget constraints. Difficulty in explaining the decision-making process of the attributors despite using Grad-CAM visualizations. generative models, image attribution, image forensics, text-to-image synthesis, deep learning
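The "style representation" finding refers to training the attributor on Gram matrices of intermediate CNN features rather than on RGB pixels. A short sketch of extracting such a style descriptor (the VGG backbone and layer cut-off are illustrative choices, not necessarily the paper's):

```python
import torch
import torchvision.models as models

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel feature correlations, the classic 'style' statistic."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)                # (B, C, C)

backbone = models.vgg16(weights=None).features[:16].eval()    # e.g. up to relu3_3
with torch.no_grad():
    feat = backbone(torch.randn(2, 3, 224, 224))
    style = gram_matrix(feat).flatten(1)                      # per-image style vector
print(style.shape)                                            # input to the attribution classifier
```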
2403.19596 Report LocCa: Visual Pretraining with Location-aware Captioners Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai Image captioning has been shown as an effective pretraining method similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa). LocCa uses a simple image captioner task interface, to teach a model to read out rich information, i.e. bounding box coordinates, and captions, conditioned on the image pixel input. Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can easily handle multiple tasks during pretraining. Our experiments demonstrate that LocCa outperforms standard captioners significantly on localization downstream tasks while maintaining comparable performance on holistic tasks. Proposes Location-aware Captioner (LocCa), a visual pretraining method using a multi-task decoder for image captioning, referring expression, and grounded captioning tasks. Enhances visual representations with location-aware context, improving performance on localization downstream tasks without complex model architectures. Pretrains an encoder-decoder model on WebLI dataset with OWL-ViT pseudo annotations, leveraging task-specific prefixes for multitask learning and predicting bounding boxes and captions sequentially. Achieves state-of-the-art results on referring expression comprehension benchmarks (RefCOCO, RefCOCO+, RefCOCOg). Significantly outperforms baselines on referring expression segmentation and object detection. Maintains strong performance on holistic image understanding tasks (image classification, captioning, VQA) and surpasses baselines on object-centric tasks (VQAv2, GQA). Limited exploration of zero-shot object detection capabilities. Current decoding strategy struggles to balance the quantity and quality of predicted boxes and labels. localization, image captioning, vision language models, multitask learning, visual pretraining
2403.19593 Report Frame by Familiar Frame: Understanding Replication in Video Diffusion Models Aimon Rahman, Malsha V. Perera, Vishal M. Patel Building on the momentum of image generation diffusion models, there is an increasing interest in video-based diffusion models. However, video generation poses greater challenges due to its higher-dimensional nature, the scarcity of training data, and the complex spatiotemporal relationships involved. Image generation models, due to their extensive data requirements, have already strained computational resources to their limits. There have been instances of these models reproducing elements from the training samples, leading to concerns and even legal disputes over sample replication. Video diffusion models, which operate with even more constrained datasets and are tasked with generating both spatial and temporal content, may be more prone to replicating samples from their training sets. Compounding the issue, these models are often evaluated using metrics that inadvertently reward replication. In our paper, we present a systematic investigation into the phenomenon of sample replication in video diffusion models. We scrutinize various recent diffusion models for video synthesis, assessing their tendency to replicate spatial and temporal content in both unconditional and conditional generation scenarios. Our study identifies strategies that are less likely to lead to replication. Furthermore, we propose new evaluation strategies that take replication into account, offering a more accurate measure of a model's ability to generate the original content. This paper investigates sample replication in video diffusion models, exploring the extent, frequency, and strategies for mitigation. As video diffusion models gain popularity, it's crucial to understand if they generate truly novel content or simply replicate training data. This has implications for copyright, privacy, and the reliability of AI-generated content. The authors define video replication for different generation contexts (conditional and unconditional). They use the VSSCD metric to detect content replication and analyze FVD scores with augmented input frames to assess motion replication. Additionally, they examine data requirements for unique content generation and compare different model architectures. Video diffusion models trained on limited datasets are prone to replicating content and motion from the training data. Image diffusion models require significantly less data than video diffusion models to generate unique content. Using a pre-trained text-to-image backbone and fine-tuning only the temporal layers can mitigate replication in video diffusion models. Limited access to publicly available video diffusion models and their training data poses challenges for comprehensive analysis. Further research is needed to explore motion replication across varying content and in models trained on large-scale datasets. video diffusion models, sample replication, generative ai, content originality, evaluation metrics
2403.19588 Report DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs Donghyun Kim, Byeongho Heo, Dongyoon Han This paper revives Densely Connected Convolutional Networks (DenseNets) and reveals the underrated effectiveness over predominant ResNet-style architectures. We believe DenseNets' potential was overlooked due to untouched training methods and traditional design elements not fully revealing their capabilities. Our pilot study shows dense connections through concatenation are strong, demonstrating that DenseNets can be revitalized to compete with modern architectures. We methodically refine suboptimal components - architectural adjustments, block redesign, and improved training recipes towards widening DenseNets and boosting memory efficiency while keeping concatenation shortcuts. Our models, employing simple architectural elements, ultimately surpass Swin Transformer, ConvNeXt, and DeiT-III - key architectures in the residual learning lineage. Furthermore, our models exhibit near state-of-the-art performance on ImageNet-1K, competing with the very recent models and downstream tasks, ADE20k semantic segmentation, and COCO object detection/instance segmentation. Finally, we provide empirical analyses that uncover the merits of the concatenation over additive shortcuts, steering a renewed preference towards DenseNet-style designs. Our code is available at https://github.com/naver-ai/rdnet. This paper revitalizes Densely Connected Convolutional Networks (DenseNets) by modernizing their architecture and training methods to compete with prevailing ResNet-like architectures, showing the efficacy of concatenation shortcuts. This is important because it challenges the dominance of additive shortcut-based models and highlights the potential of DenseNet-style designs for enhanced performance and efficiency. The authors conducted a pilot study with thousands of random networks to validate the effectiveness of concatenation shortcuts. They then systematically refined DenseNets by widening the network, improving feature mixers, introducing more transition layers, and employing a patchification stem. The revitalized DenseNets (RDNets) outperform Swin Transformer, ConvNeXt, and DeiT-III on ImageNet-1K benchmark with competitive performance on downstream tasks like ADE20K and COCO. RDNets demonstrate robustness to input image size variations, maintaining accuracy without significant latency or memory increase. Analysis shows RDNets learn distinct features compared to ConvNeXt, highlighting the unique learning dynamics of concatenation-based models. Resource constraints prevented scaling RDNets to extremely large scales like ViT-G. Future work can explore further optimization of training hyperparameters for downstream tasks to achieve maximum precisions. densenets, concatenation shortcuts, image classification, semantic segmentation, object detection
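The contrast this entry draws between concatenation and additive shortcuts is easy to see in code. The block below is a generic DenseNet-style stage with assumed widths and mixer layers, not RDNet's actual block design.

```python
# Minimal sketch of a DenseNet-style block: each stage concatenates its input
# with the newly computed features instead of adding them (ResNet-style).
# Growth rate, depth, and the feature mixer are illustrative assumptions.
import torch
import torch.nn as nn

class ConcatShortcutBlock(nn.Module):
    def __init__(self, in_ch: int, growth: int = 64, layers: int = 3):
        super().__init__()
        self.mixers = nn.ModuleList()
        ch = in_ch
        for _ in range(layers):
            # Feature mixer producing `growth` new channels from all features so far.
            self.mixers.append(nn.Sequential(
                nn.Conv2d(ch, growth, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True),
            ))
            ch += growth  # concatenation grows the channel count
        self.out_channels = ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = x
        for mixer in self.mixers:
            new = mixer(feats)
            feats = torch.cat([feats, new], dim=1)  # concatenation shortcut
        return feats

block = ConcatShortcutBlock(in_ch=64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 256, 56, 56]) -> 64 + 3 * 64
```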
2403.19580 Report OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation Zhenyu Wang, Yali Li, Taichi Liu, Hengshuang Zhao, Shengjin Wang In the current state of 3D object detection research, the severe scarcity of annotated 3D data, substantial disparities across different data modalities, and the absence of a unified architecture, have impeded the progress towards the goal of universality. In this paper, we propose OV-Uni3DETR, a unified open-vocabulary 3D detector via cycle-modality propagation. Compared with existing 3D detectors, OV-Uni3DETR offers distinct advantages: 1) Open-vocabulary 3D detection: During training, it leverages various accessible data, especially extensive 2D detection images, to boost training diversity. During inference, it can detect both seen and unseen classes. 2) Modality unifying: It seamlessly accommodates input data from any given modality, effectively addressing scenarios involving disparate modalities or missing sensor information, thereby supporting test-time modality switching. 3) Scene unifying: It provides a unified multi-modal model architecture for diverse scenes collected by distinct sensors. Specifically, we propose the cycle-modality propagation, aimed at propagating knowledge bridging 2D and 3D modalities, to support the aforementioned functionalities. 2D semantic knowledge from large-vocabulary learning guides novel class discovery in the 3D domain, and 3D geometric knowledge provides localization supervision for 2D detection images. OV-Uni3DETR achieves the state-of-the-art performance on various scenarios, surpassing existing methods by more than 6% on average. Its performance using only RGB images is on par with or even surpasses that of previous point cloud based methods. Code and pre-trained models will be released later. Proposes OV-Uni3DETR, a unified open-vocabulary 3D object detector that leverages cycle-modality propagation for knowledge transfer between 2D and 3D modalities. Addresses limitations of existing 3D detectors, which are restricted to closed-vocabulary detection, specific input modalities, and often limited to either indoor or outdoor scenes. Aims to achieve universality in 3D object detection. Introduces a unified multi-modal architecture that accommodates point clouds and RGB images during training, enabling test-time modality switching. Employs cycle-modality propagation: leverages 2D semantic knowledge for 3D novel class discovery and 3D geometric knowledge for supervising 2D detection without 3D annotations. Achieves state-of-the-art performance on open-vocabulary 3D object detection benchmarks, surpassing previous methods. Demonstrates modality-switching capability, with performance using only RGB images on par with or surpassing point cloud-based methods. Effectively detects objects in both indoor and outdoor scenes, achieving scene-unifying capability. Potential for improvement in handling noisy 3D boxes generated from 2D images. Exploration of incorporating more diverse 2D data sources and larger-scale pre-trained models for enhanced novel class detection. open-vocabulary learning, 3d object detection, multi-modal learning, knowledge distillation, scene understanding
2403.19534 Report Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang Prior studies have made significant progress in image inpainting guided by either text or subject image. However, the research on editing with their combined guidance is still in the early stages. To tackle this challenge, we present LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of masked scene images, incorporating both the textual prompts and specified subjects. Our approach adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence. The process involves (i) Locate: concatenating the noise with masked scene image to achieve precise regional editing, (ii) Assign: employing decoupled cross-attention mechanism to accommodate multi-modal guidance, and (iii) Refine: using a novel RefineNet to supplement subject details. Additionally, to address the issue of scarce training data, we introduce a novel data construction pipeline. This pipeline extracts substantial pairs of data consisting of local text prompts and corresponding visual instances from a vast image dataset, leveraging publicly available large models. Extensive experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency. Project page can be found at https://ali-vilab.github.io/largen-page/. This paper presents LAR-Gen, a novel text-subject-guided image inpainting approach that seamlessly incorporates specified subjects into scene images while adhering to textual prompts, enhancing customized image editing. Existing inpainting methods often struggle to balance subject fidelity and local semantic coherence when guided by both text and subject images. LAR-Gen addresses this gap, enabling more precise and creative image editing. LAR-Gen employs a coarse-to-fine strategy: (i) Locate mechanism confines editing to the masked region, (ii) Assign mechanism uses decoupled cross-attention for multi-modal guidance, and (iii) Refine mechanism leverages an auxiliary U-Net (RefineNet) to enhance subject details. LAR-Gen demonstrates superior performance in preserving both subject identity and text semantic consistency. A novel data construction pipeline is introduced to address data scarcity, extracting region-level quadruples from large image datasets. LAR-Gen acts as a unified framework supporting text-only, image-only, and combined text-subject-guided inpainting within a single model. Subject deformation capabilities are limited due to reliance on a single reference image. The model might prioritize certain conditions over others when multiple conditions conflict. image inpainting, diffusion model, text-subject-guided, customized image editing, multi-modal guidance
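The Assign step's decoupled cross-attention can be pictured as two parallel attention branches, one over text tokens and one over subject-image tokens, whose outputs are summed. The dimensions, head count, and blending weight below are assumptions for illustration, not LAR-Gen's exact module.

```python
# Sketch of decoupled cross-attention: the query from the U-Net latent attends
# separately to text tokens and to subject-image tokens, and the two results
# are combined. Dimensions and the blending weight are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int = 320, ctx_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                               vdim=ctx_dim, batch_first=True)
        self.attn_subj = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                               vdim=ctx_dim, batch_first=True)

    def forward(self, latent_tokens, text_tokens, subject_tokens, subj_scale=1.0):
        # latent_tokens: (B, N, dim); text/subject tokens: (B, L, ctx_dim)
        out_text, _ = self.attn_text(latent_tokens, text_tokens, text_tokens)
        out_subj, _ = self.attn_subj(latent_tokens, subject_tokens, subject_tokens)
        return out_text + subj_scale * out_subj  # decoupled branches, summed

attn = DecoupledCrossAttention()
z = attn(torch.randn(2, 64, 320), torch.randn(2, 77, 768), torch.randn(2, 16, 768))
print(z.shape)  # torch.Size([2, 64, 320])
```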
2403.19522 Report Model Stock: All we need is just a few fine-tuned models Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance. Breaking away from traditional practices that need a multitude of fine-tuned models for averaging, our approach employs significantly fewer models to achieve final weights yet yield superior accuracy. Drawing from key insights in the weight space of fine-tuned weights, we uncover a strong link between the performance and proximity to the center of weight space. Based on this, we introduce a method that approximates a center-close weight using only two fine-tuned models, applicable during or after training. Our innovative layer-wise weight averaging technique surpasses state-of-the-art model methods such as Model Soup, utilizing only two fine-tuned models. This strategy can be aptly coined Model Stock, highlighting its reliance on selecting a minimal number of models to draw a more optimized-averaged model. We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures, achieving remarkable performance on both ID and OOD tasks on the standard benchmarks, all while barely bringing extra computational demands. Our code and pre-trained models are available at https://github.com/naver-ai/model-stock. This paper proposes "Model Stock," an efficient fine-tuning technique for large pre-trained models achieving strong performance in both in-distribution (ID) and out-of-distribution (OOD) settings using significantly fewer models than traditional averaging methods. Fine-tuning is crucial in adapting pre-trained models for specific tasks, impacting both accuracy and robustness against distribution shifts. Model Stock offers an efficient alternative to computationally expensive model averaging techniques like Model Soup. The authors analyze the weight space of fine-tuned models, discovering that: 1) weights lie on a thin shell, and 2) proximity to the center of this shell correlates with improved ID and OOD performance. Leveraging these insights and a pre-trained model as a robust anchor, Model Stock approximates the center with minimal fine-tuned models. Model Stock achieves comparable or superior performance to Model Soup using only two fine-tuned models, significantly reducing computational cost. On CLIP ViT-L/14, Model Stock achieves state-of-the-art 87.8% top-1 accuracy on ImageNet (ID) and 74.9% average on five OOD benchmarks. The method's effectiveness is demonstrated across various CLIP architectures and benchmark datasets. Resource limitations prevented evaluation on larger-scale models beyond ViT-L. Future work will explore applying Model Stock to even larger models like ViT-G. fine-tuning, model averaging, distribution shift, robustness, pre-trained models
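The core operation described here, averaging a couple of fine-tuned checkpoints and pulling the result toward the pre-trained anchor with a layer-wise ratio, can be sketched as follows. The angle-based ratio shown is a plausible reading of the idea and should be treated as an assumption rather than the paper's exact equation.

```python
# Sketch of layer-wise weight merging in the spirit of Model Stock: average two
# fine-tuned models and interpolate toward the pre-trained anchor with a
# per-layer ratio derived from the angle between the two fine-tuned updates.
# The ratio formula below is an assumption for illustration.
import torch

def merge_layerwise(pretrained: dict, finetuned_a: dict, finetuned_b: dict,
                    eps: float = 1e-8) -> dict:
    merged = {}
    for name, w0 in pretrained.items():
        wa, wb = finetuned_a[name], finetuned_b[name]
        da, db = (wa - w0).flatten(), (wb - w0).flatten()   # updates from the anchor
        cos = torch.dot(da, db) / (da.norm() * db.norm() + eps)
        cos = cos.clamp(min=0.0)                            # keep the ratio in [0, 1]
        t = 2.0 * cos / (1.0 + cos + eps)                   # assumed per-layer ratio
        w_avg = 0.5 * (wa + wb)
        merged[name] = t * w_avg + (1.0 - t) * w0           # pull toward the anchor
    return merged

# Usage with toy state dicts sharing the same keys/shapes:
sd0 = {"fc.weight": torch.zeros(4, 4)}
sda = {"fc.weight": torch.randn(4, 4)}
sdb = {"fc.weight": torch.randn(4, 4)}
print(merge_layerwise(sd0, sda, sdb)["fc.weight"].shape)
```

When the two fine-tuned updates point in similar directions the ratio approaches 1 (trust the average); when they disagree it shrinks toward the pre-trained weights, which is the intuition behind using the anchor for robustness.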
2403.19517 Report XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold Guangyu Wang, Jinzhi Zhang, Fan Wang, Ruqi Huang, Lu Fang We propose XScale-NVS for high-fidelity cross-scale novel view synthesis of real-world large-scale scenes. Existing representations based on explicit surface suffer from discretization resolution or UV distortion, while implicit volumetric representations lack scalability for large scenes due to the dispersed weight distribution and surface ambiguity. In light of the above challenges, we introduce hash featurized manifold, a novel hash-based featurization coupled with a deferred neural rendering framework. This approach fully unlocks the expressivity of the representation by explicitly concentrating the hash entries on the 2D manifold, thus effectively representing highly detailed contents independent of the discretization resolution. We also introduce a novel dataset, namely GigaNVS, to benchmark cross-scale, high-resolution novel view synthesis of realworld large-scale scenes. Our method significantly outperforms competing baselines on various real-world scenes, yielding an average LPIPS that is 40% lower than prior state-of-the-art on the challenging GigaNVS benchmark. Please see our project page at: xscalenvs.github.io. This paper proposes XScale-NVS, a novel hash featurized manifold representation, coupled with deferred neural rendering, for high-fidelity cross-scale novel view synthesis of large-scale scenes. Existing methods struggle to represent large-scale scenes with both macro-structure and micro-details. Explicit surface representations suffer from discretization resolution or UV distortion, while implicit volumetric representations lack scalability and have surface ambiguities. The method leverages a pre-computed mesh as a surface proxy. It utilizes volumetric multi-resolution hash encoding to featurize the surface manifold directly. A deferred neural rendering pipeline with surface multisampling and a manifold deformation mechanism decodes the representation. Significantly outperforms prior arts on the challenging GigaNVS benchmark and Tanks & Temples dataset. Reduces average LPIPS by 40% on GigaNVS compared to state-of-the-art. Demonstrates robustness to mesh resolution and superior efficiency. Current method cannot fully address the incompleteness and occlusions caused by incorrect geometry. Future work includes exploring differentiable rendering for better geometry handling. novel view synthesis, neural rendering, large-scale scene representation, hash encoding, deferred rendering
2403.19495 Report CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians Avinash Paliwal, Wei Ye, Jinhui Xiong, Dmytro Kotovenko, Rakesh Ranjan, Vikas Chandra, Nima Khademi Kalantari The field of 3D reconstruction from images has rapidly evolved in the past few years, first with the introduction of Neural Radiance Field (NeRF) and more recently with 3D Gaussian Splatting (3DGS). The latter provides a significant edge over NeRF in terms of the training and inference speed, as well as the reconstruction quality. Although 3DGS works well for dense input images, the unstructured point-cloud like representation quickly overfits to the more challenging setup of extremely sparse input images (e.g., 3 images), creating a representation that appears as a jumble of needles from novel views. To address this issue, we propose regularized optimization and depth-based initialization. Our key idea is to introduce a structured Gaussian representation that can be controlled in 2D image space. We then constraint the Gaussians, in particular their position, and prevent them from moving independently during optimization. Specifically, we introduce single and multiview constraints through an implicit convolutional decoder and a total variation loss, respectively. With the coherency introduced to the Gaussians, we further constrain the optimization through a flow-based loss function. To support our regularized optimization, we propose an approach to initialize the Gaussians using monocular depth estimates at each input view. We demonstrate significant improvements compared to the state-of-the-art sparse-view NeRF-based approaches on a variety of scenes. This paper introduces CoherentGS, a novel approach for sparse novel view synthesis using 3D Gaussian Splatting (3DGS) by enforcing coherency among Gaussians through regularized optimization and depth-based initialization. Existing 3DGS methods struggle with sparse inputs, leading to overfitting and poor novel view quality. NeRF-based alternatives, while designed for sparsity, have limitations in regularization and are not directly applicable to the explicit, unstructured nature of 3DGS. CoherentGS assigns a Gaussian to each input image pixel, initializes their positions using monocular depth, and regularizes optimization using: 1) An implicit decoder for smooth single-view depth residuals. 2) Total variation loss for multi-view consistent depth. 3) Flow-based loss for similar Gaussian positions in corresponding image pairs. Outperforms state-of-the-art sparse-view NeRF methods on LLFF and NVS-RGBD datasets, particularly in perceptual quality (LPIPS). Reconstructs high-quality textures and smooth geometry even with extremely sparse inputs (2-4 images). Identifies occluded regions, enabling targeted inpainting for realistic hallucination of missing details. Struggles with transparent objects due to the single-Gaussian-per-pixel representation. Relies on monocular depth quality, potentially impacting performance with inaccurate estimates. sparse view synthesis, 3d gaussian splatting, implicit decoder, novel view synthesis, 3d reconstruction
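Of the regularizers listed in this entry, the total-variation term on the per-pixel Gaussian depth is the simplest to sketch. The loss below is a plain isotropic TV penalty; any edge-aware weighting or loss balancing is left out as an assumption.

```python
# Generic total-variation regularizer of the kind used to keep per-pixel
# Gaussian depths locally smooth. Weighting and edge-aware masking are
# assumptions omitted here; this is the plain isotropic TV term.
import torch

def depth_tv_loss(depth: torch.Tensor) -> torch.Tensor:
    # depth: (B, 1, H, W) predicted depth (or depth residual) per input view
    dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs().mean()
    dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs().mean()
    return dx + dy

depth = torch.rand(1, 1, 128, 128, requires_grad=True)
loss = depth_tv_loss(depth)
loss.backward()
print(float(loss))
```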
2403.19473 Report Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM Tongyan Hua, Lin Wang Implicit neural representation (INR), in combination with geometric rendering, has recently been employed in real-time dense RGB-D SLAM. Despite active research endeavors being made, there lacks a unified protocol for fair evaluation, impeding the evolution of this area. In this work, we establish, to our knowledge, the first open-source benchmark framework to evaluate the performance of a wide spectrum of commonly used INRs and rendering functions for mapping and localization. The goal of our benchmark is to 1) gain an intuition of how different INRs and rendering functions impact mapping and localization and 2) establish a unified evaluation protocol w.r.t. the design choices that may impact the mapping and localization. With the framework, we conduct a large suite of experiments, offering various insights in choosing the INRs and geometric rendering functions: for example, the dense feature grid outperforms other INRs (e.g. tri-plane and hash grid), even when geometric and color features are jointly encoded for memory efficiency. To extend the findings into the practical scenario, a hybrid encoding strategy is proposed to bring the best of the accuracy and completion from the grid-based and decomposition-based INRs. We further propose explicit hybrid encoding for high-fidelity dense grid mapping to comply with the RGB-D SLAM system that puts the premise on robustness and computation efficiency. This paper introduces the first open-source benchmark framework for evaluating the performance of different Implicit Neural Representations (INRs) and rendering functions within a unified RGB-D SLAM system. A standardized benchmark is crucial for fair comparison and understanding how different INR and rendering choices impact mapping and localization accuracy, especially given the lack of such a framework in active NeRF-SLAM research. The benchmark evaluates various INR structures (MLP, Dense Grid, Sparse Grid, Tri-plane, Factorization) and rendering functions (SDF-based) on two scenarios: a controlled lab setting (Replica dataset) and a practical setting with noisy data and partial scene coverage (NeuralRGBD dataset). Performance is assessed through metrics like ATE, PSNR, Depth L1, Accuracy, Completion, and Completion Ratio. Dense grid representation consistently outperforms other INRs in the lab setting, achieving the best accuracy and speed. Decomposition-based INRs (Tri-plane, Factorization) show advantages in the practical setting with partial scene coverage, indicating better generalization but less accurate than dense grid. A novel "hybrid encoding" strategy combining dense grid and tri-plane achieves superior trajectory estimation and reconstruction fidelity in both scenarios. The benchmark mainly focuses on orthogonal spatial splitting representations, neglecting recent advances in point-based methods like Point-NeRF and 3D Gaussians. Future work should explore diverse scene types and integrate SLAM-centric and NeRF-centric methodologies for a unified evaluation. slam, nerf, implicit neural representation, benchmarking, 3d reconstruction
2403.19456 Report Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Oliver Deussen, Weiming Dong, Jintao Li, Tong-Yee Lee Personalized generation paradigms empower designers to customize visual intellectual properties with the help of textual descriptions by tuning or adapting pre-trained text-to-image models on a few images. Recent works explore approaches for concurrently customizing both content and detailed visual style appearance. However, these existing approaches often generate images where the content and style are entangled. In this study, we reconsider the customization of content and style concepts from the perspective of parameter space construction. Unlike existing methods that utilize a shared parameter space for content and style, we propose a learning framework that separates the parameter space to facilitate individual learning of content and style, thereby enabling disentangled content and style. To achieve this goal, we introduce "partly learnable projection" (PLP) matrices to separate the original adapters into divided sub-parameter spaces. We propose "break-for-make" customization learning pipeline based on PLP, which is simple yet effective. We break the original adapters into "up projection" and "down projection", train content and style PLPs individually with the guidance of corresponding textual prompts in the separate adapters, and maintain generalization by employing a multi-correspondence projection learning strategy. Based on the adapters broken apart for separate training content and style, we then make the entity parameter space by reconstructing the content and style PLPs matrices, followed by fine-tuning the combined adapter to generate the target object with the desired appearance. Experiments on various styles, including textures, materials, and artistic style, show that our method outperforms state-of-the-art single/multiple concept learning pipelines in terms of content-style-prompt alignment. Introduces "Break-for-Make", a novel learning framework using "partly learnable projection" (PLP) matrices to disentangle content and style customization in text-to-image generation. Existing methods for content and style customization in text-to-image generation often lead to entangled results, limiting control over individual aspects. Employs PLP matrices to separate parameter space for content and style, enabling individual training guided by corresponding textual prompts. Utilizes a multi-correspondence projection learning strategy for generalization. Achieves superior content-style-prompt alignment compared to state-of-the-art methods. Demonstrates effective disentanglement of content and style customization. Maintains high fidelity in generated images. Exploration of alternative projection learning strategies for potential improvements. Evaluation of the approach on a wider range of visual styles and complexities. text-to-image generation, content-style disentanglement, personalized image generation, deep learning, computer vision
2403.19386 Report PointCloud-Text Matching: Benchmark Datasets and a Baseline Yanglin Feng, Yang Qin, Dezhong Peng, Hongyuan Zhu, Xi Peng, Peng Hu In this paper, we present and study a new instance-level retrieval task: PointCloud-Text Matching (PTM), which aims to find the exact cross-modal instance that matches a given point-cloud query or text query. PTM could be applied to various scenarios, such as indoor/urban-canyon localization and scene retrieval. However, there exists no suitable and targeted dataset for PTM in practice. Therefore, we construct three new PTM benchmark datasets, namely 3D2T-SR, 3D2T-NR, and 3D2T-QA. We observe that the data is challenging and with noisy correspondence due to the sparsity, noise, or disorder of point clouds and the ambiguity, vagueness, or incompleteness of texts, which make existing cross-modal matching methods ineffective for PTM. To tackle these challenges, we propose a PTM baseline, named Robust PointCloud-Text Matching method (RoMa). RoMa consists of two modules: a Dual Attention Perception module (DAP) and a Robust Negative Contrastive Learning module (RNCL). Specifically, DAP leverages token-level and feature-level attention to adaptively focus on useful local and global features, and aggregate them into common representations, thereby reducing the adverse impact of noise and ambiguity. To handle noisy correspondence, RNCL divides negative pairs, which are much less error-prone than positive pairs, into clean and noisy subsets, and assigns them forward and reverse optimization directions respectively, thus enhancing robustness against noisy correspondence. We conduct extensive experiments on our benchmarks and demonstrate the superiority of our RoMa. This paper introduces PointCloud-Text Matching (PTM), a novel instance-level retrieval task aiming to match point cloud and text data, and proposes RoMa, a robust baseline method for PTM. PTM addresses the need for precise instance-level alignment between point clouds and textual descriptions, with applications in indoor/urban localization and scene retrieval. Existing methods are insufficient due to the challenges of noisy, sparse point cloud data and ambiguous textual descriptions. The authors propose RoMa, comprising a Dual Attention Perception (DAP) module and a Robust Negative Contrastive Learning (RNCL) module. DAP captures local and global features through token and feature-level attention. RNCL handles noisy correspondences by identifying and differently optimizing for clean and noisy negative pairs. RoMa significantly outperforms existing Image-Text Matching methods adapted to PTM, demonstrating its effectiveness. The study highlights the significant challenge noisy correspondences pose in PTM datasets. Ablation studies show the contribution of both DAP and RNCL to RoMa's performance. The performance on PTM datasets is still relatively low compared to Image-Text Matching datasets, indicating room for improvement. The paper primarily focuses on indoor scene datasets, and future work could explore other environments like outdoor urban scenes. pointcloud-text matching, cross-modal retrieval, 3d vision and language, dual attention, robust contrastive learning
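RNCL's treatment of negatives can be illustrated schematically: negative pairs whose similarity is implausibly high are flagged as possibly mislabeled and optimized in the reverse direction, while clean negatives are pushed apart as usual. The threshold, margin, and loss shape below are assumptions, not the paper's exact formulation.

```python
# Schematic of robust negative handling: negatives with suspiciously high
# similarity are treated as potentially mismatched ("noisy") and optimized in
# the opposite direction from clean negatives. Threshold and margins are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def robust_negative_loss(pc_emb, txt_emb, margin=0.2, noisy_thresh=0.8):
    # pc_emb, txt_emb: (B, D) L2-normalized embeddings; row i matches row i.
    sim = pc_emb @ txt_emb.t()                      # (B, B) cosine similarities
    pos = sim.diag()
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim[neg_mask]

    clean = neg[neg < noisy_thresh]                 # push clean negatives down
    noisy = neg[neg >= noisy_thresh]                # reverse direction for noisy ones
    loss_pos = (1.0 - pos).mean()
    loss_clean = F.relu(clean - margin).mean() if clean.numel() else sim.new_zeros(())
    loss_noisy = (1.0 - noisy).mean() if noisy.numel() else sim.new_zeros(())
    return loss_pos + loss_clean + loss_noisy

pc = F.normalize(torch.randn(8, 256), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)
print(float(robust_negative_loss(pc, txt)))
```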
2403.19322 Report Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models Jiaxing Chen, Yuxuan Liu, Dehu Li, Xiang An, Ziyong Feng, Yongle Zhao, Yin Xie The surge of Multimodal Large Language Models (MLLMs), given their prominent emergent capabilities in instruction following and reasoning, has greatly advanced the field of visual reasoning. However, constrained by their non-lossless image tokenization, most MLLMs fall short of comprehensively capturing details of text and objects, especially in high-resolution images. To address this, we propose P2G, a novel framework for plug-and-play grounding of reasoning in MLLMs. Specifically, P2G exploits the tool-usage potential of MLLMs to employ expert agents to achieve on-the-fly grounding to critical visual and textual objects of image, thus achieving deliberate reasoning via multimodal prompting. We further create P2GB, a benchmark aimed at assessing MLLMs' ability to understand inter-object relationships and text in challenging high-resolution images. Comprehensive experiments on visual reasoning tasks demonstrate the superiority of P2G. Noteworthy, P2G achieved comparable performance with GPT-4V on P2GB, with a 7B backbone. Our work highlights the potential of plug-and-play grounding of reasoning and opens up a promising alternative beyond model scaling. This paper proposes P2G (Plug-and-Play Grounding), a framework that enhances multimodal large language models (MLLMs) to perform grounded reasoning on high-resolution and text-rich images, by leveraging external agents for retrieving crucial visual and textual clues. Current MLLMs struggle to comprehensively capture details in complex images due to limitations in image tokenization and the need for extensive instruction tuning data. P2G addresses these limitations by enabling MLLMs to call upon specialized agents for on-the-fly grounding, leading to more accurate and grounded reasoning. P2G employs OCR and visual grounding agents (PaddleOCR and Grounding-DINO) to extract textual and visual clues from images based on the MLLM's assessment of the complexity of the given query. These clues, along with their positions, are integrated into multimodal prompts for subsequent reasoning by the MLLM. P2G significantly outperforms existing MLLMs on text-rich visual reasoning benchmarks, including DocVQA and ChartVQA, achieving up to 3x improvement. On a newly proposed challenging benchmark P2GB, which includes high-resolution and text-rich images, P2G demonstrates superior performance, even surpassing GPT-4V on certain tasks. Ablation studies confirm the importance of both grounding agents and the inclusion of spatial information for achieving optimal performance. The model's reliance on external agents may introduce latency. The current implementation has a limited context window size for processing large amounts of textual information. multimodal large language models, visual reasoning, plug-and-play, grounding, text recognition
2403.19319 Report Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation Yujin Chen, Yinyu Nie, Benjamin Ummenhofer, Reiner Birkl, Michael Paulitsch, Matthias Müller, Matthias Nießner We present Mesh2NeRF, an approach to derive ground-truth radiance fields from textured meshes for 3D generation tasks. Many 3D generative approaches represent 3D scenes as radiance fields for training. Their ground-truth radiance fields are usually fitted from multi-view renderings from a large-scale synthetic 3D dataset, which often results in artifacts due to occlusions or under-fitting issues. In Mesh2NeRF, we propose an analytic solution to directly obtain ground-truth radiance fields from 3D meshes, characterizing the density field with an occupancy function featuring a defined surface thickness, and determining view-dependent color through a reflection function considering both the mesh and environment lighting. Mesh2NeRF extracts accurate radiance fields which provides direct supervision for training generative NeRFs and single scene representation. We validate the effectiveness of Mesh2NeRF across various tasks, achieving a noteworthy 3.12dB improvement in PSNR for view synthesis in single scene representation on the ABO dataset, a 0.69 PSNR enhancement in the single-view conditional generation of ShapeNet Cars, and notably improved mesh extraction from NeRF in the unconditional generation of Objaverse Mugs. Presents Mesh2NeRF, a method to derive ground-truth radiance fields directly from textured 3D meshes for improved 3D generation. Addresses limitations of existing methods that rely on 2D supervision from multi-view renderings, which can lead to inaccurate reconstructions, particularly with limited or imbalanced views. Analytically generates a radiance field from a textured mesh by modeling the density field with an occupancy function and determining view-dependent color using a reflection function considering mesh and environment lighting. Achieves a 3.12dB PSNR improvement in single scene representation on the ABO dataset. Shows a 0.69 PSNR enhancement in single-view conditional generation on ShapeNet Cars. Generates significantly improved mesh extractions from NeRF in unconditional generation on Objaverse Mugs. Current implementation bakes lighting information into the appearance, similar to NeRF. Relies on existing ray sampling techniques designed for rendered images, limiting efficiency. radiance field supervision, nerf generation, mesh prior, 3d generation, novel view synthesis
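The analytic density field described above, an occupancy with a defined surface thickness, reduces to a simple band test on signed distances to the mesh. The step-function occupancy below is a deliberately simplified stand-in for the paper's formulation, assuming the signed distances are already available.

```python
# Simplified sketch of deriving a density/alpha field from a mesh: points whose
# (precomputed) signed distance to the surface falls inside a thickness band
# are treated as occupied. The step-function occupancy and the alpha conversion
# are simplifying assumptions of the idea, not the paper's exact formulation.
import numpy as np

def occupancy_from_sdf(signed_dist: np.ndarray, thickness: float = 0.01) -> np.ndarray:
    # signed_dist: (N,) signed distances of sample points to the mesh surface
    return (np.abs(signed_dist) <= 0.5 * thickness).astype(np.float32)

def alpha_along_ray(signed_dist: np.ndarray, thickness: float = 0.01) -> np.ndarray:
    # Opaque (alpha = 1) wherever the sample lies inside the surface shell.
    return occupancy_from_sdf(signed_dist, thickness)

d = np.array([0.2, 0.004, -0.003, 0.5])
print(alpha_along_ray(d))  # [0. 1. 1. 0.]
```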
2403.19314 Report Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction Xiaoyang Lyu, Chirui Chang, Peng Dai, Yang-Tian Sun, Xiaojuan Qi Scene reconstruction from multi-view images is a fundamental problem in computer vision and graphics. Recent neural implicit surface reconstruction methods have achieved high-quality results; however, editing and manipulating the 3D geometry of reconstructed scenes remains challenging due to the absence of naturally decomposed object entities and complex object/background compositions. In this paper, we present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction. Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition. Total-Decom requires minimal human annotations while providing users with real-time control over the granularity and quality of decomposition. We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing. The code is available at https://github.com/CVMI-Lab/Total-Decom.git. Presents Total-Decom, a novel framework for 3D scene reconstruction and decomposition into individual objects and backgrounds from multi-view images, minimizing the need for human annotations by leveraging the Segment Anything Model (SAM). Editing and manipulating the 3D geometry of traditionally reconstructed scenes is challenging due to the lack of decomposed object entities. Total-Decom addresses this by enabling the extraction of object-level shapes for applications like editing, animation, and simulation. Integrates SAM with a hybrid implicit-explicit neural surface representation. Employs an implicit neural field for reconstruction while distilling features from SAM. Extracts explicit mesh surfaces and distills features into their vertices. Uses SAM decoder to convert user clicks into object masks, guiding a mesh-based region-growing algorithm for object decomposition. Achieves superior scene and decomposed object reconstruction quality compared to state-of-the-art methods like ObjSDF++ on the Replica dataset. Enables interactive decomposition of scenes at varying granularity levels, typically requiring only one click per object. Demonstrates robust background reconstruction, accurately reconstructing even occluded areas. Limitations in handling occluded foreground areas due to the absence of training supervision for invisible regions. Future work will explore integrating generative methods to complete occluded 3D objects and further improve mesh quality. 3d reconstruction, scene decomposition, segment anything model (sam), neural implicit surfaces, region growing
2403.19254 Report Imperceptible Protection against Style Imitation from Diffusion Models Namhyuk Ahn, Wonhyuk Ahn, KiYoon Yoo, Daesik Kim, Seung-Hun Nam Recent progress in diffusion models has profoundly enhanced the fidelity of image generation. However, this has raised concerns about copyright infringements. While prior methods have introduced adversarial perturbations to prevent style imitation, most are accompanied by the degradation of artworks' visual quality. Recognizing the importance of maintaining this, we develop a visually improved protection method that preserves its protection capability. To this end, we create a perceptual map to identify areas most sensitive to human eyes. We then adjust the protection intensity guided by an instance-aware refinement. We also integrate a perceptual constraints bank to further improve the imperceptibility. Results show that our method substantially elevates the quality of the protected image without compromising on protection efficacy. This paper proposes IMPASTO, a novel method to protect artistic styles from unauthorized imitation by diffusion models while preserving the visual quality of the protected artwork. The rise of powerful image generation models like Stable Diffusion leads to concerns about copyright infringement as they can be used to replicate artistic styles without permission. IMPASTO introduces a perception-aware protection (PAP) strategy using perceptual maps based on Just Noticeable Difference (JND) models to identify areas less sensitive to human perception for perturbation. It further enhances imperceptibility by incorporating a perceptual constraint bank that leverages LPIPS, low-pass filtering, and CLIP features. IMPASTO significantly improves the visual quality of protected images compared to existing methods while maintaining comparable protection performance. The instance-wise refinement in IMPASTO allows adaptation to specific artworks, leading to better trade-offs between imperceptibility and protection. IMPASTO demonstrates robustness against various countermeasures and generalizes well to unknown personalization methods and diffusion models. Current protection methods rely on adversarial perturbations, which are computationally expensive and time-consuming. Future research could explore more efficient protection mechanisms to address the time constraints. style protection, diffusion models, copyright infringement, adversarial perturbation, perceptual quality
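The perception-aware idea, scaling the protective perturbation by a per-pixel map so that visually sensitive regions receive smaller changes, can be sketched as below. The contrast-based map is only a stand-in for a proper JND model, and the perturbation budget is an assumed value.

```python
# Schematic of perception-aware protection: an adversarial perturbation is
# scaled by a per-pixel perceptual map so regions the eye is sensitive to
# receive smaller changes. The map here (local luminance contrast) is a
# stand-in for a JND model; the budget is an assumption.
import torch
import torch.nn.functional as F

def perceptual_map(image: torch.Tensor, k: int = 5) -> torch.Tensor:
    # image: (B, 3, H, W) in [0, 1]; higher value = more change is tolerable.
    gray = image.mean(dim=1, keepdim=True)
    local_mean = F.avg_pool2d(gray, k, stride=1, padding=k // 2)
    contrast = (gray - local_mean).abs()
    return contrast / (contrast.amax(dim=(-2, -1), keepdim=True) + 1e-8)

def apply_protection(image: torch.Tensor, raw_perturbation: torch.Tensor,
                     budget: float = 8.0 / 255.0) -> torch.Tensor:
    scale = budget * perceptual_map(image)             # per-pixel intensity
    protected = image + scale * raw_perturbation.sign()
    return protected.clamp(0.0, 1.0)

img = torch.rand(1, 3, 64, 64)
delta = torch.randn_like(img)   # e.g. a gradient w.r.t. some style-imitation loss
print(apply_protection(img, delta).shape)
```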
2403.19205 Report From Activation to Initialization: Scaling Insights for Optimizing Neural Fields Hemanth Saratchandran, Sameera Ramasinghe, Simon Lucey In the realm of computer vision, Neural Fields have gained prominence as a contemporary tool harnessing neural networks for signal representation. Despite the remarkable progress in adapting these networks to solve a variety of problems, the field still lacks a comprehensive theoretical framework. This article aims to address this gap by delving into the intricate interplay between initialization and activation, providing a foundational basis for the robust optimization of Neural Fields. Our theoretical insights reveal a deep-seated connection among network initialization, architectural choices, and the optimization process, emphasizing the need for a holistic approach when designing cutting-edge Neural Fields. This paper provides theoretical insights into the scaling dynamics of Neural Fields, particularly focusing on how the number of parameters affects gradient descent convergence in relation to dataset size. The paper addresses the lack of a comprehensive theoretical framework for Neural Fields, aiming to establish a foundation for their robust optimization. The authors theoretically analyze the scaling laws for neural fields with sine, sinc, Gaussian, and wavelet activations, proving the convergence of gradient descent under specific overparameterization conditions. They also develop a novel initialization scheme and empirically validate their findings on various applications. Neural Fields with sine, sinc, Gaussian, or wavelet activations require less overparameterization than those with ReLU for gradient descent convergence. The authors propose a novel initialization scheme that significantly improves parameter efficiency compared to standard methods like LeCun, Xavier, and Kaiming. Empirical validation on applications like image regression, super-resolution, shape reconstruction, and physics-informed neural networks supports the theoretical findings. Theoretical results currently apply only to full-batch gradient descent, not mini-batch training. Exploring the generalization of the findings to other activation functions and network architectures is left for future work. neural fields, overparameterization, initialization, gradient descent, scaling laws
2403.19164 Report RecDiffusion: Rectangling for Image Stitching with Diffusion Models Tianhao Zhou, Haipeng Li, Ziyi Wang, Ao Luo, Chen-Lin Zhang, Jiajun Li, Bing Zeng, Shuaicheng Liu Image stitching from different captures often results in non-rectangular boundaries, which is often considered unappealing. To solve non-rectangular boundaries, current solutions involve cropping, which discards image content, inpainting, which can introduce unrelated content, or warping, which can distort non-linear features and introduce artifacts. To overcome these issues, we introduce a novel diffusion-based learning framework, RecDiffusion, for image stitching rectangling. This framework combines Motion Diffusion Models (MDM) to generate motion fields, effectively transitioning from the stitched image's irregular borders to a geometrically corrected intermediary. Followed by Content Diffusion Models (CDM) for image detail refinement. Notably, our sampling process utilizes a weighted map to identify regions needing correction during each iteration of CDM. Our RecDiffusion ensures geometric accuracy and overall visual appeal, surpassing all previous methods in both quantitative and qualitative measures when evaluated on public benchmarks. Code is released at https://github.com/lhaippp/RecDiffusion. This paper presents RecDiffusion, the first diffusion-based learning framework for rectangling images stitched from multiple captures, overcoming limitations of cropping, inpainting, and warping methods. Image stitching often results in non-rectangular boundaries, which are aesthetically unappealing. Existing solutions either sacrifice content, introduce artifacts, or struggle with non-linear features. RecDiffusion utilizes a two-step process: 1) Motion Diffusion Models (MDM) generate motion fields to rectify irregular boundaries, and 2) Content Diffusion Models (CDM) refine image details using a weighted sampling map based on the Rank-Nullity Theorem. RecDiffusion outperforms previous state-of-the-art methods in both quantitative metrics (FID, SSIM, PSNR) and qualitative comparisons. The method effectively eliminates white edges and minimizes artifacts like line discontinuities and distortions. RecDiffusion demonstrates strong generalization ability, effectively rectangling images from different datasets. The model currently relies on pre-trained motion fields, potentially limiting performance. Future work could explore joint optimization of motion estimation and content refinement within the diffusion framework. image stitching, image rectangling, diffusion models, motion estimation, content refinement
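Applying a predicted rectangling motion field is a standard backward warp. The sketch below uses grid_sample with a placeholder flow standing in for the MDM output; the flow convention and padding mode are assumptions.

```python
# Sketch of applying a predicted rectangling motion field: a standard backward
# warp with grid_sample. The motion field here is a zero placeholder standing
# in for the output of the motion diffusion stage.
import torch
import torch.nn.functional as F

def warp_with_motion_field(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # image: (B, 3, H, W); flow: (B, 2, H, W) pixel offsets ordered as (dx, dy).
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(image)  # (1, 2, H, W)
    coords = base + flow
    # Normalize to [-1, 1] for grid_sample; grid layout is (B, H, W, 2) as (x, y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(image, grid, mode="bilinear", padding_mode="border",
                         align_corners=True)

img = torch.rand(1, 3, 128, 160)
flow = torch.zeros(1, 2, 128, 160)        # zero flow -> identity warp
print(torch.allclose(warp_with_motion_field(img, flow), img, atol=1e-4))
```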
2403.19046 Report LITA: Language Instructed Temporal-Localization Assistant De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important missing piece is temporal localization. These models cannot accurately answer the "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and temporal localization of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement of Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA This paper proposes Language Instructed Temporal-Localization Assistant (LITA), a novel Video LLM framework designed to enable accurate temporal event localization in videos, addressing a key limitation of existing Video LLMs. Temporal localization is crucial for comprehensive video understanding, differentiating videos from images. Current Video LLMs struggle to pinpoint event timings, hindering their ability to fully interpret and interact with video content. LITA introduces three key innovations: 1) Time tokens: Representing relative timestamps (e.g., first 10% of the video) instead of absolute ones for improved time representation. 2) SlowFast tokens: Inspired by the SlowFast architecture, LITA uses densely sampled fast tokens for temporal information and sparsely sampled slow tokens for spatial details, enabling efficient processing of numerous frames. 3) Emphasis on temporal localization data: LITA is trained on a diverse range of tasks including a novel Reasoning Temporal Localization (RTL) task with the ActivityNet-RTL dataset. RTL requires models to reason about events not explicitly stated, promoting both temporal and contextual understanding. On the ActivityNet-RTL benchmark, LITA significantly outperforms baseline models, nearly doubling the temporal mean intersection-over-union (mIoU) score. LITA demonstrates the ability to provide detailed and accurate explanations for its temporal localization reasoning, showcasing enhanced video understanding. Beyond accurate temporal localization, LITA exhibits substantial improvements in general video-based text generation tasks compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding on a benchmark by Maaz et al. (2023). The discretization of timestamps into time tokens, while beneficial, introduces a level of discretization error in temporal localization. Future work could explore alternative time representation methods within the LLM framework to potentially mitigate discretization error and further enhance temporal accuracy. video language models, temporal localization, reasoning, multimodal learning, computer vision
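The relative time tokens can be illustrated with a tiny conversion between seconds and discrete token indices. The vocabulary size and token naming below are assumptions for illustration, not LITA's actual token set.

```python
# Minimal sketch of relative time tokens: a timestamp in seconds maps to one of
# T discrete tokens covering the video's duration, and back to the time chunk
# it represents. T = 100 and the "<t..>" naming are assumptions.
NUM_TIME_TOKENS = 100

def time_to_token(t_sec: float, video_len_sec: float) -> str:
    idx = min(int(t_sec / video_len_sec * NUM_TIME_TOKENS), NUM_TIME_TOKENS - 1)
    return f"<t{idx}>"

def token_to_interval(token: str, video_len_sec: float):
    idx = int(token.strip("<>").lstrip("t"))
    chunk = video_len_sec / NUM_TIME_TOKENS
    return idx * chunk, (idx + 1) * chunk

print(time_to_token(12.0, 60.0))         # '<t20>' (12 s is 20% into a 60 s video)
print(token_to_interval("<t20>", 60.0))  # (12.0, 12.6)
```

Because a token only identifies a chunk of the video rather than an exact instant, this encoding is what introduces the discretization error noted in the entry above.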
2403.18978 Report TextCraftor: Your Text Encoder Can be Image Quality Controller Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations, numerous studies have endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing various technologies. Yet, amidst these efforts, a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models, we can enhance it through our proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly, our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet finetuning, and can be combined to further improve generative quality. This paper introduces TextCraftor, a novel approach to fine-tuning the text encoder in text-to-image diffusion models for improved image quality and text-image alignment. Existing methods for improving diffusion models primarily focus on fine-tuning the UNet or using larger language models, which can be computationally expensive. Fine-tuning the text encoder offers a more efficient way to enhance performance. TextCraftor leverages public reward functions (e.g., aesthetics, text-image alignment) to guide the fine-tuning process. It employs a prompt-based approach, eliminating the need for paired text-image datasets and enabling optimization with only text prompts. To ensure generality and avoid mode collapse, it incorporates CLIP space similarity as a constraint during training. TextCraftor significantly improves image quality and text-image alignment compared to pre-trained models like SDv1.5, SDv2.0, SDXL Base 0.9, and DeepFloyd-XL. It outperforms prompt engineering techniques and previous state-of-the-art methods like DDPO. The approach allows for controllable generation through interpolation of text embeddings from different fine-tuned models, enabling style mixing. The reliance on public reward functions can limit performance to the capabilities of those functions. Fine-tuning larger diffusion models with TextCraftor can be computationally expensive, though the authors demonstrate strong generalization capabilities allowing for fine-tuning on smaller models and transferring to larger ones. text-to-image generation, diffusion models, text encoder fine-tuning, reward functions, controllable image synthesis
2403.18922 Report Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D Mukund Varma T, Peihao Wang, Zhiwen Fan, Zhangyang Wang, Hao Su, Ravi Ramamoorthi In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization. Lift3D is a novel method that leverages generalizable novel view synthesis to lift any 2D vision model to 3D, enabling view-consistent predictions from arbitrary angles without task-specific training or scene-specific optimization. Extending 2D vision models to 3D is crucial for applications like autonomous driving and robotics, but is challenging due to the limited availability of 3D data and the complexity of existing methods. Lift3D trains a neural renderer to interpolate features from pre-trained 2D vision models across multiple views, using a corrective aggregation strategy to ensure consistency. Lift3D achieves comparable or better performance than state-of-the-art methods on 3D semantic segmentation, style transfer, and scene editing. It exhibits strong zero-shot generalization, enabling the lifting of various 2D vision models for tasks like open vocabulary segmentation and image colorization without additional training. The method is computationally efficient, particularly when generating predictions for numerous viewpoints. Lift3D's performance may be limited in scenes with sparse views or complex light transport where epipolar geometry doesn't hold. The interpolation strategy may result in a slight loss of visual quality compared to per-scene optimization methods. 3d vision, novel view synthesis, feature lifting, zero-shot learning, multi-view consistency
2403.18820 Report MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering Guoxing Sun, Rishabh Dabral, Pascal Fua, Christian Theobalt, Marc Habermann Faithful human performance capture and free-view rendering from sparse RGB observations is a long-standing problem in Vision and Graphics. The main challenges are the lack of observations and the inherent ambiguities of the setting, e.g. occlusions and depth ambiguity. As a result, radiance fields, which have shown great promise in capturing high-frequency appearance and geometry details in dense setups, perform poorly when naïvely supervising them on sparse camera views, as the field simply overfits to the sparse-view inputs. To address this, we propose MetaCap, a method for efficient and high-quality geometry recovery and novel view synthesis given very sparse or even a single view of the human. Our key idea is to meta-learn the radiance field weights solely from potentially sparse multi-view videos, which can serve as a prior when fine-tuning them on sparse imagery depicting the human. This prior provides a good network weight initialization, thereby effectively addressing ambiguities in sparse-view capture. Due to the articulated structure of the human body and motion-induced surface deformations, learning such a prior is non-trivial. Therefore, we propose to meta-learn the field weights in a pose-canonicalized space, which reduces the spatial feature range and makes feature learning more effective. Consequently, one can fine-tune our field parameters to quickly generalize to unseen poses, novel illumination conditions as well as novel and sparse (even monocular) camera views. For evaluating our method under different scenarios, we collect a new dataset, WildDynaCap, which contains subjects captured in both a dense camera dome and in-the-wild sparse camera rigs, and demonstrate superior results compared to recent state-of-the-art methods on both the public and WildDynaCap datasets. MetaCap is a novel method for high-quality human performance capture and rendering from sparse multi-view or even monocular images using a meta-learned implicit human representation. Sparse-view human capture suffers from inherent ambiguities such as occlusions and depth ambiguity. Existing methods struggle to achieve both high fidelity and fast adaptation to novel poses, views, and illumination. The method meta-learns optimal network weights of an implicit human representation in a pose-canonicalized space from multi-view imagery. This prior enables fast fine-tuning on sparse in-the-wild images and handles occlusions via a visibility map and proxy images. Outperforms state-of-the-art methods in terms of geometry reconstruction and novel view synthesis on both public datasets and the new WildDynaCap dataset. Generalizes to novel poses, surface deformations, lighting conditions, and camera parameters. Supports reconstruction from various sparse multi-view and monocular imagery during both training and inference. The method can be sensitive to template fitting and motion capture inaccuracies. Temporal information is not fully leveraged and could further enhance robustness. Future work includes exploring real-time fine-tuning and cross-identity prior learning. human performance capture, meta-learning, implicit representations, sparse-view reconstruction, novel view synthesis
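A hedged sketch of the meta-learning loop behind a MetaCap-style weight prior follows: an inner loop fits the radiance field to one multi-view training frame, and a Reptile-style outer update (one possible instantiation; the paper's exact meta-learner and its pose-canonicalized space are not reproduced here) nudges the shared initialization toward the adapted weights. `sample_training_frame` and `render_loss` are assumed helpers.

```python
# Meta-learn a radiance-field weight prior that later serves as the
# initialization for fast sparse-view fine-tuning.
import copy
import torch

def meta_learn_prior(field, sample_training_frame, render_loss,
                     meta_steps=1000, inner_steps=16,
                     inner_lr=5e-4, outer_lr=0.1):
    for _ in range(meta_steps):
        frame = sample_training_frame()          # multi-view supervision
        adapted = copy.deepcopy(field)
        opt = torch.optim.Adam(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):             # fast per-frame adaptation
            opt.zero_grad()
            render_loss(adapted, frame).backward()
            opt.step()
        # Reptile outer update: move the prior toward the adapted weights.
        with torch.no_grad():
            for p, q in zip(field.parameters(), adapted.parameters()):
                p.add_(outer_lr * (q - p))
    return field  # initialization for sparse-view / monocular fine-tuning
```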
2403.18819 Report Benchmarking Object Detectors with COCO: A New Path Forward Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, Karan Desai The Common Objects in Context (COCO) dataset has been instrumental in benchmarking object detectors over the past decade. Like every dataset, COCO contains subtle errors and imperfections stemming from its annotation procedure. With the advent of high-performing models, we ask whether these errors of COCO are hindering its utility in reliably benchmarking further progress. In search for an answer, we inspect thousands of masks from COCO (2017 version) and uncover different types of errors such as imprecise mask boundaries, non-exhaustively annotated instances, and mislabeled masks. Due to the prevalence of COCO, we choose to correct these errors to maintain continuity with prior research. We develop COCO-ReM (Refined Masks), a cleaner set of annotations with visibly better mask quality than COCO-2017. We evaluate fifty object detectors and find that models that predict visually sharper masks score higher on COCO-ReM, affirming that they were being incorrectly penalized due to errors in COCO-2017. Moreover, our models trained using COCO-ReM converge faster and score higher than their larger variants trained using COCO-2017, highlighting the importance of data quality in improving object detectors. With these findings, we advocate using COCO-ReM for future object detection research. Our dataset is available at https://cocorem.xyz The paper introduces COCO-ReM, a refined version of the COCO dataset for object detection with higher-quality instance annotations. COCO, while popular, has imperfections like coarse boundaries and non-exhaustive annotations, hindering its reliability in benchmarking object detectors. The authors developed a semi-automatic pipeline using SAM for mask refinement, imported instances from LVIS for exhaustiveness, and manually verified the validation set. All 50 evaluated object detectors scored higher on COCO-ReM than COCO-2017. Query-based detectors outperform region-based detectors on COCO-ReM, aligning with human judgment of mask sharpness. Models trained on COCO-ReM converge faster and perform better than those trained on COCO-2017, demonstrating the impact of data quality. Potential noise from SAM's occasional hallucination of disconnected components. Limited manual verification to the validation set due to the large size of the training set. object detection, instance segmentation, dataset, benchmarking, coco
2403.18814 Report Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini. Introduces Mini-Gemini, a simple yet effective framework that enhances multi-modality Vision Language Models (VLMs) by focusing on efficient high-resolution solutions, high-quality data, and expanded applications. Aims to bridge the performance gap between existing VLMs and advanced models like GPT-4 and Gemini, particularly in academic settings with limited resources. Utilizes dual vision encoders for low-resolution embedding and high-resolution candidate generation; employs patch info mining for efficient high-resolution detail extraction; constructs a high-quality dataset for training; and integrates with generative models for text and image generation. Achieves leading performance in various zero-shot benchmarks, outperforming existing methods, including LLaVA-1.5 and LLaVA-NeXT. Demonstrates superior performance even compared to high-resource private models like Gemini Pro and Qwen-VL-Plus on challenging benchmarks like MMB and MMMU. Showcases strong capabilities in handling complex visual understanding and reasoning tasks, as well as generating contextually relevant images from multi-modal instructions. Limitations in counting ability and complex visual reasoning due to potential gaps in training data. Exploration of more advanced methods for visual understanding, reasoning, and generation, particularly in bridging VLMs and diffusion models. vision language models, multi-modality, image understanding, image generation, reasoning
2403.18807 Report ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation Suraj Patni, Aradhye Agarwal, Chetan Arora In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models, such as CLIP, improves zero shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures greater relevant information for SIDE than the usual route of generating pseudo image captions, followed by CLIP based text embeddings. Based on this idea, we propose a new SIDE model using a diffusion backbone which is conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on NYUv2 dataset, achieving Abs Rel error of 0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD). And on KITTI dataset, achieving Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%) over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth. The project page is available at https://ecodepth-iitd.github.io This paper proposes a novel single image depth estimation (SIDE) model using a diffusion model conditioned on global image priors generated from a pre-trained Vision Transformer (ViT). This approach addresses the limitations of learning-based SIDE models that heavily rely on shading and contextual cues, making them domain-specific and difficult to generalize. The method utilizes a conditional diffusion architecture where semantic context is provided through embeddings generated using a pre-trained ViT model, rather than relying on pseudo image captions. Achieves state-of-the-art performance on NYU Depth v2 and KITTI datasets, significantly outperforming previous methods. Demonstrates that using ViT embeddings for semantic context is more effective than employing pseudo captions and their CLIP embeddings. Exhibits strong generalization and zero-shot transfer capabilities, outperforming state-of-the-art methods even when trained on a single dataset. The model requires significant computational resources for training. Further exploration of optimal ViT architectures and embedding dimensions could potentially improve performance. single image depth estimation, diffusion models, vision transformer (vit), zero-shot transfer, semantic context
2403.18795 Report Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction Qiuhong Shen, Xuanyu Yi, Zike Wu, Pan Zhou, Hanwang Zhang, Shuicheng Yan, Xinchao Wang We tackle the challenge of efficiently reconstructing a 3D asset from a single image with growing demands for automated 3D content creation pipelines. Previous methods primarily rely on Score Distillation Sampling (SDS) and Neural Radiance Fields (NeRF). Despite their significant success, these approaches encounter practical limitations due to lengthy optimization and considerable memory usage. In this report, we introduce Gamba, an end-to-end amortized 3D reconstruction model from single-view images, emphasizing two main insights: (1) 3D representation: leveraging a large number of 3D Gaussians for an efficient 3D Gaussian splatting process; (2) Backbone design: introducing a Mamba-based sequential network that facilitates context-dependent reasoning and linear scalability with the sequence (token) length, accommodating a substantial number of Gaussians. Gamba incorporates significant advancements in data preprocessing, regularization design, and training methodologies. We assessed Gamba against existing optimization-based and feed-forward 3D generation approaches using the real-world scanned OmniObject3D dataset. Here, Gamba demonstrates competitive generation capabilities, both qualitatively and quantitatively, while achieving remarkable speed, approximately 0.6 second on a single NVIDIA A100 GPU. Introducing Gamba, an end-to-end amortized 3D reconstruction model from single-view images using 3D Gaussian Splatting and a Mamba-based sequential network. Addresses limitations of previous Score Distillation Sampling (SDS) and Neural Radiance Fields (NeRF) methods, which suffer from lengthy optimization, high memory usage, and rendering inefficiencies. Combines 3D Gaussian Splatting for efficient representation with a Mamba-based sequential network (GambaFormer) for context-dependent reasoning and linear scalability with token length. Employs robust training techniques like Gaussian parameter constraints and data augmentation. Achieves competitive generation quality compared to state-of-the-art methods, both qualitatively and quantitatively (PSNR, LPIPS, CLIP Distance). Exhibits remarkable speed, reconstructing a 3D asset in approximately 0.6 seconds on a single NVIDIA A100 GPU, significantly faster than optimization-based alternatives. Demonstrates effectiveness on the OmniObject3D dataset, showcasing reasonable geometry understanding and plausible texture generation. Struggles to generate sharp textures for occluded areas, particularly with complex textures. Limited generalization to 'unseen' 3D assets with large domain disparity from the training data (OmniObject3D). 3d reconstruction, single-view reconstruction, 3d gaussian splatting, mamba network, amortized inference
2403.18784 Report SplatFace: Gaussian Splat Face Reconstruction Leveraging an Optimizable Surface Jiahao Luo, Jing Liu, James Davis We present SplatFace, a novel Gaussian splatting framework designed for 3D human face reconstruction without reliance on accurate pre-determined geometry. Our method is designed to simultaneously deliver both high-quality novel view rendering and accurate 3D mesh reconstructions. We incorporate a generic 3D Morphable Model (3DMM) to provide a surface geometric structure, making it possible to reconstruct faces with a limited set of input images. We introduce a joint optimization strategy that refines both the Gaussians and the morphable surface through a synergistic non-rigid alignment process. A novel distance metric, splat-to-surface, is proposed to improve alignment by considering both the Gaussian position and covariance. The surface information is also utilized to incorporate a world-space densification process, resulting in superior reconstruction quality. Our experimental analysis demonstrates that the proposed method is competitive with both other Gaussian splatting techniques in novel view synthesis and other 3D reconstruction methods in producing 3D face meshes with high geometric precision. SplatFace, a novel Gaussian splatting framework for 3D human face reconstruction from a limited set of input images without relying on accurate pre-determined geometry. Existing methods for 3D face reconstruction either rely on a large number of input images or require accurate pre-determined geometry, limiting their practical application. This paper aims to address this limitation. SplatFace incorporates a generic 3D Morphable Model (3DMM) and jointly optimizes the Gaussian splats and the morphable surface through a non-rigid alignment process guided by a novel splat-to-surface distance metric and world-space densification. SplatFace achieves higher quality novel view synthesis with fewer artifacts compared to baseline Gaussian splatting methods. SplatFace outperforms state-of-the-art multi-view 3D face reconstruction methods in terms of geometric accuracy. Joint optimization with a generic 3DMM initialization effectively reconstructs 3D face shapes, achieving comparable results to using ground truth surface initialization. The method might suffer from over-regularization in regions with complex geometry, such as teeth and hair, due to limitations of the surface model. While outperforming existing methods, the rendered images, especially in far test views, are not entirely artifact-free, indicating room for further improvement. 3d face reconstruction, gaussian splatting, novel view synthesis, 3d morphable model, few-shot learning
2403.18775 Report ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object Chenshuang Zhang, Fei Pan, Junmo Kim, In So Kweon, Chengzhi Mao We establish rigorous benchmarks for visual perception robustness. Synthetic images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific types of evaluation over synthetic corruptions, backgrounds, and textures, yet those robustness benchmarks are restricted to the specified variations and have low synthetic quality. In this work, we introduce generative models as a data source for synthesizing hard images that benchmark deep models' robustness. Leveraging diffusion models, we are able to generate images with more diversified backgrounds, textures, and materials than any prior work, where we term this benchmark as ImageNet-D. Experimental results show that ImageNet-D results in a significant accuracy drop across a range of vision models, from the standard ResNet visual classifier to the latest foundation models like CLIP and MiniGPT-4, significantly reducing their accuracy by up to 60%. Our work suggests that diffusion models can be an effective source to test vision models. The code and dataset are available at https://github.com/chenshuang-zhang/imagenet_d. This paper introduces ImageNet-D, a new synthetic dataset for benchmarking the robustness of visual perception models, particularly against variations in background, texture, and material. Existing robustness benchmarks often rely on synthetic images with limited diversity and realism, failing to accurately assess model robustness in real-world scenarios. The authors leverage diffusion models to generate a vast pool of images with diverse object and nuisance combinations. Hard images, those misclassified by multiple surrogate models, are selectively retained and further validated by human annotators, forming the final ImageNet-D dataset. ImageNet-D causes a significant accuracy drop (up to 60%) across a range of vision models, including ResNets, ViTs, CLIP, LLaVa, and MiniGPT-4. Existing data augmentation techniques, while effective on benchmarks like ImageNet-C, fail to improve robustness on ImageNet-D, suggesting its unique challenges. Training models on diffusion-generated images with diverse attributes can enhance robustness on ImageNet-D and generalize better to real-world datasets like ObjectNet. The current version of ImageNet-D only includes a subset of ImageNet categories. Future work could explore generating even more challenging images by leveraging advancements in generative models and incorporating additional nuisance factors. robustness, benchmarking, visual perception, diffusion models, synthetic data
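The "keep only hard images" filter described above is easy to make concrete: a synthetic candidate survives only if every surrogate classifier misclassifies it, with human verification applied afterwards. The sketch below assumes the surrogate list and the upstream diffusion sampler are supplied elsewhere.

```python
# Retain only images that fool all surrogate classifiers.
import torch

@torch.no_grad()
def select_hard_images(images, labels, surrogates):
    """images: (N, 3, H, W); labels: (N,); surrogates: list of classifiers."""
    keep = torch.ones(len(images), dtype=torch.bool, device=labels.device)
    for model in surrogates:
        preds = model(images).argmax(dim=-1)
        keep &= preds.ne(labels)   # must be misclassified by *all* surrogates
    return images[keep], labels[keep]
```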
2403.18660 Report InstructBrush: Learning Attention-based Instruction Optimization for Image Editing Ruoyu Zhao, Qingnan Fan, Fei Kou, Shuai Qin, Hong Gu, Wei Wu, Pengcheng Xu, Mingrui Zhu, Nannan Wang, Xinbo Gao In recent years, instruction-based image editing methods have garnered significant attention in image editing. However, despite encompassing a wide range of editing priors, these methods are helpless when handling editing tasks that are challenging to accurately describe through language. We propose InstructBrush, an inversion method for instruction-based image editing methods to bridge this gap. It extracts editing effects from exemplar image pairs as editing instructions, which are further applied for image editing. Two key techniques are introduced into InstructBrush, Attention-based Instruction Optimization and Transformation-oriented Instruction Initialization, to address the limitations of the previous method in terms of inversion effects and instruction generalization. To explore the ability of instruction inversion methods to guide image editing in open scenarios, we establish a Transformation-Oriented Paired Benchmark (TOP-Bench), which contains a rich set of scenes and editing types. The creation of this benchmark paves the way for further exploration of instruction inversion. Quantitatively and qualitatively, our approach achieves superior performance in editing and is more semantically consistent with the target editing effects. Proposes InstructBrush, a novel method to extract editing instructions from exemplar image pairs for image editing, addressing the limitations of language in describing complex editing tasks. Instruction-based image editing methods, while powerful, struggle with edits that are challenging to express through language. Instruction inversion, learning instructions from visual examples, offers a solution. Introduces Attention-based Instruction Optimization, directly optimizing instructions within the cross-attention layers of a diffusion model for enhanced representation. Also proposes Transformation-oriented Instruction Initialization, incorporating editing-specific priors by identifying unique phrases differentiating before-and-after edit images. Outperforms existing methods in both local and global image editing tasks. Demonstrates superior instruction generalization, avoiding the introduction of irrelevant content from training images. Achieves higher scores on quantitative metrics such as PSNR, SSIM, LPIPS, and CLIP directional similarity. Editing capabilities are limited by the prior of the base instruction-based editing model. Effectiveness of Transformation-oriented Instruction Initialization is dependent on the vocabulary used for unique phrase extraction. image editing, prompt inversion, diffusion models, instruction learning, visual prompts
2403.18551 Report Attention Calibration for Disentangled Text-to-Image Personalization Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation, 3D and video composition. Further, personalized techniques enable appealing customized production of a novel concept given only several images as reference. However, an intriguing problem persists: Is it possible to capture multiple, novel concepts from one single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and eliminate cross-influence from concepts. To alleviate this, we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, can learn disentangled multiple concepts from one single image and produce novel customized images with learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences. This paper introduces DisenDiff, a personalized text-to-image generation model that can learn multiple novel concepts from a single image and use them to generate novel images with these concepts in various contexts while preserving high fidelity to the original image. Existing personalized text-to-image generation methods struggle to capture and independently manipulate multiple concepts from a single image, limiting their flexibility for creative editing and content creation. DisenDiff achieves this by introducing an attention calibration mechanism that binds new word embeddings to corresponding class tokens and uses a separate and strengthen strategy to ensure distinct attention maps for different concepts during training. DisenDiff outperforms state-of-the-art methods in both qualitative and quantitative evaluations, demonstrating superior image fidelity and editing capabilities. The proposed attention calibration mechanism, specifically the binding and separate & strengthen constraints, are crucial for achieving high-fidelity and disentangled concept representation. DisenDiff is compatible with LoRA and inpainting pipelines, showcasing its potential for broader applications like personalized image editing and concept manipulation. Disentangling fine-grained categories within the same semantic class (e.g., two dog breeds) remains challenging. Extending the method to handle more than three concepts effectively requires further algorithmic development. text-to-image generation, personalized image synthesis, concept learning, attention mechanism, disentanglement
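A hedged sketch of the attention-calibration idea: given the cross-attention maps of two class tokens, penalize their overlap and encourage each map to concentrate on its own region. The exact loss terms used by DisenDiff may differ; this is illustrative only.

```python
# Overlap suppression plus per-concept concentration on cross-attention maps.
import torch

def calibration_loss(attn_a, attn_b, eps=1e-6):
    """attn_a, attn_b: (H*W,) cross-attention maps for two class tokens."""
    a = attn_a / (attn_a.sum() + eps)
    b = attn_b / (attn_b.sum() + eps)
    overlap = torch.minimum(a, b).sum()          # suppress shared activation
    # Encourage each map to be concentrated (low entropy) in its own region.
    ent_a = -(a * (a + eps).log()).sum()
    ent_b = -(b * (b + eps).log()).sum()
    return overlap + 0.1 * (ent_a + ent_b)
```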
2403.18493 Report VersaT2I: Improving Text-to-Image Models with Versatile Reward Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that can boost the performance with multiple rewards of any T2I model. We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc. Then, for every quality aspect, we select high-quality images in this aspect generated by the model as the training set to finetune the T2I model using the Low-Rank Adaptation (LoRA). Furthermore, we introduce a gating function to combine multiple quality aspects, which can avoid conflicts between different quality aspects. Our method is easy to extend and does not require any manual annotation, reinforcement learning, or model architecture changes. Extensive experiments demonstrate that VersaT2I outperforms the baseline methods across various quality criteria. This paper introduces VersaT2I, a novel training framework to improve text-to-image (T2I) models by incorporating various reward signals without relying on resource-intensive reinforcement learning. Existing T2I models often struggle to generate images that are aesthetically pleasing, geometrically accurate, and faithful to the input text. VersaT2I aims to address these limitations and improve the overall quality of generated images. VersaT2I decomposes image quality into four aspects: aesthetics, text-image alignment, geometry, and low-level quality. It leverages pre-trained evaluation models for each aspect to select high-quality images generated by the T2I model. These selected images form a training set used to fine-tune the model using LoRA. Further, a novel Mixture of LoRA (MoL) approach combines multiple LoRA models trained on different aspects, improving the model's overall performance. VersaT2I outperforms baseline methods, including direct LoRA merging and RL approaches, across various quality metrics. Single-reward LoRA models fine-tuned using VersaT2I show significant improvements in respective evaluation benchmarks for both SD v2.1 and SDXL. MoL successfully alleviates conflicts between different LoRAs, leading to consistent improvement in overall image quality. The current implementation of VersaT2I relies on a limited number of predefined aspects and their corresponding evaluation models. Exploring a wider range of quality aspects and fine-grained annotations could further enhance the framework. Future work could focus on mitigating the potential societal impact of improved T2I models, such as the generation of deepfakes and manipulated content. text-to-image generation, generative models, diffusion models, low-rank adaptation (lora), reward learning
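A minimal sketch of the self-training data selection VersaT2I relies on: score the model's own generations with one off-the-shelf evaluator per quality aspect and keep the top fraction as that aspect's LoRA fine-tuning set (a gating function then combines the per-aspect LoRAs). The scorer names are placeholders, not specific models.

```python
# Per-aspect selection of the model's own best generations for LoRA training.
def select_finetune_sets(images, prompts, scorers, keep_frac=0.1):
    """scorers: {aspect_name: callable(image, prompt) -> float quality score}."""
    sets = {}
    for aspect, score in scorers.items():
        ranked = sorted(zip(images, prompts),
                        key=lambda ip: score(ip[0], ip[1]), reverse=True)
        sets[aspect] = ranked[: max(1, int(keep_frac * len(ranked)))]
    return sets  # each aspect's set trains its own LoRA; a gate fuses them
```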
2403.18476 Report Modeling uncertainty for Gaussian Splatting Luca Savant, Diego Valsesia, Enrico Magli We present Stochastic Gaussian Splatting (SGS): the first framework for uncertainty estimation using Gaussian Splatting (GS). GS recently advanced the novel-view synthesis field by achieving impressive reconstruction quality at a fraction of the computational cost of Neural Radiance Fields (NeRF). However, contrary to the latter, it still lacks the ability to provide information about the confidence associated with their outputs. To address this limitation, in this paper, we introduce a Variational Inference-based approach that seamlessly integrates uncertainty prediction into the common rendering pipeline of GS. Additionally, we introduce the Area Under Sparsification Error (AUSE) as a new term in the loss function, enabling optimization of uncertainty estimation alongside image reconstruction. Experimental results on the LLFF dataset demonstrate that our method outperforms existing approaches in terms of both image rendering quality and uncertainty estimation accuracy. Overall, our framework equips practitioners with valuable insights into the reliability of synthesized views, facilitating safer decision-making in real-world applications. Introduced Stochastic Gaussian Splatting (SGS), the first framework for uncertainty estimation using Gaussian Splatting (GS), enabling real-time synthesis of high-quality images with accurate uncertainty predictions. Gaussian Splatting lacks a mechanism for estimating uncertainty in synthesized views, crucial for real-world applications requiring reliability assessments. Employs Variational Inference to learn parameters of the GS radiance field in a Bayesian framework, incorporating uncertainty prediction into the rendering pipeline. Introduces Area Under Sparsification Error (AUSE) for optimizing uncertainty estimation alongside image reconstruction. Leverages Empirical Bayes for informative prior initialization. Significantly improves rendering quality metrics (PSNR, SSIM, LPIPS) compared to state-of-the-art methods on the LLFF dataset. Achieves superior uncertainty estimation accuracy, measured by AUSE RMSE, compared to existing approaches. Demonstrates the effectiveness of the AUSE loss term in enhancing uncertainty map prediction. Independence assumption between Gaussian kernels, though more general than previous works, might be limiting. Exploration of alternative uncertainty estimation metrics beyond AUSE for potential further improvements. gaussian splatting, uncertainty estimation, novel view synthesis, variational inference, ause
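The AUSE term added to the loss can be computed as below: pixels are removed in order of predicted uncertainty, the RMSE of the remainder is compared against an oracle curve that removes pixels by true error, and the gap between the two curves is integrated. Bin count and normalization choices here are assumptions.

```python
# Area Under the Sparsification Error (AUSE) with an RMSE base metric.
import numpy as np

def ause_rmse(error, uncertainty, n_bins=50):
    """error, uncertainty: flat arrays of per-pixel absolute error / predicted sigma."""
    def sparsification(scores):
        order = np.argsort(scores)[::-1]         # drop most-suspect pixels first
        sorted_err = error[order]
        curve = []
        for frac in np.linspace(0.0, 1.0, n_bins, endpoint=False):
            kept = sorted_err[int(frac * len(sorted_err)):]
            curve.append(np.sqrt(np.mean(kept ** 2)))
        return np.asarray(curve)

    pred_curve = sparsification(uncertainty)
    oracle_curve = sparsification(error)         # best possible removal order
    full = np.sqrt(np.mean(error ** 2))          # normalize by full-image RMSE
    return np.trapz((pred_curve - oracle_curve) / full, dx=1.0 / n_bins)
```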
2403.18417 Report ECNet: Effective Controllable Text-to-Image Diffusion Models Sicheng Li, Keqiang Sun, Zhixin Lai, Xiaoshi Wu, Feng Qiu, Haoran Xie, Kazunori Miyata, Hongsheng Li The conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised mainly for two reasons, ambiguous condition input and inadequate condition guidance over single denoising loss. To address the challenges, we introduce two innovative solutions. Firstly, we propose a Spatial Guidance Injector (SGI) which enhances conditional detail by encoding text inputs with precise annotation information. This method directly tackles the issue of ambiguous control inputs by providing clear, annotated guidance to the model. Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss (DCL), which applies supervision on the denoised latent code at any given time step. This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output. The combination of SGI and DCL results in our Effective Controllable Network (ECNet), which offers a more accurate controllable end-to-end text-to-image generation framework with a more precise conditioning input and stronger controllable supervision. We validate our approach through extensive experiments on generation under various conditions, such as human body skeletons, facial landmarks, and sketches of general objects. The results consistently demonstrate that our method significantly enhances the controllability and robustness of the generated images, outperforming existing state-of-the-art controllable text-to-image models. This paper introduces ECNet, a novel framework for controllable text-to-image generation that leverages precise annotation information alongside text descriptions and a new Diffusion Consistency Loss (DCL). Existing controllable text-to-image diffusion models often lack precision due to ambiguous condition inputs and inadequate condition guidance. ECNet employs a Spatial Guidance Injector (SGI) to combine annotations with text for precise control. It introduces DCL to supervise the denoised latent code at each time step, ensuring consistency with the input signal. ECNet achieves state-of-the-art performance on skeleton control tasks, surpassing HumanSD and ControlNet in metrics like AP and CAP. It demonstrates superior accuracy in facial landmark control tasks, exhibiting significant improvements in NME scores compared to baselines. ECNet effectively handles sketch control tasks, showcasing its versatility and capability in generating images from various conditions. The effectiveness of ECNet's supervision relies on accurate annotation detection, which could be affected by detector performance. The evaluation of ECNet is limited in scope, lacking comprehensive testing across diverse conditions and scenarios. text-to-image generation, diffusion models, controllable generation, spatial guidance injector, diffusion consistency loss
2403.18361 Report ViTAR: Vision Transformer with Any Resolution Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration. Secondly, we introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions, thereby preventing overfitting to any single training resolution. Our resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution, all while reducing computational costs. ViTAR also shows strong performance in downstream tasks such as instance and semantic segmentation and can be easily combined with self-supervised learning techniques like Masked AutoEncoder. Our work provides a cost-effective solution for enhancing the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing. This paper introduces ViTAR (Vision Transformer with Any Resolution) to enhance the scalability of Vision Transformers (ViTs) across different image resolutions. Existing ViTs often suffer performance degradation when processing resolutions different from training data, limiting their real-world applicability. ViTAR incorporates two key innovations: (1) Adaptive Token Merger (ATM) for efficient incremental token integration across resolutions, and (2) Fuzzy Positional Encoding (FPE) to enhance positional awareness consistency across resolutions. ViTAR achieves strong resolution generalization, reaching 83.3% top-1 accuracy at 1120x1120 and 80.4% at 4032x4032 resolution while reducing computational costs. ViTAR demonstrates robust performance in downstream tasks like instance and semantic segmentation. The model effectively combines with self-supervised learning techniques like Masked AutoEncoder (MAE). The impact of varying the number of iterations in ATM across different tasks and datasets requires further exploration. Investigating the effectiveness of FPE in other self-supervised learning frameworks beyond MAE is a promising direction. vision transformer, multi-resolution, positional encoding, adaptive token merger, self-supervised learning
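A hedged sketch of fuzzy positional encoding in the spirit of ViTAR: each token's reference coordinate is jittered within its own grid cell during training and a shared positional map is resampled at the perturbed location, so no single resolution's exact positions are memorized. The grid sizes and the bilinear resampling are assumptions.

```python
# Fuzzy positional encoding: jitter token coordinates within their grid cell,
# then resample a learnable positional map at the perturbed locations.
import torch
import torch.nn.functional as F

def fuzzy_pos_embed(pos_table, grid_h, grid_w, training=True):
    """pos_table: (1, C, Hp, Wp) learnable positional map to resample from."""
    ys = torch.linspace(-1, 1, grid_h)
    xs = torch.linspace(-1, 1, grid_w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([gx, gy], dim=-1)[None]          # (1, H, W, 2)
    if training:
        # Uniform jitter bounded by half a grid cell in each direction.
        jitter = (torch.rand_like(coords) - 0.5) * 2
        jitter[..., 0] /= max(grid_w - 1, 1)
        jitter[..., 1] /= max(grid_h - 1, 1)
        coords = coords + jitter
    emb = F.grid_sample(pos_table, coords, align_corners=True)  # (1, C, H, W)
    return emb.flatten(2).transpose(1, 2)                 # (1, H*W, C)
```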
2403.18036 Report Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, Siyuan Huang Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality, language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes. This paper introduces a novel two-stage model for generating human motion in 3D scenes guided by language descriptions, using scene affordance maps as an intermediate representation to connect scene grounding with motion generation. Generating realistic human-scene interactions within 3D environments from language instructions is challenging due to the complexity of joint modeling and the scarcity of comprehensive language-scene-motion datasets. The model consists of two stages: an Affordance Diffusion Model (ADM) predicts affordance maps from scene point clouds and language descriptions, and an Affordance-to-Motion Diffusion Model (AMDM) synthesizes human motions conditioned on the predicted affordance maps and language. The method outperforms baselines in text-to-motion generation on HumanML3D and scene-aware motion generation on HUMANISE datasets. It exhibits strong generalization ability, generating plausible motions for novel language-scene pairs. Using scene affordance as an intermediate representation enhances both scene grounding and motion detail. The model's reliance on diffusion models leads to slower inference times, which can be addressed in future work. Although the use of affordance maps alleviates the data scarcity issue, collecting more diverse and comprehensive language-scene-motion data remains crucial. human-scene interaction, motion generation, scene affordance, diffusion model, 3d scene understanding
2403.18035 Report Bidirectional Consistency Models Liangchen Li, Jiajun He Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector, a process that corresponds to moving along the probability flow ordinary differential equation (PF ODE). Interestingly, DMs can also invert an input image to noise by moving backward along the PF ODE, a key operation for downstream tasks such as interpolation and image editing. However, the iterative nature of this process restricts its speed, hindering its broader application. Recently, Consistency Models (CMs) have emerged to address this challenge by approximating the integral of the PF ODE, largely reducing the number of iterations. Yet, the absence of an explicit ODE solver complicates the inversion process. To resolve this, we introduce the Bidirectional Consistency Model (BCM), which learns a single neural network that enables both forward and backward traversal along the PF ODE, efficiently unifying generation and inversion tasks within one framework. Notably, our proposed method enables one-step generation and inversion while also allowing the use of additional steps to enhance generation quality or reduce reconstruction error. Furthermore, by leveraging our model's bidirectional consistency, we introduce a sampling strategy that can enhance FID while preserving the generated image content. We further showcase our model's capabilities in several downstream tasks, such as interpolation and inpainting, and present demonstrations of potential applications, including blind restoration of compressed images and defending black-box adversarial attacks. The paper introduces Bidirectional Consistency Model (BCM), which learns a single neural network for both forward and backward traversal along the Probability Flow ODE, unifying generation and inversion tasks for diffusion models within a single framework. Diffusion models are powerful generative models but their iterative nature for generation and inversion limits their speed. This paper aims to accelerate both tasks while maintaining or improving quality. The paper extends Consistency Models by learning a bidirectional mapping between points on the same trajectory of the PF ODE. It introduces Bidirectional Consistency Training (BCT) that combines a consistency term with a soft trajectory constraint. The model enables one-step generation and inversion and also supports multi-step sampling strategies like ancestral and zigzag sampling. BCM achieves comparable or better generation quality than earlier diffusion models with significantly fewer function evaluations (NFEs). BCM can achieve lower reconstruction error than ODE-based diffusion models with significantly fewer NFEs. BCM enables applications like image interpolation between real images, superior inpainting, blind restoration of compressed images, and defending black-box adversarial attacks. While multi-step sampling in BCM improves results, the gains plateau quickly beyond a certain point due to error accumulation. The inversion process can sometimes alter image content, impacting downstream applications. diffusion models, generative models, image generation, image inversion, consistency models
2403.17998 Report Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may be less expressive to capture the video embedding and empower the retrieval. In this study, we propose a new stochastic text modeling method T-MASS, i.e., text is modeled as a stochastic embedding, to enrich text embedding with a flexible and resilient semantic range, yielding a text mass. To be specific, we introduce a similarity-aware radius module to adapt the scale of the text mass upon the given text-video pairs. Plus, we design and develop a support text regularization to further control the text mass during the training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over baseline (3% to 6.3% by R@1). Also, T-MASS achieves state-of-the-art performance on five benchmark datasets, including MSRVTT, LSMDC, DiDeMo, VATEX, and Charades. This paper presents T-MASS, a new stochastic text modeling approach for text-video retrieval, enhancing text embedding with a resilient semantic range to better capture video clues. Current methods struggle to align short, semantically limited text with the rich content of videos, hindering accurate retrieval. T-MASS models text as a "mass" using stochastic embedding, incorporating a similarity-aware radius module for scale adaptation and support text regularization for position and scale control. T-MASS effectively bridges relevant pairs while distancing irrelevant ones. The method facilitates precise text semantics mapping, adapting to video variations. T-MASS achieves state-of-the-art performance on five benchmark datasets, surpassing baselines by 3-6.3% on R@1. The study primarily focuses on text embedding without extensively exploring advanced video feature extraction techniques. Further research could investigate the incorporation of additional modalities, like audio, to enhance retrieval accuracy. text-video retrieval, stochastic text modeling, text mass, similarity-aware radius, support text regularization
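The "text mass" itself reduces to a reparameterized sample around the point embedding; a minimal sketch with an assumed similarity-aware radius head is given below (the paper's exact radius module may differ).

```python
# Stochastic text embedding: t_s = t + R * eps, with R predicted from the
# text/video pair by a small head.
import torch
import torch.nn as nn

class StochasticText(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Similarity-aware radius: one plausible instantiation over the
        # concatenated text and video embeddings.
        self.radius = nn.Sequential(nn.Linear(2 * dim, dim), nn.Softplus())

    def forward(self, text_emb, video_emb):
        r = self.radius(torch.cat([text_emb, video_emb], dim=-1))
        eps = torch.randn_like(text_emb)
        return text_emb + r * eps      # one sample from the text mass
```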
2403.17935 Report OmniVid: A Generative Framework for Universal Video Understanding Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational language models, such as GPT-3, with extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way, a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at https://github.com/wangjk666/OmniVid. This paper introduces OmniVid, a generative framework that unifies various video understanding tasks by representing the output as a sequence of tokens from an enriched vocabulary, encompassing words, time tokens, and box tokens. Existing video understanding models typically rely on task-specific architectures and annotations, hindering generalization. OmniVid addresses this limitation by unifying the output space, enabling a single framework to handle diverse video tasks. OmniVid utilizes an encoder-decoder architecture. A video encoder extracts features, while a language encoder processes prompts. A novel Mixed Q-former aggregates frame features into content, sentence, and box queries. A token decoder generates the final token sequence based on the multimodal input. OmniVid achieves state-of-the-art performance on multiple video benchmarks, including action recognition (83.6% on Kinetics-400), clip captioning (56.6 CIDEr on MSRVTT), and dense video captioning (5.6 SODA_c on ActivityNet). The framework effectively handles both coarse-grained tasks like action recognition and fine-grained tasks like object tracking. Jointly training the model across different video tasks shows promising results for classification and captioning while revealing challenges for localization tasks. Joint training of OmniVid for spatial-temporal localization tasks currently shows performance degradation compared to separate training. Sparse frame sampling in dense video captioning can lead to overlooking subtle activity changes. video understanding, generative model, unified framework, video captioning, object tracking
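The enriched output vocabulary is straightforward to sketch: timestamps and box coordinates are discretized into special tokens appended after the word vocabulary, so every task becomes token-sequence generation. The bin counts and token layout below are assumptions.

```python
# Map continuous times and boxes into discrete token ids that extend the
# word vocabulary (words | time tokens | box tokens).
def time_to_token(t_sec, duration, vocab_size, n_time_bins=100):
    bin_id = min(int(t_sec / max(duration, 1e-6) * n_time_bins), n_time_bins - 1)
    return vocab_size + bin_id                     # time tokens follow words

def box_to_tokens(box_xyxy, img_w, img_h, vocab_size,
                  n_time_bins=100, n_box_bins=1000):
    base = vocab_size + n_time_bins                # box tokens follow time tokens
    norm = [box_xyxy[0] / img_w, box_xyxy[1] / img_h,
            box_xyxy[2] / img_w, box_xyxy[3] / img_h]
    return [base + min(int(c * n_box_bins), n_box_bins - 1) for c in norm]
```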
2403.17931 Report Track Everything Everywhere Fast and Robustly Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu, Kostas Daniilidis We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. OmniMotion is sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by the vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than 10 times faster), robustness, and accuracy in tracking over the SoTA optimization-based method OmniMotion. This paper introduces an optimization-based approach for fast and robust tracking of any pixel in a video, improving upon the efficiency and robustness of OmniMotion. Long-term pixel tracking is fundamental for various computer vision tasks, but existing methods struggle with efficiency, robustness, or accuracy. This work addresses these limitations. The authors propose CaDeX++, an invertible deformation network with local feature grid factorization and non-linear interpolation. They leverage monocular depth estimation (ZoeDepth) for geometry initialization and integrate DINOv2 semantics for long-term correspondence. CaDeX++ significantly improves training speed (over 10 times faster) compared to OmniMotion. The proposed method achieves higher accuracy and robustness in tracking, particularly in challenging scenarios with occlusions and complex motions. Depth prior initialization and long-term semantic integration are shown to contribute significantly to the performance gains. The method's performance heavily depends on the accuracy of the input depth and pixel correspondences. Future work includes exploring the application of CaDeX++ to other tasks like 3D reconstruction and object pose estimation. pixel tracking, long-term tracking, test-time optimization, invertible deformation network, vision foundation models
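The invertible backbone CaDeX++ builds on can be illustrated with a single coupling block: one subset of coordinates is scaled and shifted by networks conditioned on the remaining coordinates plus a local spatio-temporal feature, and the mapping inverts exactly. Widths, activations, and the feature-grid lookup are assumptions, not the paper's configuration.

```python
# Affine coupling block with feature conditioning; forward and inverse are
# exact inverses of each other by construction.
import torch
import torch.nn as nn

class CouplingBlock(nn.Module):
    def __init__(self, dim_a=1, dim_b=2, feat_dim=16, hidden=64):
        super().__init__()
        self.scale = nn.Sequential(nn.Linear(dim_b + feat_dim, hidden),
                                   nn.SiLU(), nn.Linear(hidden, dim_a), nn.Tanh())
        self.shift = nn.Sequential(nn.Linear(dim_b + feat_dim, hidden),
                                   nn.SiLU(), nn.Linear(hidden, dim_a))

    def forward(self, xa, xb, feat):
        h = torch.cat([xb, feat], dim=-1)
        return xa * torch.exp(self.scale(h)) + self.shift(h), xb

    def inverse(self, ya, yb, feat):
        h = torch.cat([yb, feat], dim=-1)
        return (ya - self.shift(h)) * torch.exp(-self.scale(h)), yb
```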
2403.17924 Report AID: Attention Interpolation of Text-to-Image Diffusion Qiyuan He, Jinghao Wang, Ziwei Liu, Angela Yao Conditional diffusion models can create unseen images in various settings, aiding image interpolation. Interpolation in latent spaces is well-studied, but interpolation with specific conditions like text or poses is less understood. Simple approaches, such as linear interpolation in the space of conditions, often result in images that lack consistency, smoothness, and fidelity. To that end, we introduce a novel training-free technique named Attention Interpolation via Diffusion (AID). Our key contributions include 1) proposing an inner/outer interpolated attention layer; 2) fusing the interpolated attention with self-attention to boost fidelity; and 3) applying beta distribution to selection to increase smoothness. We also present a variant, Prompt-guided Attention Interpolation via Diffusion (PAID), that considers interpolation as a condition-dependent generative process. This method enables the creation of new images with greater consistency, smoothness, and efficiency, and offers control over the exact path of interpolation. Our approach demonstrates effectiveness for conceptual and spatial interpolation. Code and demo are available at https://github.com/QY-H00/attention-interpolation-diffusion. This paper proposes AID, a training-free technique for text-to-image diffusion models, enabling nuanced spatial and conceptual interpolations between images with different text prompts. Existing methods for interpolation in the latent space fail to generate consistent, smooth, and high-fidelity images when interpolating between distinct textual conditions. AID introduces an inner/outer interpolated attention layer, fuses it with self-attention, and utilizes beta distribution for sequence selection to enhance interpolation quality. It also presents PAID, a variant allowing prompt-guided interpolation paths. AID significantly improves smoothness, consistency, and fidelity of interpolated image sequences compared to text embedding interpolation. Inner attention interpolation (AID-I) excels in conceptual blending, while outer attention interpolation (AID-O) is superior in spatial blending. Prompt guidance in PAID enables the generation of compositional scenes and offers control over the interpolation path. The selection of optimal hyperparameters for beta distribution requires Bayesian optimization, adding computational overhead. The effectiveness of prompt guidance can be sensitive to the choice of warm-up steps. text-to-image synthesis, diffusion models, image interpolation, attention mechanism, prompt engineering
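A simplified stand-in for the interpolated attention layer: the query attends to the keys/values of both endpoint prompts and the two outputs are fused with the interpolation coefficient. This is not the paper's exact inner/outer formulation, just the core mechanism it revolves around.

```python
# Cross-attention interpolated between two prompt conditions.
import torch
import torch.nn.functional as F

def interpolated_cross_attention(q, k1, v1, k2, v2, alpha: float):
    """q: (B, Lq, D); k*/v*: (B, Lk, D); alpha in [0, 1]."""
    out1 = F.scaled_dot_product_attention(q, k1, v1)
    out2 = F.scaled_dot_product_attention(q, k2, v2)
    return (1.0 - alpha) * out1 + alpha * out2
```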
2403.17898 Report Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, Bo Dai The recent 3D Gaussian splatting (3D-GS) has shown remarkable rendering fidelity and efficiency compared to NeRF-based neural scene representations. While demonstrating the potential for real-time rendering, 3D-GS encounters rendering bottlenecks in large scenes with complex details due to an excessive number of Gaussian primitives located within the viewing frustum. This limitation is particularly noticeable in zoom-out views and can lead to inconsistent rendering speeds in scenes with varying details. Moreover, it often struggles to capture the corresponding level of details at different scales with its heuristic density control operation. Inspired by the Level-of-Detail (LOD) techniques, we introduce Octree-GS, featuring an LOD-structured 3D Gaussian approach supporting level-of-detail decomposition for scene representation that contributes to the final rendering results. Our model dynamically selects the appropriate level from the set of multi-resolution anchor points, ensuring consistent rendering performance with adaptive LOD adjustments while maintaining high-fidelity rendering results. Octree-GS introduces a novel Level-of-Detail (LOD) structure to 3D Gaussian Splatting using an octree for hierarchical organization of anchor Gaussians, enabling consistent real-time rendering in large scenes. Existing 3D Gaussian Splatting methods struggle with inconsistent rendering speeds and compromised quality in large, detail-rich scenes due to the lack of LOD awareness. An octree partitions the scene, assigning anchor Gaussians to LOD levels based on observation distance and scene richness. Progressive training refines anchors, and opacity blending ensures smooth LOD transitions during rendering. Octree-GS achieves competitive rendering quality with significantly fewer Gaussian primitives compared to baselines, leading to faster rendering. The method effectively handles multi-resolution datasets and addresses aliasing issues inherent in previous approaches. Ablation studies demonstrate the effectiveness of the LOD structure, adaptive anchor control, and progressive training. The octree construction and progressive training require hyperparameter tuning. Future work includes addressing the inherent limitations of 3D-GS, such as dependency on initial sparse point clouds and lack of geometry support. neural scene rendering, 3d gaussian splatting, consistent real-time rendering, level-of-detail, octree
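Distance-based LOD selection over the anchor octree can be sketched with the standard log2 heuristic below; the paper's exact mapping, which also accounts for scene richness, may differ.

```python
# Pick an octree LOD per anchor from the camera-to-anchor distance:
# closer viewpoints activate deeper (finer) levels.
import numpy as np

def select_lod(cam_pos, anchor_pos, d_max, n_levels):
    """Return an integer LOD per anchor: 0 = coarsest, n_levels-1 = finest."""
    dist = np.linalg.norm(anchor_pos - cam_pos[None, :], axis=-1)
    level = np.floor(np.log2(np.maximum(d_max / np.maximum(dist, 1e-6), 1.0)))
    return np.clip(level, 0, n_levels - 1).astype(int)
```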
2403.17888 Report 2D Gaussian Splatting for Geometrically Accurate Radiance Fields Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, Shenghua Gao 3D Gaussian Splatting (3DGS) has recently revolutionized radiance field reconstruction, achieving high quality novel view synthesis and fast rendering speed without baking. However, 3DGS fails to accurately represent surfaces due to the multi-view inconsistent nature of 3D Gaussians. We present 2D Gaussian Splatting (2DGS), a novel approach to model and reconstruct geometrically accurate radiance fields from multi-view images. Our key idea is to collapse the 3D volume into a set of 2D oriented planar Gaussian disks. Unlike 3D Gaussians, 2D Gaussians provide view-consistent geometry while modeling surfaces intrinsically. To accurately recover thin surfaces and achieve stable optimization, we introduce a perspective-accurate 2D splatting process utilizing ray-splat intersection and rasterization. Additionally, we incorporate depth distortion and normal consistency terms to further enhance the quality of the reconstructions. We demonstrate that our differentiable renderer allows for noise-free and detailed geometry reconstruction while maintaining competitive appearance quality, fast training speed, and real-time rendering. Our code will be made publicly available. This paper introduces 2D Gaussian Splatting (2DGS), a novel method for reconstructing geometrically accurate radiance fields from multi-view images, using 2D oriented planar Gaussian disks as primitives. Existing methods like 3D Gaussian Splatting (3DGS) struggle to accurately capture intricate surface details. This new approach aims to improve geometric accuracy in radiance field reconstruction while maintaining high-quality novel view synthesis. The method utilizes 2D Gaussian primitives, employs a perspective-accurate 2D splatting process leveraging ray-splat intersection and rasterization, and incorporates depth distortion and normal consistency terms to enhance reconstruction quality. 2DGS achieves state-of-the-art geometry reconstruction compared to other explicit representation methods on DTU and Tanks and Temples datasets. It offers competitive novel view synthesis results compared to leading implicit and explicit methods on the Mip-NeRF360 dataset. The method boasts significantly faster reconstruction times, approximately 100 times faster than implicit methods and more than 3 times faster than concurrent work. 2DGS assumes surfaces with full opacity, potentially causing inaccuracies when handling semi-transparent surfaces. The current densification strategy might not adequately represent fine geometric details in texture-less regions, requiring further investigation. novel view synthesis, radiance fields, surface reconstruction, 2d gaussian splatting, differentiable rendering
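The core geometric step of 2DGS, evaluating an oriented planar Gaussian disk where a ray actually intersects its plane rather than at a screen-space projection, can be sketched as below. This is a plain ray-plane version for illustration; the paper's rasterizer uses a homogeneous-plane formulation and adds depth-distortion and normal-consistency losses on top.

```python
import numpy as np

def ray_splat_weight(o, d, p, t_u, t_v, s_u, s_v):
    """Perspective-accurate evaluation of a 2D Gaussian disk along a ray.

    The splat is an oriented planar Gaussian with center `p`, unit tangent
    axes `t_u`, `t_v`, and per-axis scales `s_u`, `s_v`. We intersect the ray
    o + t*d with the splat plane exactly and evaluate the Gaussian in the
    splat's local UV frame, rather than projecting the splat to the screen.
    """
    n = np.cross(t_u, t_v)                      # splat normal
    denom = float(np.dot(d, n))
    if abs(denom) < 1e-8:                       # ray parallel to the disk
        return 0.0, np.inf
    t = float(np.dot(p - o, n)) / denom         # depth of the intersection
    if t <= 0:                                  # intersection behind the origin
        return 0.0, np.inf
    x = o + t * d                               # intersection point in 3D
    u = np.dot(x - p, t_u) / s_u                # local UV coordinates on the disk
    v = np.dot(x - p, t_v) / s_v
    weight = np.exp(-0.5 * (u * u + v * v))     # Gaussian falloff on the disk
    return weight, t

if __name__ == "__main__":
    o = np.array([0.0, 0.0, 0.0])
    d = np.array([0.0, 0.0, 1.0])
    p = np.array([0.1, 0.0, 2.0])
    t_u = np.array([1.0, 0.0, 0.0])
    t_v = np.array([0.0, 1.0, 0.0])
    w, depth = ray_splat_weight(o, d, p, t_u, t_v, s_u=0.3, s_v=0.3)
    print(f"weight={w:.3f}, depth={depth:.3f}")
```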
2403.17870 Report Boosting Diffusion Models with Moving Average Sampling in Frequency Domain Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples. Instead of simply applying moving average to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform moving average to avoid distribution shift across timesteps. In view that diffusion models evolve the recovery from low-frequency components to high-frequency details, we further decompose the samples into different frequency components and execute moving average separately on each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF could be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that our MASF leads to superior performances compared to the baselines, with almost negligible additional complexity cost. This paper introduces MASF (Moving Average Sampling in Frequency domain), a training-free method to enhance the stability of diffusion models during image generation. Existing diffusion models often suffer from denoising instability due to relying solely on the current sample for denoising and not fully exploiting frequency evolution during generation. MASF reinterprets denoising as model optimization and utilizes moving average on prior samples in the data space. It then leverages DWT to apply moving average separately on different frequency components, further enhanced by a dynamic weighting scheme that prioritizes low-frequency components initially and gradually shifts focus to high-frequency details. MASF consistently improves FID scores across various datasets (ImageNet, MS-COCO, LSUN, FFHQ), especially for smaller NFEs where instability is more prominent. MASF is compatible with different solvers (DDIM, DPM-Solver++, UniPC, F-PNDM) and sampling techniques like Classifier Guidance, demonstrating its generalizability. Ablation studies confirm the effectiveness of each component in MASF, with moving average in the frequency domain and dynamic weighting contributing most significantly. The paper primarily focuses on image generation and hasn't been explored for other diffusion model applications. Exploring more sophisticated frequency decomposition techniques beyond DWT might further enhance MASF. diffusion models, image generation, denoising stability, moving average, frequency domain
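A rough sketch of the moving-average-in-frequency idea, assuming a one-level Haar DWT and a hand-picked momentum schedule as stand-ins for MASF's decomposition and dynamic weighting; it only shows how an EMA over denoised x0 estimates could be applied per frequency band.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2x2 Haar decomposition of an HxW array (H, W even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2.0, (a - b + c - d) / 2.0,
            (a + b - c - d) / 2.0, (a - b - c + d) / 2.0)

def haar_idwt2(ll, lh, hl, hh):
    a = (ll + lh + hl + hh) / 2.0
    b = (ll - lh + hl - hh) / 2.0
    c = (ll + lh - hl - hh) / 2.0
    d = (ll - lh - hl + hh) / 2.0
    h, w = ll.shape
    x = np.zeros((2 * h, 2 * w))
    x[0::2, 0::2], x[0::2, 1::2] = a, b
    x[1::2, 0::2], x[1::2, 1::2] = c, d
    return x

class FrequencyEMA:
    """Exponential moving average applied per frequency band of x0 estimates.
    The momenta and the linear progress weighting are illustrative choices
    standing in for MASF's dynamic weighting scheme."""
    def __init__(self, momentum_low=0.9, momentum_high=0.5):
        self.m_low, self.m_high = momentum_low, momentum_high
        self.state = None

    def update(self, x0_pred, progress):
        """progress in [0, 1]: 0 at the first (noisiest) step, 1 at the last."""
        bands = haar_dwt2(x0_pred)
        if self.state is None:
            self.state = list(bands)
        else:
            # low-frequency bands carry more history early in sampling,
            # high-frequency bands only toward the end
            w_low, w_high = self.m_low * (1.0 - progress), self.m_high * progress
            moms = (w_low, w_high, w_high, w_high)
            self.state = [m * s + (1.0 - m) * b
                          for m, s, b in zip(moms, self.state, bands)]
        return haar_idwt2(*self.state)

if __name__ == "__main__":
    ema = FrequencyEMA()
    for step in range(10):
        x0_hat = np.random.randn(64, 64)         # stand-in for a denoised estimate
        smoothed = ema.update(x0_hat, progress=step / 9)
    print(smoothed.shape)
```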
2403.17839 Report ReMamber: Referring Image Segmentation with Mamba Twister Yuhuan Yang, Chaofan Ma, Jiangchao Yao, Zhun Zhong, Ya Zhang, Yanfeng Wang Referring Image Segmentation (RIS) leveraging transformers has achieved great success on the interpretation of complex visual-language tasks. However, the quadratic computation cost makes it resource-consuming in capturing long-range visual-language dependencies. Fortunately, Mamba addresses this with efficient linear complexity in processing. However, directly applying Mamba to multi-modal interactions presents challenges, primarily due to inadequate channel interactions for the effective fusion of multi-modal data. In this paper, we propose ReMamber, a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism. We achieve the state-of-the-art on three challenging benchmarks. Moreover, we conduct thorough analyses of ReMamber and discuss other fusion designs using Mamba. These provide valuable perspectives for future research. This paper presents ReMamber, a novel architecture for Referring Image Segmentation (RIS) that leverages the Mamba framework for efficient and effective multi-modal understanding. Existing transformer-based RIS models face limitations in efficiently capturing long-range visual-language dependencies due to quadratic computation costs. ReMamber addresses this by utilizing Mamba, which offers linear complexity. ReMamber employs Mamba Twister blocks, consisting of visual state space (VSS) layers and a Twisting layer. The VSS layers process spatial features, while the Twisting layer injects textual information via global and local interactions, enhancing cross-modality communication using a twisting mechanism. ReMamber achieves state-of-the-art results on three challenging RIS benchmarks: RefCOCO, RefCOCO+, and G-Ref. The proposed Mamba Twister outperforms other multi-modal fusion designs, including attention-based, in-context, and norm adaptation approaches. Ablation studies highlight the importance of both Channel and Spatial Scans within the twisting mechanism for effective modality fusion. The current segmentation decoder uses a simple convolutional design, which could be improved by exploring more sophisticated multi-modal decoders. Further research is needed to address the sub-optimal compatibility of cross-attention mechanisms within the Mamba architecture. referring image segmentation, multi-modal understanding, mamba architecture, state space models, vision-language fusion
2403.17823 Report Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders Alexandre Eymaël, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method specifically differs by exclusively considering pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performances and drastically reducing pre-training time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn objects from motion, but rather thanks to the Siamese architecture. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE. Introduces CropMAE, a self-supervised pre-training method using cropped image pairs with high asymmetric masking for learning object-centric representations, eliminating the need for video data. Addresses limitations of Siamese MAEs relying on video data and extensive training by enabling faster and more efficient pre-training on image datasets while achieving competitive performance. Trains a Siamese ViT encoder-decoder to reconstruct a highly masked random crop of an image using another crop as reference, exploring different cropping strategies and pushing masking ratio to 98.5%. Achieves faster pre-training and better performance on DAVIS-2017 object propagation than SiamMAE trained on K400. Demonstrates learning object-centric representations from still images without explicit motion, challenging the assumption that motion is essential for such representations. Shows the effectiveness of extremely high masking ratios (98.5%) with only two visible patches, exceeding previous limits. Scalability to larger models and datasets requires further investigation. Understanding the unique contributions of video frames beyond still images for pre-training is crucial. self-supervised learning, masked autoencoders, siamese networks, image pre-training, video segmentation
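The data-side recipe of CropMAE, two crops of the same image with an extreme asymmetric mask on the target crop, is easy to sketch; patch size, crop size, and the sampling strategy below are illustrative defaults rather than the paper's settings.

```python
import numpy as np

def random_crop(img, size):
    """img: HxWxC array; returns a random size x size crop."""
    h, w = img.shape[:2]
    y = np.random.randint(0, h - size + 1)
    x = np.random.randint(0, w - size + 1)
    return img[y:y + size, x:x + size]

def patchify(img, patch=16):
    """Flatten a square crop into (num_patches, patch*patch*C) rows."""
    s = img.shape[0] // patch
    return img.reshape(s, patch, s, patch, -1).swapaxes(1, 2).reshape(s * s, -1)

def cropmae_pair(img, crop=224, patch=16, mask_ratio=0.985):
    """Build one CropMAE-style training pair: an unmasked reference crop and a
    heavily masked target crop. With 224/16 = 14x14 = 196 patches, a 98.5%
    ratio leaves only about 3 visible patches in the target."""
    ref = patchify(random_crop(img, crop), patch)        # reference: fully visible
    tgt = patchify(random_crop(img, crop), patch)        # target: to be reconstructed
    n = tgt.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep_idx = np.random.permutation(n)[:n_keep]         # indices of visible patches
    return ref, tgt, keep_idx

if __name__ == "__main__":
    image = np.random.rand(256, 256, 3)
    ref, tgt, keep = cropmae_pair(image)
    print(ref.shape, tgt.shape, "visible target patches:", len(keep))
```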
2403.17804 Report Improving Text-to-Image Consistency via Automatic Prompt Optimization Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs. This paper introduces OPT2I, the first text-to-image (T2I) optimization-by-prompting framework, designed to enhance prompt-image consistency. Existing methods for improving consistency often require modifying model weights, limiting their applicability. OPT2I addresses this by working exclusively in text space, making it compatible with various T2I models, even those accessible only through APIs. OPT2I employs an iterative process involving a pre-trained T2I model, a large language model (LLM), and a consistency metric (e.g., decomposed CLIPScore or Davidsonian Scene Graph). The LLM refines user prompts by leveraging past prompt-score pairs to generate alternatives that maximize consistency. OPT2I consistently improves prompt-image consistency, outperforming paraphrasing baselines and achieving up to 24.9% improvement over user prompts. The framework demonstrates robustness across various LLMs, T2I models, and consistency metrics. Qualitative analysis reveals that OPT2I emphasizes initially ignored visual elements by either adding detail or strategically reordering prompt components. The method relies on the reliability of prompt-image consistency scores, which can be inaccurate due to limitations in current metrics (e.g., bag-of-words behavior in CLIP). The iterative optimization process introduces runtime overhead compared to directly using the user prompt. text-to-image generation, prompt optimization, large language models, prompt-image consistency, in-context learning
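The optimization-by-prompting loop can be summarized as below; `llm_revise`, `generate_image`, and `consistency_score` are placeholders for the LLM, the T2I model, and a consistency metric such as decomposed CLIPScore or DSG, not a real API.

```python
def optimize_prompt(user_prompt, llm_revise, generate_image, consistency_score,
                    iters=10, k=4, history_size=5):
    """Minimal optimization-by-prompting loop (a sketch, not OPT2I's exact recipe).

    `llm_revise(user_prompt, history, k)` is assumed to return k candidate
    revised prompts given the best (score, prompt) pairs seen so far;
    `generate_image` wraps the T2I model and `consistency_score` wraps a
    prompt-image consistency metric. All three callables are placeholders.
    """
    history = []  # list of (score, prompt), kept best-first

    def evaluate(prompt):
        image = generate_image(prompt)
        # score consistency against the ORIGINAL user prompt, not the revision,
        # so the optimizer cannot cheat by simplifying the request
        return consistency_score(image, user_prompt)

    best_prompt = user_prompt
    best_score = evaluate(user_prompt)
    history.append((best_score, best_prompt))

    for _ in range(iters):
        candidates = llm_revise(user_prompt, history[:history_size], k)
        for cand in candidates:
            score = evaluate(cand)
            history.append((score, cand))
            if score > best_score:
                best_score, best_prompt = score, cand
        history.sort(key=lambda t: t[0], reverse=True)

    return best_prompt, best_score
```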
2403.17782 Report GenesisTex: Adapting Image Denoising Diffusion to Texture Space Chenjian Gao, Boyan Jiang, Xinghui Li, Yingpeng Zhang, Qian Yu We present GenesisTex, a novel method for synthesizing textures for 3D geometries from text descriptions. GenesisTex adapts the pretrained image diffusion model to texture space by texture space sampling. Specifically, we maintain a latent texture map for each viewpoint, which is updated with predicted noise on the rendering of the corresponding viewpoint. The sampled latent texture maps are then decoded into a final texture map. During the sampling process, we focus on both global and local consistency across multiple viewpoints: global consistency is achieved through the integration of style consistency mechanisms within the noise prediction network, and low-level consistency is achieved by dynamically aligning latent textures. Finally, we apply reference-based inpainting and img2img on denser views for texture refinement. Our approach overcomes the limitations of slow optimization in distillation-based methods and instability in inpainting-based methods. Experiments on meshes from various sources demonstrate that our method surpasses the baseline methods quantitatively and qualitatively. Presents GenesisTex, a novel method for synthesizing textures on 3D geometries from text descriptions using texture space sampling in an image diffusion model. Addresses limitations of existing methods (slow optimization in distillation-based and instability in inpainting-based) for generating high-quality textures directly from text input. Adapts a pretrained image diffusion model (Stable Diffusion) to texture space. It utilizes texture space sampling for multi-view consistent generation, enhanced by style consistency mechanisms and dynamic alignment. Further refinement is achieved through reference-based inpainting and Img2Img on denser views. Achieves state-of-the-art texture synthesis quality, surpassing baselines in FID/KID metrics and user studies. Generates detailed, clean, and naturally colored textures for diverse geometries within minutes. Demonstrates the effectiveness of texture space sampling, style consistency, and dynamic alignment in achieving multi-view consistency. Significant memory cost limits the number of viewpoints during generation, requiring post-processing steps. Future work could explore hierarchical style consistency to reduce memory cost and investigate texture map generation compatible with PBR workflows. texture synthesis, text-to-3d, image diffusion models, multi-view consistency, 3d content generation
2403.17765 Report MUTE-SLAM: Real-Time Neural SLAM with Multiple Tri-Plane Hash Representations Yifan Yan, Ruomin He, Zhenghua Liu We introduce MUTE-SLAM, a real-time neural RGB-D SLAM system employing multiple tri-plane hash-encodings for efficient scene representation. MUTE-SLAM effectively tracks camera positions and incrementally builds a scalable multi-map representation for both small and large indoor environments. It dynamically allocates sub-maps for newly observed local regions, enabling constraint-free mapping without prior scene information. Unlike traditional grid-based methods, we use three orthogonal axis-aligned planes for hash-encoding scene properties, significantly reducing hash collisions and the number of trainable parameters. This hybrid approach not only speeds up convergence but also enhances the fidelity of surface reconstruction. Furthermore, our optimization strategy concurrently optimizes all sub-maps intersecting with the current camera frustum, ensuring global consistency. Extensive testing on both real-world and synthetic datasets has shown that MUTE-SLAM delivers state-of-the-art surface reconstruction quality and competitive tracking performance across diverse indoor settings. The code will be made public upon acceptance of the paper. MUTE-SLAM, a real-time neural RGB-D SLAM system using multiple tri-plane hash-encodings for efficient and scalable scene representation, enabling detailed mapping in unknown indoor environments. Existing neural implicit SLAM methods struggle with scalability and often require pre-defined scene boundaries, limiting their use in large and unknown environments. The system dynamically allocates sub-maps with tri-plane hash-encoding for new regions. It jointly optimizes all currently observed sub-maps and camera poses, and employs global bundle adjustment for consistency. Achieves state-of-the-art surface reconstruction quality on Replica, surpassing baselines in detail preservation. Demonstrates competitive tracking performance on ScanNet and TUM-RGBD, outperforming some methods even without pre-defined boundaries. Exhibits strong scalability on the large-scale Apartment dataset, maintaining efficient run-time performance. Remains sensitive to illumination changes and depth measurement inaccuracies inherent to RGB-D sensors. Global bundle adjustment, based on random keyframe sampling, may inadequately optimize less frequently observed areas, potentially impacting reconstruction in those regions. slam, neural implicit representation, tri-plane encoding, hash-encoding, multi-map representation
2403.17695 Report PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition Chenhongyi Yang, Zehui Chen, Miguel Espinosa, Linus Ericsson, Zhenyu Wang, Jiaming Liu, Elliot J. Crowley We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba This work introduces PlainMamba, a simple non-hierarchical State Space Model (SSM) for visual recognition that enhances the selective scanning process of Mamba for 2D image data processing. Plain non-hierarchical visual encoders like ViT are favored for their simplicity and widespread adoption in vision foundation models, offering ease of feature integration across levels and modalities, scalability, and hardware optimization. PlainMamba replaces hierarchical structures with identical blocks of constant width, eliminating the need for special tokens. It introduces 'Continuous 2D Scanning' for spatial continuity and 'Direction-Aware Updating' to encode directional information in selective scanning. PlainMamba outperforms non-hierarchical counterparts, including SSMs and Transformers, on ImageNet1K classification, COCO object detection/instance segmentation, and ADE20K semantic segmentation. The model shows competitive performance compared to hierarchical models while maintaining simplicity. PlainMamba exhibits high efficiency with high-resolution inputs, requiring significantly less computation than ViTs in such cases. The model's performance slightly lags behind hierarchical models on tasks that benefit from multi-resolution architectures. Future work could explore enhancements in efficiency for low-resolution inputs to match ViT's performance in that domain. state space models, visual recognition, non-hierarchical architecture, continuous 2d scanning, direction-aware updating
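A simplified version of the continuous 2D scanning idea, a zig-zag ordering in which consecutive tokens are always spatial neighbors, together with a crude per-step direction label as a stand-in for direction-aware updating; PlainMamba's actual blocks use several scanning paths and learned direction encodings.

```python
import numpy as np

def continuous_2d_scan(h, w):
    """Row-wise zig-zag ('snake') ordering of an h x w token grid.

    Unlike plain raster order, every consecutive pair of tokens in the
    returned sequence is spatially adjacent, which is the spatial-continuity
    property the scan is after. Returns flat token indices and, for each
    step, the movement direction used to reach that token.
    """
    order, directions = [], []
    for row in range(h):
        cols = range(w) if row % 2 == 0 else range(w - 1, -1, -1)
        for j, col in enumerate(cols):
            order.append(row * w + col)
            if not directions:
                directions.append("start")
            elif j == 0:
                directions.append("down")                   # stepped to the next row
            else:
                directions.append("right" if row % 2 == 0 else "left")
    return np.array(order), directions

if __name__ == "__main__":
    order, dirs = continuous_2d_scan(4, 4)
    print(order)       # 0 1 2 3 7 6 5 4 8 9 10 11 15 14 13 12
    print(dirs[:6])    # ['start', 'right', 'right', 'right', 'down', 'left']
```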
2403.17638 Report Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency Yingjie Xu, Bangzhen Liu, Hao Tang, Bailin Deng, Shengfeng He We propose a voxel-based optimization framework, ReVoRF, for few-shot radiance fields that strategically address the unreliability in pseudo novel view synthesis. Our method pivots on the insight that relative depth relationships within neighboring regions are more reliable than the absolute color values in disoccluded areas. Consequently, we devise a bilateral geometric consistency loss that carefully navigates the trade-off between color fidelity and geometric accuracy in the context of depth consistency for uncertain regions. Moreover, we present a reliability-guided learning strategy to discern and utilize the variable quality across synthesized views, complemented by a reliability-aware voxel smoothing algorithm that smoothens the transition between reliable and unreliable data patches. Our approach allows for a more nuanced use of all available data, promoting enhanced learning from regions previously considered unsuitable for high-quality reconstruction. Extensive experiments across diverse datasets reveal that our approach attains significant gains in efficiency and accuracy, delivering rendering speeds of 3 FPS, 7 mins to train a 360° scene, and a 5% improvement in PSNR over existing few-shot methods. Code is available at https://github.com/HKCLynn/ReVoRF. This paper presents ReVoRF, a voxel-based optimization framework for fast few-shot radiance field reconstruction that leverages the relative depth information within unreliable regions of synthesized novel views, enabling enhanced multi-view consistency learning. Few-shot NeRF methods struggle to maintain geometric and texture accuracy due to the sparsity of input views. Utilizing unreliable areas in synthesized views, which contain relative depth information, can enhance multi-view consistency and improve reconstruction quality. The method involves: 1) Synthesizing novel views from sparse inputs using depth-guided warping. 2) Identifying reliable and unreliable regions in warped views based on pixel correlation. 3) Introducing a bilateral geometric consistency loss that leverages color and density for reliable regions and relative depth for unreliable ones. 4) Employing a reliability-aware voxel smoothing procedure and a learning strategy that prioritizes reliable areas during training. ReVoRF achieves state-of-the-art accuracy in PSNR and LPIPS on the Realistic Synthetic 360° dataset. It demonstrates superior performance in capturing fine details and preserving structural integrity compared to existing methods on both synthetic and real-world datasets. The method achieves fast reconstruction, with rendering speeds of 3 FPS and a training time of 7 minutes for a 360° scene. The voxel-based nature of ReVoRF can lead to the smoothing of fine details in the reconstructed scenes. The method's performance in highly complex and large-scale scenes remains to be explored. neural radiance fields, few-shot learning, view synthesis, 3d reconstruction, unreliability modeling
2403.17465 Report LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection Yunpeng Luo, Junlong Du, Ke Yan, Shouhong Ding The evolution of Diffusion Models has dramatically improved image generation quality, making it increasingly difficult to differentiate between real and generated images. This development, while impressive, also raises significant privacy and security concerns. In response to this, we propose a novel Latent REconstruction error guided feature REfinement method (LaRE^2) for detecting the diffusion-generated images. We come up with the Latent Reconstruction Error (LaRE), the first reconstruction-error based feature in the latent space for generated image detection. LaRE surpasses existing methods in terms of feature extraction efficiency while preserving crucial cues required to differentiate between the real and the fake. To exploit LaRE, we propose an Error-Guided feature REfinement module (EGRE), which can refine the image feature guided by LaRE to enhance the discriminativeness of the feature. Our EGRE utilizes an align-then-refine mechanism, which effectively refines the image feature for generated-image detection from both spatial and channel perspectives. Extensive experiments on the large-scale GenImage benchmark demonstrate the superiority of our LaRE^2, which surpasses the best SoTA method by up to 11.9%/12.1% average ACC/AP across 8 different image generators. LaRE also surpasses existing methods in terms of feature extraction cost, delivering an impressive speed enhancement of 8 times. This paper introduces LaRE², a novel method for detecting diffusion-generated images using latent reconstruction errors and an error-guided feature refinement module. The rise of highly realistic diffusion models necessitates robust detection methods to address privacy and security concerns arising from the potential misuse of generated images. LaRE² extracts Latent Reconstruction Error (LaRE) in the latent space through single-step reconstruction. Then, it uses an Error-guided Feature REfinement module (EGRE) to refine image features spatially and channel-wise based on LaRE, improving discriminative capability for generated image detection. LaRE² significantly outperforms existing methods, achieving up to 11.9%/12.1% ACC/AP gain on the large-scale GenImage benchmark. LaRE feature extraction is 8 times faster than previous reconstruction-based methods. Ablation studies confirm the effectiveness of EGRE and the robustness of LaRE² to hyperparameter choices like noise ensemble size and sample step. The model's generalizability to entirely unseen diffusion models or future, more advanced generative models needs further investigation. Further research can explore incorporating class-specific prompts or leveraging textual information for more informative LaRE extraction. diffusion model, image generation, image forensics, reconstruction error, feature refinement
2403.17422 Report InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion Jihyun Lee, Shunsuke Saito, Giljoo Nam, Minhyuk Sung, Tae-Kyun Kim We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction. Sampling from our model yields plausible and diverse two-hand shapes in close interaction with or without an object. Our prior can be incorporated into any optimization or learning methods to reduce ambiguity in an ill-posed setup. Our key observation is that directly modeling the joint distribution of multiple instances imposes high learning complexity due to its combinatorial nature. Thus, we propose to decompose the modeling of joint distribution into the modeling of factored unconditional and conditional single instance distribution. In particular, we introduce a diffusion model that learns the single-hand distribution unconditional and conditional to another hand via conditioning dropout. For sampling, we combine anti-penetration and classifier-free guidance to enable plausible generation. Furthermore, we establish the rigorous evaluation protocol of two-hand synthesis, where our method significantly outperforms baseline generative models in terms of plausibility and diversity. We also demonstrate that our diffusion prior can boost the performance of two-hand reconstruction from monocular in-the-wild images, achieving new state-of-the-art accuracy. This paper introduces InterHandGen, a novel framework that learns a generative prior of two-hand interactions, enabling the generation of plausible and diverse two-hand shapes with or without an object. Modeling two-hand interactions is crucial for capturing human behavior, with applications in AR/VR and HCI. Existing methods primarily focus on reconstruction, while generative modeling remains underexplored. The framework decomposes the complex joint distribution of two hands into unconditional and conditional single-hand distributions, learned using a cascaded diffusion model with conditioning dropout. It employs anti-penetration and classifier-free guidance during inference to ensure plausibility and diversity. InterHandGen outperforms baseline generative models in terms of plausibility and diversity, as measured by newly introduced two-hand interaction generation metrics. The framework effectively generalizes to two-hand and object interactions, demonstrating superior performance on the ARCTIC dataset. Integrating the learned prior into downstream tasks, such as monocular two-hand reconstruction, results in improved accuracy, achieving state-of-the-art results. The current prior, while effective for two-hand interactions, does not yet offer a significant advantage as a universal hand prior across all hand-related tasks. Future work includes exploring temporal extensions for generating hand interaction sequences and expanding the framework to other interaction synthesis problems beyond hands. two-hand interaction, generative prior, diffusion model, cascaded inference, hand pose estimation
2403.17410 Report On permutation-invariant neural networks Masanari Kimura, Ryotaro Shimizu, Yuki Hirakawa, Ryosuke Goto, Yuki Saito Conventional machine learning algorithms have traditionally been designed under the assumption that input data follows a vector-based format, with an emphasis on vector-centric paradigms. However, as the demand for tasks involving set-based inputs has grown, there has been a paradigm shift in the research community towards addressing these challenges. In recent years, the emergence of neural network architectures such as Deep Sets and Transformers has presented a significant advancement in the treatment of set-based data. These architectures are specifically engineered to naturally accommodate sets as input, enabling more effective representation and processing of set structures. Consequently, there has been a surge of research endeavors dedicated to exploring and harnessing the capabilities of these architectures for various tasks involving the approximation of set functions. This comprehensive survey aims to provide an overview of the diverse problem settings and ongoing research efforts pertaining to neural networks that approximate set functions. By delving into the intricacies of these approaches and elucidating the associated challenges, the survey aims to equip readers with a comprehensive understanding of the field. Through this comprehensive perspective, we hope that researchers can gain valuable insights into the potential applications, inherent limitations, and future directions of set-based neural networks. Indeed, from this survey we gain two insights: i) Deep Sets and its variants can be generalized by differences in the aggregation function, and ii) the behavior of Deep Sets is sensitive to the choice of the aggregation function. From these observations, we show that Deep Sets, one of the well-known permutation-invariant neural networks, can be generalized in the sense of a quasi-arithmetic mean. This paper surveys neural network architectures for approximating set functions. It particularly highlights Deep Sets and its variants, emphasizing their generalization potential through different aggregation functions, especially quasi-arithmetic means. With the growing need to process set-based data in machine learning, understanding and improving neural networks capable of handling permutation-invariant inputs like sets is crucial. The paper reviews existing architectures like Deep Sets, PointNet, and Set Transformers, analyzing their strengths, limitations, and theoretical properties. It connects them through the lens of Janossy pooling and explores the impact of aggregation functions. It introduces "Hölder's Power Deep Sets", a novel generalization based on the power mean, and evaluates its performance on various datasets. Deep Sets, PointNet, and Set Transformers can be unified and analyzed under the framework of Janossy pooling. Theoretical analysis reveals limitations in Deep Sets' expressive power depending on latent space dimensionality and set size. Experiments show that Hölder's Power Deep Sets, with its power mean aggregation, can outperform standard Deep Sets and PointNet depending on the dataset and optimized power exponent. The paper primarily focuses on linear cases for Hölder's Power Deep Sets. Further investigation is needed for non-linear scenarios. While promising, the proposed generalization requires more extensive experimental validation and theoretical analysis across diverse datasets and tasks. set function approximation, permutation invariance, deep sets, janossy pooling, power mean
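The survey's generalization of Deep Sets via a quasi-arithmetic (power) mean aggregator can be written down directly. In the sketch below the tiny `phi`/`rho` networks are placeholders, and enforcing positive embeddings with a softplus is one possible way to keep the power mean well defined; p = 1 recovers sum/mean pooling (Deep Sets), large p approaches max pooling (PointNet-style), and p near 0 gives the geometric mean.

```python
import numpy as np

def power_mean(z, p, axis=0, eps=1e-8):
    """Quasi-arithmetic (Hölder / power) mean over the set axis.
    Elements are assumed positive (enforce with e.g. a softplus)."""
    z = np.maximum(z, eps)
    if abs(p) < 1e-6:                       # geometric mean as the p -> 0 limit
        return np.exp(np.mean(np.log(z), axis=axis))
    return np.mean(z ** p, axis=axis) ** (1.0 / p)

def deep_sets_forward(X, phi, rho, p=1.0):
    """Generalized Deep Sets: rho( M_p( { phi(x) : x in X } ) )."""
    Z = np.stack([phi(x) for x in X], axis=0)     # (set_size, d_latent), positive
    pooled = power_mean(Z, p, axis=0)             # permutation-invariant pooling
    return rho(pooled)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((1, 8))
    phi = lambda x: np.log1p(np.exp(W1 @ x))      # softplus keeps embeddings positive
    rho = lambda z: W2 @ z
    X = [rng.standard_normal(4) for _ in range(5)]
    for p in (-4.0, 0.0, 1.0, 4.0):               # different members of the family
        print(p, deep_sets_forward(X, phi, rho, p=p))
    # permutation invariance check
    print(np.allclose(deep_sets_forward(X, phi, rho, 2.0),
                      deep_sets_forward(X[::-1], phi, rho, 2.0)))
```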
2403.17377 Report Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, Seungryong Kim Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring. This paper introduces Perturbed-Attention Guidance (PAG), a novel sampling guidance method for diffusion models that enhances sample quality by perturbing the self-attention maps in the model's U-Net architecture. Existing guidance methods like Classifier-Free Guidance (CFG) rely on additional training or external modules, are not applicable for unconditional generation, and may decrease sample diversity. PAG addresses these limitations. PAG perturbs the self-attention maps in the diffusion U-Net by replacing them with identity matrices, disrupting structural information while preserving appearance. This perturbed output guides the denoising process towards more structurally coherent samples. PAG significantly improves FID and IS scores in both conditional and unconditional image generation with ADM and Stable Diffusion. PAG complements CFG, leading to further quality improvements when used together. PAG enhances performance in downstream tasks like image restoration (PSLD) and ControlNet with empty prompts, where CFG is not applicable. High guidance scales in PAG can lead to over-saturation, requiring careful scale calibration. PAG requires two forward passes per generation step, impacting computational efficiency. diffusion models, image generation, sampling guidance, self-attention, unconditional generation
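The two ingredients of PAG, substituting the self-attention map with the identity and guiding the prediction away from the resulting structurally degraded one, can be sketched as follows; in practice the two noise estimates come from two U-Net forward passes, and the guidance scale here is only a plausible default.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v, perturbed=False):
    """Scaled dot-product self-attention, with an option to replace the
    attention map by the identity matrix (each token attends only to itself),
    which is the structural perturbation used by PAG."""
    if perturbed:
        return v                                   # A = I, so output is just V
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def pag_guided_noise(eps_normal, eps_perturbed, scale=3.0):
    """Guide denoising away from the degraded prediction:
    eps_hat = eps_normal + s * (eps_normal - eps_perturbed)."""
    return eps_normal + scale * (eps_normal - eps_perturbed)

if __name__ == "__main__":
    q, k, v = torch.randn(3, 1, 64, 32)
    out_normal = self_attention(q, k, v)
    out_perturbed = self_attention(q, k, v, perturbed=True)
    # in a real sampler eps_normal / eps_perturbed are noise predictions from two
    # U-Net passes (ordinary vs. perturbed self-attention); shapes reused here
    eps_hat = pag_guided_noise(out_normal, out_perturbed)
    print(eps_hat.shape)
```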
2403.17237 Report DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion Yuanze Lin, Ronald Clark, Philip Torr We present DreamPolisher, a novel Gaussian Splatting based method with geometric guidance, tailored to learn cross-view consistency and intricate detail from textual descriptions. While recent progress on text-to-3D generation methods have been promising, prevailing methods often fail to ensure view-consistency and textural richness. This problem becomes particularly noticeable for methods that work with text input alone. To address this, we propose a two-stage Gaussian Splatting based approach that enforces geometric consistency among views. Initially, a coarse 3D generation undergoes refinement via geometric optimization. Subsequently, we use a ControlNet driven refiner coupled with the geometric consistency term to improve both texture fidelity and overall consistency of the generated 3D asset. Empirical evaluations across diverse textual prompts spanning various object categories demonstrate the efficacy of DreamPolisher in generating consistent and realistic 3D objects, aligning closely with the semantics of the textual instructions. DreamPolisher, a novel text-to-3D generation method based on 3D Gaussian Splatting, generates high-quality and view-consistent 3D assets from textual descriptions. Existing text-to-3D methods often struggle with view-consistency and lack intricate textural details. DreamPolisher addresses this gap by combining Gaussian Splatting with geometric diffusion and ControlNet refinement. Two-stage approach: 1) Coarse optimization learns coarse 3D Gaussians from text using a point cloud diffusion model and ISM loss. 2) Appearance refinement enhances texture and consistency using a ControlNet-driven refiner and a novel view-consistency loss. Significantly outperforms existing methods in visual quality and view consistency. Demonstrates robust generality across diverse object categories (food, vehicles, furniture, etc.). Generates high-fidelity 3D objects with fine details and accurate geometry. Current implementation requires 30 minutes generation time per object. Relies solely on text prompts, limiting the ability to guide generation using images. text-to-3d generation, gaussian splatting, geometric diffusion, controlnet, view consistency
2403.17213 Report AnimateMe: 4D Facial Expressions via Diffusion Models Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Stefanos Zafeiriou The field of photorealistic 3D avatar reconstruction and generation has garnered significant attention in recent years; however, animating such avatars remains challenging. Recent advances in diffusion models have notably enhanced the capabilities of generative models in 2D animation. In this work, we directly utilize these models within the 3D domain to achieve controllable and high-fidelity 4D facial animation. By integrating the strengths of diffusion processes and geometric deep learning, we employ Graph Neural Networks (GNNs) as denoising diffusion models in a novel approach, formulating the diffusion process directly on the mesh space and enabling the generation of 3D facial expressions. This facilitates the generation of facial deformations through a mesh-diffusion-based model. Additionally, to ensure temporal coherence in our animations, we propose a consistent noise sampling method. Under a series of both quantitative and qualitative experiments, we showcase that the proposed method outperforms prior work in 4D expression synthesis by generating high-fidelity extreme expressions. Furthermore, we applied our method to textured 4D facial expression generation, implementing a straightforward extension that involves training on a large-scale textured 4D facial expression database. Introduces AnimateMe, the first diffusion-based method for customizable 4D facial expression generation directly on the mesh space using Graph Neural Networks (GNNs) as denoising models. Addresses the limitations of prior 4D facial expression generation methods, particularly in producing high-fidelity extreme expressions and capturing fine details, by leveraging the power of diffusion models. Presents a novel mesh diffusion process using GNNs to capture mesh structure and introduces a consistent noise sampling strategy for smooth animations. The method is trained on deformations from a neutral mesh and conditioned on expression progression and intensity, enabling customization. Achieves state-of-the-art performance on 4D expression synthesis, outperforming previous methods in both quantitative metrics (classification accuracy, specificity) and qualitative evaluations. Successfully generates high-fidelity extreme expressions, a challenge that previous methods struggled with. Demonstrates the adaptability of the method by extending it to textured 4D animation on a large-scale dataset, showcasing its potential for realistic and detailed facial animation. Reliance on an expression progression signal for conditioning limits versatility. The diffusion-based approach can be computationally expensive, especially for high-resolution meshes, despite efforts to improve efficiency. 4d facial expression, diffusion models, graph neural networks, mesh animation, consistent noise sampling
2403.17064 Report Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Vincent Tao Hu, Björn Ommer In recent years, advances in text-to-image (T2I) diffusion models have substantially elevated the quality of their generated images. However, achieving fine-grained control over attributes remains a challenge due to the limitations of natural language prompts (such as no continuous set of intermediate descriptions existing between "person" and "old person"). Even though many methods were introduced that augment the model or generation process to enable such control, methods that do not require a fixed reference image are limited to either enabling global fine-grained attribute expression control or coarse attribute expression control localized to specific subjects, not both simultaneously. We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models. Based on this observation, we introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts. We demonstrate that these directions can be used to augment the prompt text input with fine-grained control over attributes of specific subjects in a compositional manner (control over multiple attributes of a single subject) without having to adapt the diffusion model. Project page: https://compvis.github.io/attribute-control. Code is available at https://github.com/CompVis/attribute-control. This paper introduces a method for fine-grained, subject-specific attribute control in text-to-image (T2I) generation by identifying semantic directions in token-level CLIP text embeddings, allowing manipulation of attributes like age, style, and even vehicle price. Current methods for attribute control in T2I models either offer fine-grained global control or coarse subject-specific control, but not both. This work bridges this gap, enabling nuanced manipulation of specific subjects within complex scenes. The method involves two approaches: (1) an optimization-free method that computes differences between CLIP embeddings of contrasting prompts (e.g., "young person" vs. "old person") and (2) a robust learning-based method that trains edit deltas using contrastive prompts to guide a diffusion model's predictions. Identified directions in token-level CLIP embeddings effectively control attributes of specific subjects. Learned edit deltas capture semantic differences and are transferable across different prompts and subjects of similar categories. The method allows for compositional attribute editing, enabling control over multiple attributes of a single subject or different subjects within a scene. The approach is limited by the diffusion model's capacity to disentangle attributes, potentially leading to unwanted correlations. Future work could explore combining this method with complementary approaches to further reduce attribute mixing between subjects. text-to-image synthesis, diffusion models, attribute control, clip embeddings, semantic directions
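A sketch of the optimization-free variant: average the per-token embedding difference between contrastive prompts and add a scaled delta to the subject token at generation time. `encode_tokens` is a placeholder for the CLIP text encoder's per-token outputs, and locating the subject by exact word match is a simplification of how the relevant token would be found in practice.

```python
import numpy as np

def attribute_direction(encode_tokens, pairs, subject_word):
    """Estimate a token-level edit direction from contrastive prompt pairs.

    `encode_tokens(prompt)` is assumed to return (list_of_word_tokens,
    array of per-token text embeddings); `pairs` is a list of
    (neutral_prompt, attribute_prompt) strings differing only in the
    attribute, e.g. ("a photo of a person", "a photo of an old person").
    The direction is the average difference of the subject token's embedding.
    """
    deltas = []
    for neutral, modified in pairs:
        toks_n, emb_n = encode_tokens(neutral)
        toks_m, emb_m = encode_tokens(modified)
        deltas.append(emb_m[toks_m.index(subject_word)]
                      - emb_n[toks_n.index(subject_word)])
    return np.mean(deltas, axis=0)

def apply_direction(tokens, embeddings, subject_word, delta, strength):
    """Shift only the subject token's embedding; other tokens (and other
    subjects in the prompt) stay untouched, which is what makes the control
    subject-specific, and `strength` sweeps the attribute continuously."""
    out = embeddings.copy()
    out[tokens.index(subject_word)] += strength * delta
    return out
```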
2403.17008 Report FlashFace: Human Image Personalization with High-fidelity Identity Preservation Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt. Our approach is distinguishable from existing human photo customization methods by higher-fidelity identity preservation and better instruction following, benefiting from two subtle designs. First, we encode the face identity into a series of feature maps instead of one image token as in prior arts, allowing the model to retain more details of the reference faces (e.g., scars, tattoos, and face shape). Second, we introduce a disentangled integration strategy to balance the text and image guidance during the text-to-image generation process, alleviating the conflict between the reference faces and the text prompts (e.g., personalizing an adult into a "child" or an "elder"). Extensive experimental results demonstrate the effectiveness of our method on various applications, including human image personalization, face swapping under language prompts, making virtual characters into real people, etc. Project Page: https://jshilong.github.io/flashface-page. This paper introduces FlashFace, a novel approach for human image personalization that preserves high-fidelity facial identity and follows text prompts effectively. Existing methods for human image customization often struggle to balance preserving detailed facial features with accurately following text instructions, particularly when there's a conflict between the two. The paper proposes two key innovations: 1) encoding reference faces into feature maps instead of tokens for detailed preservation and 2) a disentangled integration strategy to balance reference image and text prompt influence during generation. They also introduce a new ID dataset construction pipeline for training. FlashFace demonstrates superior identity preservation while effectively incorporating text prompts, even with conflicting instructions (e.g., changing age or gender). Increasing the number of reference images significantly improves identity fidelity, as evidenced by quantitative metrics and visual comparisons. Ablation studies highlight the importance of reference attention layer placement in the U-Net decoder and the role of reference strength parameters for fine-tuning generation. The method may still produce artifacts in some generated images, suggesting limitations in the base model's capabilities. Controlling head pose through text prompts remains challenging, indicating an area for future improvement in controllability. human image personalization, identity preservation, text-to-image generation, disentangled representation learning, reference attention
2403.17007 Report DreamLIP: Language-Image Pre-training with Long Captions Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, Yujun Shen Language-image pre-training largely relies on how precisely and thoroughly a text describes its paired image. In practice, however, the contents of an image can be so rich that well describing them requires lengthy captions (e.g., with 10 sentences), which are usually missing in existing datasets. Consequently, there are currently no clear evidences on whether and how language-image pre-training could benefit from long captions. To figure this out, we first re-caption 30M images with detailed descriptions using a pre-trained Multi-modality Large Language Model (MLLM), and then study the usage of the resulting captions under a contrastive learning framework. We observe that, each sentence within a long caption is very likely to describe the image partially (e.g., an object). Motivated by this, we propose to dynamically sample sub-captions from the text label to construct multiple positive pairs, and introduce a grouping loss to match the embeddings of each sub-caption with its corresponding local image patches in a self-supervised manner. Experimental results on a wide range of downstream tasks demonstrate the consistent superiority of our method, termed DreamLIP, over previous alternatives, highlighting its fine-grained representational capacity. It is noteworthy that, on the tasks of image-text retrieval and semantic segmentation, our model trained with 30M image-text pairs achieves on par or even better performance than CLIP trained with 400M pairs. Project page is available at https://zyf0619sjtu.github.io/dream-lip. This paper studies the use of long captions generated by a pre-trained Multi-modality Large Language Model (MLLM) for improving language-image pre-training. Existing language-image pre-training datasets use short captions that fail to capture the richness of real-world images. Long captions can provide a more detailed and comprehensive description, unlocking new potential for semantic understanding. The authors re-caption 30M images with detailed descriptions using a pre-trained MLLM. They propose DreamLIP, a framework that dynamically samples sub-captions from the long captions to create multiple positive image-text pairs and utilizes a grouping loss to align sub-captions with their corresponding local image patches. DreamLIP consistently outperforms previous state-of-the-art methods on a wide range of downstream tasks including image-text retrieval, semantic segmentation, and image recognition. Notably, DreamLIP trained on 30M image-text pairs achieves comparable or even superior performance to CLIP trained on 400M pairs for certain tasks. Analysis demonstrates the effectiveness of long captions and the proposed sampling and alignment strategy in enhancing fine-grained representation learning. The work relies on the quality of the MLLM-generated captions, which can be prone to hallucinations. Future work could explore methods to mitigate the impact of potential hallucinations in long captions. language-image pre-training, long captions, multi-modal learning, contrastive learning, fine-grained representation
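The sub-caption sampling plus multi-positive contrastive objective can be sketched as below; the sentence splitting, the number of sampled sub-captions, and the symmetric loss are illustrative, and DreamLIP's additional grouping loss that matches sub-captions to local image patches is not shown.

```python
import torch
import torch.nn.functional as F

def sample_subcaptions(long_caption, k=3):
    """Split an MLLM-generated long caption into sentences and sample k of
    them; each sampled sub-caption becomes its own positive text for the
    paired image."""
    sentences = [s.strip() for s in long_caption.split(".") if s.strip()]
    idx = torch.randperm(len(sentences))[:min(k, len(sentences))]
    return [sentences[i] for i in idx]

def multi_positive_contrastive(img_emb, txt_emb, img_ids, temperature=0.07):
    """InfoNCE-style loss where several texts (sub-captions) share one image.
    img_emb: (B, d) image embeddings, txt_emb: (M, d) sub-caption embeddings,
    img_ids: (M,) index of the paired image for each sub-caption."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = txt_emb @ img_emb.t() / temperature        # (M, B)
    loss_t2i = F.cross_entropy(logits, img_ids)         # each text -> its image
    # image-to-text direction: average over all positive sub-captions per image
    log_prob = F.log_softmax(logits.t(), dim=-1)        # (B, M)
    pos = (img_ids.unsqueeze(0) == torch.arange(img_emb.size(0)).unsqueeze(1)).float()
    loss_i2t = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss_t2i + loss_i2t.mean()

if __name__ == "__main__":
    caps = sample_subcaptions("A dog runs on grass. The sky is blue. A red ball lies nearby.")
    img = torch.randn(4, 128)                            # 4 images in the batch
    ids = torch.tensor([0, 0, 1, 2, 3, 3])               # sub-captions 0-1 describe image 0, etc.
    txt = torch.randn(len(ids), 128)
    print(caps, multi_positive_contrastive(img, txt, ids).item())
```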
2403.17005 Report TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, Tao Mei Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate static image (i.e., image-to-video generation). The difficulty originates from the aspect that the diffusion process of subsequent animated frames should not only preserve the faithful alignment with the given image but also pursue temporal coherence among adjacent frames. To alleviate this, we present TRIP, a new recipe of image-to-video diffusion paradigm that pivots on image noise prior derived from static image to jointly trigger inter-frame relational reasoning and ease the coherent temporal modeling via temporal residual learning. Technically, the image noise prior is first attained through one-step backward diffusion process based on both static image and noised video latent codes. Next, TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes image noise prior as the reference noise of each frame to amplify the alignment between the first frame and subsequent frames; 2) a residual path that employs 3D-UNet over noised video and static image latent codes to enable inter-frame relational reasoning, thereby easing the learning of the residual noise for each frame. Furthermore, both reference and residual noise of each frame are dynamically merged via attention mechanism for final video generation. Extensive experiments on WebVid-10M, DTDB and MSR-VTT datasets demonstrate the effectiveness of our TRIP for image-to-video generation. Please see our project page at https://trip-i2v.github.io/TRIP/. Presents TRIP, a novel image-to-video diffusion model that leverages temporal residual learning with image noise prior for coherent video generation. Addresses the challenge of maintaining temporal coherence and alignment with the input image in image-to-video generation. Calculates image noise prior from the input image and noisy video latent codes, then uses it as reference for residual noise prediction via a dual-path scheme with a 3D-UNet and a Transformer-based temporal noise fusion module. Achieves state-of-the-art performance on WebVid-10M, DTDB, and MSR-VTT datasets, demonstrating superior temporal coherence and visual quality. Significantly outperforms baselines in terms of frame consistency (F-Consistency) and Frechet Video Distance (FVD). Shows strong generalization ability for customized image animation, enabling text-to-video generation and integration with image editing models. Current implementation focuses on generating relatively short video clips. Exploring more sophisticated noise scheduling and sampling strategies for further quality improvement. image-to-video generation, diffusion models, temporal residual learning, image noise prior, temporal coherence
2403.17004 Report SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen Diffusion Transformer (DiT) has emerged as the new trend of generative diffusion models on image generation. In view of extremely slow convergence in typical DiT, recent breakthroughs have been driven by mask strategy that significantly improves the training efficiency of DiT with additional intra-image contextual learning. Despite this progress, mask strategy still suffers from two inherent limitations: (a) training-inference discrepancy and (b) fuzzy relations between mask reconstruction & generative diffusion process, resulting in sub-optimal training of DiT. In this work, we address these limitations by novelly unleashing the self-supervised discrimination knowledge to boost DiT training. Technically, we frame our DiT in a teacher-student manner. The teacher-student discriminative pairs are built on the diffusion noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). Instead of applying mask reconstruction loss over both DiT encoder and decoder, we decouple DiT encoder and decoder to separately tackle discriminative and generative objectives. In particular, by encoding discriminative pairs with student and teacher DiT encoders, a new discriminative loss is designed to encourage the inter-image alignment in the self-supervised embedding space. After that, student samples are fed into student DiT decoder to perform the typical generative diffusion task. Extensive experiments are conducted on ImageNet dataset, and our method achieves a competitive balance between training cost and generative capacity. This paper proposes SD-DiT, a Diffusion Transformer architecture that leverages self-supervised discrimination knowledge distillation to enhance training efficiency and generative capacity. Existing Diffusion Transformers suffer from slow convergence and limitations in mask strategies. This paper addresses these by introducing a novel approach to mask modeling based on self-supervised discrimination. SD-DiT employs a teacher-student scheme with decoupled encoder-decoder structure. The teacher branch provides discriminative knowledge to the student branch, enhancing the generative diffusion process in the student branch. This is achieved through a novel discriminative loss that encourages inter-image alignment between teacher and student encoders. SD-DiT achieves a better balance between training speed and generative performance compared to state-of-the-art DiT models. SD-DiT demonstrates superior FID scores compared to other DiT-based methods, especially with larger scale backbones. SD-DiT shows faster convergence speed, achieving comparable performance to other models with significantly fewer training steps. The paper mainly focuses on image generation and evaluation on a single dataset (ImageNet). Exploring different self-supervised learning techniques beyond the teacher-student scheme could be a potential future direction. diffusion models, diffusion transformer, self-supervised learning, image generation, mask modeling
2403.16999 Report Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought (CoT) reasoning. While MLLMs have shown promise in various visual tasks, they often lack interpretability and struggle with complex visual inputs. To address these challenges, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We collect and introduce the Visual CoT dataset comprising 373k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Importantly, the introduced benchmark is capable of evaluating MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are available to foster further research in this direction. This paper introduces Visual CoT, a novel pipeline and dataset for enhancing Multi-Modal Large Language Models (MLLMs) with visual Chain-of-Thought (CoT) reasoning, improving their interpretability and ability to process complex visual inputs. Existing MLLMs often lack interpretability and struggle with dynamic, multi-turn visual reasoning, hindering their efficacy in complex tasks. The authors curate a Visual CoT dataset with 373k question-answer pairs annotated with bounding boxes highlighting key regions. They propose a multi-turn pipeline where the MLLM first identifies the key region, then uses both original and localized information for reasoning. Visual CoT significantly improves performance on document/text-related tasks and high-resolution image processing. The model achieves competitive results on various multi-modal benchmarks, demonstrating enhanced visual understanding. Visual CoT outperforms previous state-of-the-art models on visual grounding benchmarks, highlighting its effectiveness in locating and understanding objects. The model may struggle to identify the most relevant region in images with extensive information or complex questions. Future work can explore incorporating more sophisticated visual reasoning modules and extending the approach to other multi-modal tasks. multi-modal language models, chain-of-thought reasoning, visual reasoning, interpretability, visual grounding
2403.16990 Report Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts. Introduces Bounded Attention, a training-free method to improve semantic fidelity in multi-subject image generation with diffusion models by bounding information flow during sampling to mitigate semantic leakage between subjects. Existing text-to-image diffusion models struggle to accurately generate scenes with multiple subjects, especially when they are semantically or visually similar due to attention mechanisms blending features and causing semantic leakage. Bounded Attention operates in two modes: (1) Bounded Guidance: Backpropagates through the model to steer the latent signal towards desired layout using a loss based on attention map concentration. (2) Bounded Denoising: Uses masks to restrict attention and reduce semantic leakage during the denoising process, refined in later stages using self-attention map clustering. Bounded Attention successfully generates multiple subjects with distinct features, even with complex layouts and semantically similar subjects. Outperforms baselines, including training-based methods, in qualitative comparisons demonstrating reduced semantic leakage and improved layout fidelity. Shows significant improvement in quantitative evaluation on DrawBench dataset, particularly in counting accuracy and spatial precision. Residual leakage persists due to imperfect optimization during guidance and segmentation inaccuracies. Success is contingent on the match between seed and layout, necessitating future work on seed generation tailored to layouts. text-to-image generation, diffusion models, semantic leakage, layout-to-image synthesis, bounded attention
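The attention-bounding idea can be illustrated with a short Python sketch (an assumption-laden toy, not the paper's implementation inside a diffusion U-Net): each spatial token carries a subject id, and attention logits between tokens of different subjects are suppressed so features cannot leak across subjects.

```python
# Minimal sketch of Bounded-Denoising-style attention masking. Assumptions:
# flattened spatial tokens, one integer subject id per token (0 = background),
# and a single-head attention; the real method applies this inside the
# diffusion model's attention layers with layout-derived masks.
import numpy as np

def bounded_self_attention(q, k, v, subject_ids):
    """q, k, v: (tokens, dim); subject_ids: (tokens,) ints, 0 = background."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                       # (tokens, tokens)
    same = subject_ids[:, None] == subject_ids[None, :]
    background = subject_ids[None, :] == 0
    allowed = same | background                         # subjects may still see background
    logits = np.where(allowed, logits, -1e9)            # bound cross-subject information flow
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

tokens, dim = 16, 8
out = bounded_self_attention(*np.random.randn(3, tokens, dim),
                             np.array([0] * 8 + [1] * 4 + [2] * 4))
print(out.shape)  # (16, 8)
```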
2403.16954 Report Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance Jingyuan Zhu, Huimin Ma, Jiansheng Chen, Jian Yuan Large-scale text-to-image diffusion models have achieved great success in synthesizing high-quality and diverse images given target text prompts. Despite the revolutionary image generation ability, current state-of-the-art models still struggle to deal with multi-concept generation accurately in many cases. This phenomenon is known as "concept bleeding" and displays as the unexpected overlapping or merging of various concepts. This paper presents a general approach for text-to-image diffusion models to address the mutual interference between different subjects and their attachments in complex scenes, pursuing better text-image consistency. The core idea is to isolate the synthesizing processes of different concepts. We propose to bind each attachment to corresponding subjects separately with split text prompts. Besides, we introduce a revision method to fix the concept bleeding problem in multi-subject synthesis. We first depend on pre-trained object detection and segmentation models to obtain the layouts of subjects. Then we isolate and resynthesize each subject individually with corresponding text prompts to avoid mutual interference. Overall, we achieve a training-free strategy, named Isolated Diffusion, to optimize multi-concept text-to-image synthesis. It is compatible with the latest Stable Diffusion XL (SDXL) and prior Stable Diffusion (SD) models. We compare our approach with alternative methods using a variety of multi-concept text prompts and demonstrate its effectiveness with clear advantages in text-image consistency and user study. Introduces Isolated Diffusion, a training-free method to address the "concept bleeding" problem in multi-concept text-to-image generation with Stable Diffusion models. Current text-to-image models struggle to maintain text-image consistency when generating images with multiple concepts, often leading to overlapping or merging of concepts. Isolates the denoising processes for different concepts using split text prompts. Employs pre-trained object detection (YOLO) and segmentation (SAM) models to identify and revise concepts in generated images. Achieves accurate assignment of attributes to multiple attachments within an image. Effectively revises images to separate and accurately depict multiple subjects, avoiding concept merging. Outperforms existing methods in maintaining text-image consistency, as demonstrated by qualitative and quantitative evaluations and user studies. Relies on successful subject detection by YOLO, which may fail with unseen objects. Cannot correct for missing subjects if the initial generation by SD models is incomplete. text-to-image generation, diffusion models, multi-concept generation, concept bleeding, stable diffusion
2403.16897 Report Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text Junshu Tang, Yanhong Zeng, Ke Fan, Xuheng Wang, Bo Dai, Kai Chen, Lizhuang Ma Creating and animating 3D biped cartoon characters is crucial and valuable in various applications. Compared with geometry, the diverse texture design plays an important role in making 3D biped cartoon characters vivid and charming. Therefore, we focus on automatic texture design for cartoon characters based on input instructions. This is challenging for domain-specific requirements and a lack of high-quality data. To address this challenge, we propose Make-It-Vivid, the first attempt to enable high-quality texture generation from text in UV space. We prepare a detailed text-texture paired data for 3D characters by using vision-question-answering agents. Then we customize a pretrained text-to-image model to generate texture map with template structure while preserving the natural 2D image knowledge. Furthermore, to enhance fine-grained details, we propose a novel adversarial learning scheme to shorten the domain gap between original dataset and realistic texture domain. Extensive experiments show that our approach outperforms current texture generation methods, resulting in efficient character texturing and faithful generation with prompts. Besides, we showcase various applications such as out of domain generation and texture stylization. We also provide an efficient generation system for automatic text-guided textured character generation and animation. Make-It-Vivid is the first attempt to generate high-quality textures in UV space for 3D biped cartoon characters from text input. Texture design is crucial for creating vivid and charming 3D cartoon characters but current methods struggle with domain-specific requirements and limited high-quality data. The authors 1) use vision-question-answering agents to create a text-texture paired dataset, 2) customize a pretrained text-to-image diffusion model to generate texture maps, and 3) introduce adversarial training to enhance fine-grained details. Outperforms existing texture generation methods in quality and text-fidelity. Enables out-of-domain generation and texture stylization. Supports efficient text-guided character generation and animation. Limited to 3D models with pre-defined topology and UV maps. Future work includes exploring automated meshing and texturing for arbitrary 3D models. text-guided texture generation, 3d cartoon characters, uv space, diffusion models, adversarial training
2403.16885 Report CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs Yingji Zhong, Lanqing Hong, Zhenguo Li, Dan Xu Neural Radiance Fields (NeRF) have shown impressive capabilities for photorealistic novel view synthesis when trained on dense inputs. However, when trained on sparse inputs, NeRF typically encounters issues of incorrect density or color predictions, mainly due to insufficient coverage of the scene causing partial and sparse supervision, thus leading to significant performance degradation. While existing works mainly consider ray-level consistency to construct 2D learning regularization based on rendered color, depth, or semantics on image planes, in this paper we propose a novel approach that models 3D spatial field consistency to improve NeRF's performance with sparse inputs. Specifically, we first adopt a voxel-based ray sampling strategy to ensure that the sampled rays intersect with a certain voxel in 3D space. We then randomly sample additional points within the voxel and apply a Transformer to infer the properties of other points on each ray, which are then incorporated into the volume rendering. By backpropagating through the rendering loss, we enhance the consistency among neighboring points. Additionally, we propose to use a contrastive loss on the encoder output of the Transformer to further improve consistency within each voxel. Experiments demonstrate that our method yields significant improvement over different radiance fields in the sparse inputs setting, and achieves comparable performance with current works. This paper proposes CVT-xRF, a novel approach to improve Neural Radiance Fields (NeRF) performance with sparse inputs by modeling 3D spatial field consistency. NeRF typically struggles with sparse inputs, leading to inaccurate density and color predictions and degraded performance, particularly due to insufficient scene coverage and sparse supervision. The method employs a voxel-based ray sampling strategy and introduces a Contrastive In-Voxel Transformer (CVT) structure with local implicit and global explicit constraints to enforce 3D field consistency during training. CVT-xRF significantly improves performance over various NeRF baselines, achieving state-of-the-art results on DTU and Synthetic datasets. The approach enhances 3D field consistency, evidenced by reduced floating artifacts and better object detail recovery in rendered images. CVT-xRF demonstrates fast convergence speed and learns more discriminative features compared to baseline models. The performance improvement of object-level LPIPS evaluation for 6/9-view inputs on the DTU dataset is less pronounced. Future work includes exploring alternative sampling methods for encoding local context within voxels beyond sphere and line sampling. neural radiance fields, novel view synthesis, sparse input, 3d field consistency, contrastive learning
2403.16848 Report Multiple Object Tracking as ID Prediction Ruopeng Gao, Yijun Zhang, Limin Wang In Multiple Object Tracking (MOT), tracking-by-detection methods have stood the test for a long time, which split the process into two parts according to the definition: object detection and association. They leverage robust single-frame detectors and treat object association as a post-processing step through hand-crafted heuristic algorithms and surrogate tasks. However, the nature of heuristic techniques prevents end-to-end exploitation of training data, leading to increasingly cumbersome and challenging manual modification while facing complicated or novel scenarios. In this paper, we regard this object association task as an End-to-End in-context ID prediction problem and propose a streamlined baseline called MOTIP. Specifically, we form the target embeddings into historical trajectory information while considering the corresponding IDs as in-context prompts, then directly predict the ID labels for the objects in the current frame. Thanks to this end-to-end process, MOTIP can learn tracking capabilities straight from training data, freeing itself from burdensome hand-crafted algorithms. Without bells and whistles, our method achieves impressive state-of-the-art performance in complex scenarios like DanceTrack and SportsMOT, and it performs competitively with other transformer-based methods on MOT17. We believe that MOTIP demonstrates remarkable potential and can serve as a starting point for future research. The code is available at https://github.com/MCG-NJU/MOTIP. Presents MOTIP, a novel multiple object tracking system that formulates object association as an end-to-end ID prediction problem. Existing tracking-by-detection methods rely on hand-crafted heuristics and surrogate tasks, while tracking-by-query methods suffer from training-inference discrepancy and potential conflicts between detection and association. MOTIP overcomes these limitations with a streamlined and end-to-end trainable pipeline. MOTIP leverages a DETR detector, a learnable ID dictionary, and a transformer-based ID Decoder. Object embeddings from DETR and learnable ID embeddings are combined to form historical trajectories. The ID Decoder then predicts ID labels for new detections based on these trajectories. Achieves state-of-the-art performance on DanceTrack and SportsMOT, outperforming previous methods by a significant margin. Performs competitively with other transformer-based methods on MOT17. Ablation studies validate the effectiveness of each component and the advantages of the ID prediction pipeline. Lacks explicit motion modeling which can be crucial in crowded scenes. Trajectory representation could be further improved with more sophisticated temporal modeling. multiple object tracking, tracking-by-detection, end-to-end tracking, id prediction, transformer
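A toy Python sketch of the in-context ID prediction idea follows (the single attention step, the dimensions, and the classification rule are assumptions, not the released MOTIP model): historical detections carry learnable ID embeddings as in-context prompts, and each current-frame detection is classified into one of the ID slots.

```python
# Toy sketch of MOTIP-style in-context ID prediction. The ID dictionary,
# one-layer cross-attention, and scoring against the ID embeddings are
# simplifications of the paper's transformer-based ID Decoder.
import numpy as np

rng = np.random.default_rng(0)
num_ids, dim = 5, 16
id_embed = rng.normal(size=(num_ids, dim))              # learnable ID dictionary

def predict_ids(history_feat, history_ids, current_feat):
    """history_feat: (H, dim) past detection embeddings with known ids,
    current_feat: (C, dim) detection embeddings in the current frame."""
    prompts = history_feat + id_embed[history_ids]       # trajectory tokens = feature + ID prompt
    attn = current_feat @ prompts.T / np.sqrt(dim)
    attn = np.exp(attn - attn.max(1, keepdims=True))
    attn /= attn.sum(1, keepdims=True)
    context = attn @ prompts                             # in-context aggregation over history
    logits = context @ id_embed.T                        # score against each ID slot
    return logits.argmax(1)                              # predicted ID per current detection

ids = predict_ids(rng.normal(size=(6, dim)), np.array([0, 1, 2, 0, 1, 2]),
                  rng.normal(size=(3, dim)))
print(ids)
```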
2403.16530 Report An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models Zizhao Hu, Shaochong Jia, Mohammad Rostami Diffusion models have been widely used for conditional data cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics in a language such as object count, spatial relationship, etc. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies can affect vision-language alignment. We discover that compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can: (i) boost text-to-image alignment with improved generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We perform experiments using a text-to-image generation task on the MS-COCO dataset. We compare our intermediate fusion mechanism with the classic early fusion mechanism on two common conditioning methods on a U-shaped ViT backbone. Our intermediate fusion model achieves a higher CLIP Score and lower FID, with 20% reduced FLOPs, and 50% increased training speed compared to a strong U-ViT baseline with an early fusion. This paper introduces an intermediate fusion mechanism for text-to-image diffusion models that improves text-image alignment and efficiency compared to the commonly used early fusion method. Existing text-to-image diffusion models struggle to align generated images with high-level semantics in text and often introduce redundant computations due to early fusion of text embeddings. The authors propose a U-ViT-based diffusion backbone with dedicated trainable layers for text and image, fusing them at intermediate layers. They compare this approach with early fusion under different conditioning methods (concatenation and cross-attention) on the MS-COCO dataset. Intermediate fusion leads to better text-image alignment, evidenced by higher CLIP Scores and lower FID values compared to early fusion. Human evaluation confirms that intermediate fusion models generate images with more accurate object counts and are generally preferred over early fusion models. Intermediate fusion also improves efficiency by reducing FLOPs and increasing training speed due to fewer text-to-image attention calculations. The study primarily focuses on concatenation and cross-attention conditioning methods, leaving other conditioning strategies unexplored. The impact of varying model parameters and hyperparameters for intermediate fusion, especially in scaled-up foundation models, requires further investigation. diffusion models, text-to-image generation, multimodal fusion, vision-language alignment, intermediate fusion
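The early-vs-intermediate fusion contrast can be sketched in a few lines of Python (block count, dimensions, and the stand-in mixing blocks are assumptions; the paper uses a U-ViT backbone): text tokens join the image token sequence only at a middle block, so earlier blocks perform no text-to-image attention at all.

```python
# Hedged sketch of intermediate fusion in a toy transformer stack.
import numpy as np

def block(x, W):
    # Stand-in for a transformer block: simple token mixing + nonlinearity.
    return np.tanh(x @ W)

dim, n_blocks, fuse_at = 32, 6, 3
Ws = [np.random.randn(dim, dim) / np.sqrt(dim) for _ in range(n_blocks)]
img_tokens = np.random.randn(64, dim)    # noised image patch tokens
txt_tokens = np.random.randn(8, dim)     # text condition tokens

x = img_tokens
for i, W in enumerate(Ws):
    if i == fuse_at:                     # intermediate fusion point (early fusion would be i == 0)
        x = np.concatenate([x, txt_tokens], axis=0)
    x = block(x, W)

print(x.shape)   # (64 + 8, dim) after the fusion point
```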
2403.16510 Report Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework Ziyao Huang, Fan Tang, Yong Zhang, Xiaodong Cun, Juan Cao, Jintao Li, Tong-Yee Lee Despite the remarkable progress of talking-head-based avatar-creating solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system necessitating only a one-minute video clip of an individual for training, subsequently enabling the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we finetune a proposed structure-guided diffusion model on input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrary long temporal video, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and a simple yet effective batch-overlapped temporal denoising module is proposed to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods. Project page: https://github.com/ICTMCG/Make-Your-Anchor. This paper introduces "Make-Your-Anchor," a diffusion-based system for generating personalized 2D avatar videos from one-minute video clips. This system accurately synthesizes full-body anchor videos with realistic torso and hand movements. The proposed system addresses the limitations of current talking-head avatar systems that struggle to generate realistic full-body motions, particularly for anchor-style videos. The system utilizes a two-stage training strategy for a structure-guided diffusion model. It first pre-trains on a multi-identity dataset for motion generation and then fine-tunes on a specific individual's video to bind appearance to motion. For temporal consistency and arbitrary video length, the system employs batch-overlapped temporal denoising during inference. It also includes an identity-specific face enhancement module for improving facial detail realism. The system outperforms state-of-the-art GAN-based and diffusion-based methods in visual quality, temporal consistency, and identity preservation. A two-stage training strategy effectively binds motion to a specific individual's appearance, allowing for personalized avatar creation. Batch-overlapped temporal denoising enables the generation of long, temporally consistent videos without additional training. The system may struggle to preserve appearance when presented with poses significantly different from those seen during fine-tuning. The current system does not model foreground occlusions, which may lead to ghosting artifacts. Future work could address this by explicitly segmenting and preserving occluded elements. 2d avatar generation, diffusion models, video generation, motion-to-appearance synthesis, identity preservation
2403.16379 Report FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models Lin Zhao, Tianchen Zhao, Zinan Lin, Xuefei Ning, Guohao Dai, Huazhong Yang, Yu Wang In recent years, there has been significant progress in the development of text-to-image generative models. Evaluating the quality of the generative models is one essential step in the development process. Unfortunately, the evaluation process could consume a significant amount of computational resources, making the required periodic evaluation of model performance (e.g., monitoring training progress) impractical. Therefore, we seek to improve the evaluation efficiency by selecting the representative subset of the text-image dataset. We systematically investigate the design choices, including the selection criteria (textural features or image-based metrics) and the selection granularity (prompt-level or set-level). We find that the insights from prior work on subset selection for training data do not generalize to this problem, and we propose FlashEval, an iterative search algorithm tailored to evaluation data selection. We demonstrate the effectiveness of FlashEval on ranking diffusion models with various configurations, including architectures, quantization levels, and sampler schedules on COCO and DiffusionDB datasets. Our searched 50-item subset could achieve comparable evaluation quality to the randomly sampled 500-item subset for COCO annotations on unseen models, achieving a 10x evaluation speedup. We release the condensed subset of these commonly used datasets to help facilitate diffusion algorithm design and evaluation, and open-source FlashEval as a tool for condensing future datasets, accessible at https://github.com/thu-nics/FlashEval. This paper introduces FlashEval, an iterative search algorithm that identifies representative subsets of text-image datasets for faster and more accurate evaluation of text-to-image diffusion generative models. Evaluating text-to-image diffusion models is computationally expensive, especially when iterating on model design or training. Existing methods, like random subset sampling, offer poor accuracy-efficiency trade-offs. FlashEval aims to improve this trade-off by finding small, highly representative subsets for evaluation. FlashEval, inspired by evolutionary algorithms, iteratively searches for representative prompts in the dataset. It combines the strengths of prompt-wise search (efficiency) and set-wise search (accuracy). It employs a frequency-based prompt selection strategy to identify prompts that consistently contribute to well-performing subsets. The search process involves constructing and evaluating numerous subsets based on Kendall's Tau (KD) correlation with the full dataset ranking and iteratively refining the selection of prompts. FlashEval significantly outperforms random sampling and baseline search methods, achieving high ranking correlation (KD) with smaller subset sizes (e.g., 50-item subset comparable to a 500-item random subset). The subsets found by FlashEval generalize well to unseen models with different architectures, parameters, solvers, and step sizes. The search cost of FlashEval can be further reduced by using a smaller randomly sampled subset as a proxy for the full dataset ranking during the search process. The current implementation of FlashEval primarily focuses on ranking tasks; extending it to other evaluation metrics could be explored. Further investigation into optimizing the search efficiency of FlashEval, especially for very large datasets, is beneficial. text-to-image generation, diffusion models, evaluation metrics, subset selection, evolutionary algorithms
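A simplified Python sketch of the subset-search idea follows (the random score matrix, the single search round, and the top-decile cutoff are assumptions; the actual algorithm iterates and refines further): candidate subsets are scored by how well the model ranking they induce matches the full-set ranking via Kendall's tau, and prompts appearing in the best subsets are retained by frequency.

```python
# Simplified sketch of FlashEval-style evaluation-subset search.
import numpy as np
from scipy.stats import kendalltau
from collections import Counter

rng = np.random.default_rng(0)
n_models, n_prompts, subset_size, n_trials = 10, 200, 20, 300
scores = rng.normal(size=(n_models, n_prompts))          # per-prompt metric for each model (toy data)
full_scores = scores.mean(axis=1)                        # reference ranking from the full prompt set

trials = []
for _ in range(n_trials):
    subset = rng.choice(n_prompts, size=subset_size, replace=False)
    sub_scores = scores[:, subset].mean(axis=1)
    tau, _ = kendalltau(full_scores, sub_scores)         # ranking agreement with the full set
    trials.append((tau, subset))

# Frequency-based selection: count prompt occurrences in the top-scoring subsets.
counter = Counter()
trials.sort(key=lambda t: t[0], reverse=True)
for tau, subset in trials[: n_trials // 10]:
    counter.update(subset.tolist())
selected = [p for p, _ in counter.most_common(subset_size)]
print(sorted(selected)[:10])
```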
2403.16368 Report Distilling Semantic Priors from SAM to Efficient Image Restoration Models Quan Zhang, Xiaoyu Liu, Wei Li, Hanting Chen, Junchao Liu, Jie Hu, Zhiwei Xiong, Chun Yuan, Yunhe Wang In image restoration (IR), leveraging semantic priors from segmentation models has been a common approach to improve performance. The recent segment anything model (SAM) has emerged as a powerful tool for extracting advanced semantic priors to enhance IR tasks. However, the computational cost of SAM is prohibitive for IR, compared to existing smaller IR models. The incorporation of SAM for extracting semantic priors considerably hampers the model inference efficiency. To address this issue, we propose a general framework to distill SAM's semantic knowledge to boost existing IR models without interfering with their inference process. Specifically, our proposed framework consists of the semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD) scheme. SPF fuses two kinds of information between the restored image predicted by the original IR model and the semantic mask predicted by SAM for the refined restored image. SPD leverages a self-distillation manner to distill the fused semantic priors to boost the performance of original IR models. Additionally, we design a semantic-guided relation (SGR) module for SPD, which ensures semantic feature representation space consistency to fully distill the priors. We demonstrate the effectiveness of our framework across multiple IR models and tasks, including deraining, deblurring, and denoising. This paper introduces a novel framework designed to enhance existing image restoration (IR) models by distilling semantic knowledge from the Segment Anything Model (SAM) without compromising inference speed. SAM, despite its potential for extracting rich semantic priors, presents a computational bottleneck for IR tasks due to its large size. This framework addresses this limitation, enabling the utilization of SAM's strengths without sacrificing efficiency. The framework comprises two core schemes: Semantic Priors Fusion (SPF) fuses restored images from the IR model with SAM's semantic masks for refinement. Semantic Priors Distillation (SPD), incorporating a semantic-guided relation (SGR) module, transfers this fused knowledge to the original IR model, boosting its performance. The framework consistently outperforms baseline IR models, demonstrating substantial improvements in both objective metrics (PSNR, SSIM) and subjective visual quality (FID) across various IR tasks. Evaluations on downstream segmentation tasks using cityscape-syn datasets further highlight the framework's efficacy, exhibiting consistent enhancements in IoU, PA, and DICE metrics. Ablation studies validate the contribution of individual components (SPF, SPD, SGR) within the framework, underscoring their significance in enhancing IR performance. The framework necessitates the training of an additional IR model (f^IR2), potentially increasing training complexity. Future exploration could focus on extending the framework to incorporate semantic priors from diverse sources beyond SAM, further enriching its capabilities. image restoration, semantic priors, segment anything model (sam), knowledge distillation, semantic-guided relation
2403.16365 Report Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion Hossein Souri, Arpit Bansal, Hamid Kazemi, Liam Fowl, Aniruddha Saha, Jonas Geiping, Andrew Gordon Wilson, Rama Chellappa, Tom Goldstein, Micah Goldblum Modern neural networks are often trained on massive datasets that are web scraped with minimal human inspection. As a result of this insecure curation pipeline, an adversary can poison or backdoor the resulting model by uploading malicious data to the internet and waiting for a victim to scrape and train on it. Existing approaches for creating poisons and backdoors start with randomly sampled clean data, called base samples, and then modify those samples to craft poisons. However, some base samples may be significantly more amenable to poisoning than others. As a result, we may be able to craft more potent poisons by carefully choosing the base samples. In this work, we use guided diffusion to synthesize base samples from scratch that lead to significantly more potent poisons and backdoors than previous state-of-the-art attacks. Our Guided Diffusion Poisoning (GDP) base samples can be combined with any downstream poisoning or backdoor attack to boost its effectiveness. Our implementation code is publicly available at: https://github.com/hsouri/GDP . The paper introduces Guided Diffusion Poisoning (GDP), a method that leverages guided diffusion models to synthesize highly potent poisoned training data for computer vision tasks. Existing data poisoning and backdoor attacks often rely on randomly selected base samples, limiting their effectiveness. This work demonstrates that carefully chosen base samples can significantly enhance the potency of such attacks. GDP employs a three-step process: (1) Generate base samples with a diffusion model, weakly guided by a poisoning loss function while maintaining clean labels. (2) Utilize these base samples as initialization for downstream poisoning algorithms. (3) Filter generated poisons, selecting those with the lowest poisoning loss. GDP achieves significantly higher attack success rates compared to state-of-the-art targeted poisoning and backdoor attacks, even with very small poison budgets. The method is effective even with small perturbation budgets, making the poisons less detectable. GDP enhances the transferability of poisons, demonstrating improved performance in black-box settings where the victim model's architecture is unknown. GDP requires a dataset-specific diffusion model, which can be computationally expensive to train. Generating and filtering a large number of poisons is inefficient; exploring more reliable optimization strategies could improve this aspect. data poisoning, backdoor attacks, diffusion models, computer vision, adversarial machine learning
2403.16210 Report Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane Han Yan, Yang Li, Zhennan Wu, Shenzhou Chen, Weixuan Sun, Taizhang Shang, Weizhe Liu, Tian Chen, Xiaqiang Dai, Chao Ma, Hongdong Li, Pan Ji We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in one single tri-plane tensor, from which multiple Signed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses tri-planes into a latent space, and then the denoising diffusion process is employed to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement in the room or avatar cloth re-targeting. Frankenstein, a novel tri-plane diffusion-based framework for generating semantic-compositional 3D scenes in a single pass. Downstream applications often require semantically-decomposed 3D shapes, e.g., for realistic animation or part replacement. Existing methods struggle to generate such decompositions directly. The method encodes multiple SDFs, each representing a semantic part, within a single tri-plane. It uses a three-stage training process: 1) per-scene tri-plane fitting, 2) VAE compression of tri-planes into a latent space, 3) diffusion model training on the latent space for controllable generation. Frankenstein generates semantic-compositional 3D scenes for both rooms and avatars with clean part separation. The generated scenes allow for applications like part-wise texturing, object rearrangement, and cloth re-targeting. Coarse-to-fine optimization and semantic-aware point sampling during tri-plane fitting are crucial for high-quality reconstruction. Limited details due to using a single tri-plane, potentially solvable by incorporating block-wise scene representation. Slow VAE training, requiring exploration of more efficient architectures. 3d scene generation, semantic composition, diffusion model, tri-plane representation, conditional generation
2403.16141 Report Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes Takashi Otonari, Satoshi Ikehata, Kiyoharu Aizawa Recent advancements in the study of Neural Radiance Fields (NeRF) for dynamic scenes often involve explicit modeling of scene dynamics. However, this approach faces challenges in modeling scene dynamics in urban environments, where moving objects of various categories and scales are present. In such settings, it becomes crucial to effectively eliminate moving objects to accurately reconstruct static backgrounds. Our research introduces an innovative method, termed here as Entity-NeRF, which combines the strengths of knowledge-based and statistical strategies. This approach utilizes entity-wise statistics, leveraging entity segmentation and stationary entity classification through thing/stuff segmentation. To assess our methodology, we created an urban scene dataset masked with moving objects. Our comprehensive experiments demonstrate that Entity-NeRF notably outperforms existing techniques in removing moving objects and reconstructing static urban backgrounds, both quantitatively and qualitatively. This paper presents Entity-NeRF, a novel method for building NeRFs of dynamic urban scenes by identifying and removing multiple moving objects of various types and scales. Existing NeRF methods struggle with the complexity of dynamic urban scenes, where numerous moving objects of different sizes and categories are present. Explicitly modeling scene dynamics or treating moving objects as outliers using existing approaches proves ineffective. Entity-NeRF combines knowledge-based and statistical methods. It leverages entity segmentation for object identification, thing/stuff segmentation for stationary entity classification, and entity-wise statistics of reconstruction errors (EARR) for robust distractor labeling. Entity-NeRF effectively removes moving objects and reconstructs static backgrounds in urban scenes, outperforming existing methods like RobustNeRF in terms of foreground and background PSNR. The method demonstrates robustness to variations in object scale and scene complexity, accurately identifying distractors without excessively excluding static elements. Stationary entity classification using thing/stuff segmentation significantly improves training efficiency and final PSNR by incorporating complex backgrounds from the early stages of training. Entity-NeRF might face difficulties reconstructing backgrounds occluded by large moving objects. Shadows cast by moving objects are not explicitly handled and could be inadvertently incorporated into the training. neural radiance fields, dynamic scenes, urban environments, entity segmentation, novel view synthesis
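The entity-wise error statistic can be illustrated with a short Python sketch (the median aggregate and the fixed threshold are assumptions; the paper combines entity-wise error statistics with thing/stuff classification): per-pixel photometric errors are pooled per entity, and entities with abnormally high error are masked out as likely movers.

```python
# Illustrative sketch of entity-wise reconstruction-error statistics for
# distractor labelling. Threshold and statistic are placeholders.
import numpy as np

def label_distractors(error_map, entity_map, threshold=2.0):
    """error_map: (H, W) per-pixel reconstruction error,
    entity_map: (H, W) integer entity ids from entity segmentation."""
    ids = np.unique(entity_map)
    entity_err = {i: np.median(error_map[entity_map == i]) for i in ids}
    scene_median = np.median(error_map)
    distractor_ids = {i for i, e in entity_err.items() if e > threshold * scene_median}
    keep_mask = ~np.isin(entity_map, list(distractor_ids))
    return keep_mask            # True where pixels are kept to supervise the static NeRF

H, W = 60, 80
entity_map = np.random.randint(0, 5, size=(H, W))
error_map = np.random.rand(H, W) + (entity_map == 3) * 3.0   # entity 3 behaves like a mover
print(label_distractors(error_map, entity_map).mean())
```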
2403.16131 Report Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen DETR-like methods have significantly increased detection performance in an end-to-end manner. The mainstream two-stage frameworks of them perform dense self-attention and select a fraction of queries for sparse cross-attention, which is proven effective for improving performance but also introduces a heavy computational burden and high dependence on stable query selection. This paper demonstrates that suboptimal two-stage selection strategies result in scale bias and redundancy due to the mismatch between selected queries and objects in two-stage initialization. To address these issues, we propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries, for a better trade-off between computational efficiency and precision. The filtering process overcomes scale bias through a novel scale-independent salience supervision. To compensate for the semantic misalignment among queries, we introduce elaborate query refinement modules for stable two-stage initialization. Based on above improvements, the proposed Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, +4.4% AP on three challenging task-specific detection datasets, as well as 49.2% AP on COCO 2017 with less FLOPs. The code is available at https://github.com/xiuqhou/Salience-DETR. This paper proposes Salience DETR, a novel end-to-end object detection framework that addresses scale bias and redundancy in two-stage DETR-like detectors through hierarchical salience filtering refinement. Existing two-stage DETR methods suffer from heavy computational burden and scale bias in query selection, resulting in suboptimal performance, especially for small object detection. Salience DETR introduces: (1) Scale-independent salience supervision for unbiased query filtering. (2) Hierarchical query filtering to encode only selected discriminative queries. (3) Query refinement modules to address semantic misalignment among queries. Salience DETR achieves state-of-the-art performance on three task-specific detection datasets (ESD, CSD, MSSD) and competitive results on COCO 2017. It outperforms other methods with fewer FLOPs, demonstrating a better trade-off between computational efficiency and accuracy. The proposed scale-independent supervision and query refinement modules prove effective in mitigating scale bias and redundancy. The redundancy removal for two-stage queries relies on hand-crafted NMS and lacks an end-to-end solution. Exploring the potential of salience supervision for pixel-level tasks like instance segmentation is a promising future direction. object detection, detr, transformer, salience, query filtering
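A minimal Python sketch of hierarchical salience filtering follows (the keep ratios, the random salience scores, and the per-level token shapes are placeholders; in the paper the scores come from a supervised, scale-independent salience head): only the top-scoring queries at each feature level are forwarded to the transformer encoder.

```python
# Minimal sketch of salience-based query filtering across feature levels.
import numpy as np

def filter_queries(tokens_per_level, salience_per_level, keep_ratios):
    """tokens_per_level: list of (N_l, dim) arrays; salience_per_level: list of (N_l,) scores."""
    kept = []
    for tokens, salience, ratio in zip(tokens_per_level, salience_per_level, keep_ratios):
        k = max(1, int(round(ratio * len(tokens))))
        idx = np.argsort(salience)[::-1][:k]        # highest-salience queries at this level
        kept.append(tokens[idx])
    return np.concatenate(kept, axis=0)             # sparse query set forwarded to the encoder

levels = [np.random.randn(n, 256) for n in (1000, 250, 64)]
scores = [np.random.rand(n) for n in (1000, 250, 64)]
print(filter_queries(levels, scores, keep_ratios=(0.1, 0.3, 0.6)).shape)
```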
2403.16111 Report EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang Current diffusion-based video editing primarily focuses on local editing (e.g., object/background editing) or global style editing by utilizing various dense correspondences. However, these methods often fail to accurately edit the foreground and background simultaneously while preserving the original layout. We find that the crux of the issue stems from the imprecise distribution of attention weights across designated regions, including inaccurate text-to-attribute control and attention leakage. To tackle this issue, we introduce EVA, a zero-shot and multi-attribute video editing framework tailored for human-centric videos with complex motions. We incorporate a Spatial-Temporal Layout-Guided Attention mechanism that leverages the intrinsic positive and negative correspondences of cross-frame diffusion features. To avoid attention leakage, we utilize these correspondences to boost the attention scores of tokens within the same attribute across all video frames while limiting interactions between tokens of different attributes in the self-attention layer. For precise text-to-attribute manipulation, we use discrete text embeddings focused on specific layout areas within the cross-attention layer. Benefiting from the precise attention weight distribution, EVA can be easily generalized to multi-object editing scenarios and achieves accurate identity mapping. Extensive experiments demonstrate EVA achieves state-of-the-art results in real-world scenarios. Full results are provided at https://knightyxp.github.io/EVA/ EVA, a zero-shot multi-attribute video editing framework for human-centric videos using a novel Spatial-Temporal Layout-Guided Attention mechanism. Current video editing methods struggle with accurate multi-attribute editing while preserving layout and background, especially in videos with complex human motion. EVA leverages: 1) Spatially disentangled semantic masks for layout information and accurate text-to-attribute control. 2) Cross-frame diffusion feature similarity to enhance attention scores within attributes and minimize attention leakage between them. Achieves state-of-the-art results on benchmark datasets for both single and multi-object editing. Enables identity swapping in multi-object scenes. Outperforms existing methods in quantitative metrics (CLIP-T, Warp-error) and user studies evaluating subject edit accuracy, layout preservation, motion alignment, and overall preference. Relies on user-provided layout masks, limiting scalability. Future work includes automating mask generation and exploring higher-resolution video editing. video editing, text-to-video generation, diffusion models, attention mechanisms, layout preservation
2403.16095 Report CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field Jiarui Hu, Xianhao Chen, Boyin Feng, Guanglin Li, Liangjing Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui Recently neural radiance fields (NeRF) have been widely exploited as 3D representations for dense simultaneous localization and mapping (SLAM). Despite their notable successes in surface modeling and novel view synthesis, existing NeRF-based methods are hindered by their computationally intensive and time-consuming volume rendering pipeline. This paper presents an efficient dense RGB-D SLAM system, i.e., CG-SLAM, based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. Through an in-depth analysis of Gaussian Splatting, we propose several techniques to construct a consistent and stable 3D Gaussian field suitable for tracking and mapping. Additionally, a novel depth uncertainty model is proposed to ensure the selection of valuable Gaussian primitives during optimization, thereby improving tracking efficiency and accuracy. Experiments on various datasets demonstrate that CG-SLAM achieves superior tracking and mapping performance with a notable tracking speed of up to 15 Hz. We will make our source code publicly available. Project page: https://zju3dv.github.io/cg-slam. This paper presents CG-SLAM, an efficient dense RGB-D SLAM system based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. Existing NeRF-based SLAM methods are computationally intensive and time-consuming, hindering their ability to achieve both accuracy and efficiency. This paper aims to address this challenge by leveraging the efficiency of 3D Gaussian Splatting while ensuring mapping and tracking quality. The authors propose several techniques: 1) a CUDA framework for real-time dense RGB-D SLAM based on the derivatives of camera poses in 3D Gaussian Splatting, 2) a scale regularization term and depth alignment strategy to construct a consistent and stable 3D Gaussian field, and 3) a novel depth uncertainty model to select valuable Gaussian primitives for optimization. CG-SLAM achieves superior tracking accuracy compared to NeRF-based SLAM methods on Replica, TUM-RGBD, and ScanNet datasets. CG-SLAM demonstrates state-of-the-art reconstruction quality with high mapping accuracy in observed areas. CG-SLAM achieves real-time performance with a tracking speed of up to 15 Hz due to its efficient Gaussian-based representation and GPU acceleration. The Gaussian-based representation requires considerable memory usage. The method exhibits a weak prediction ability for unobserved areas. dense visual slam, neural rendering, 3d gaussian field, uncertainty modeling, real-time
2403.16048 Report Edit3K: Universal Representation Learning for Video Editing Components Xin Gu, Libo Zhang, Fan Chen, Longyin Wen, Yufei Wang, Tiejian Luo, Sijie Zhu This paper focuses on understanding the predominant video creation pipeline, i.e., compositional video editing with six main types of editing components, including video effects, animation, transition, filter, sticker, and text. In contrast to existing visual representation learning of visual materials (i.e., images/videos), we aim to learn visual representations of editing actions/components that are generally applied on raw materials. We start by proposing the first large-scale dataset for editing components of video creation, which covers about 3,094 editing components with 618,800 videos. Each video in our dataset is rendered by various image/video materials with a single editing component, which supports atomic visual understanding of different editing components. It can also benefit several downstream tasks, e.g., editing component recommendation, editing component recognition/retrieval, etc. Existing visual representation methods perform poorly because it is difficult to disentangle the visual appearance of editing components from raw materials. To that end, we benchmark popular alternative solutions and propose a novel method that learns to attend to the appearance of editing components regardless of raw materials. Our method achieves favorable results on editing component retrieval/recognition compared to the alternative solutions. A user study is also conducted to show that our representations cluster visually similar editing components better than other alternatives. Furthermore, our learned representations used to transition recommendation tasks achieve state-of-the-art results on the AutoTransition dataset. The code and dataset will be released for academic use. This paper introduces Edit3K, the first large-scale dataset for learning representations of video editing components (e.g., effects, transitions, filters). It also proposes a novel embedding guidance architecture and contrastive loss for learning these representations. Understanding video editing components is crucial for many downstream tasks like effect recommendation, detection, recognition, and automatic video editing. Existing datasets and methods are not designed for this task. Edit3K dataset is created by rendering videos using existing image/video materials and a diverse set of editing components. The proposed model utilizes a guided spatial-temporal encoder, a guided embedding decoder, and an embedding queue mechanism to learn disentangled representations of editing components. The proposed method significantly outperforms existing video representation learning approaches on editing component retrieval. User studies demonstrate that the learned embeddings cluster visually similar editing components better than alternative methods. The learned representations achieve state-of-the-art results on transition recommendation when applied to the AutoTransition dataset. The model currently uses low frames per second, limiting its ability to handle fast motion. The model might struggle to recognize editing components with subtle changes without access to the raw, unedited video. video editing, representation learning, dataset, contrastive learning, attention mechanism
2403.16020 Report PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu, Diana Marculescu As deep neural networks evolve from convolutional neural networks (ConvNets) to advanced vision transformers (ViTs), there is an increased need to eliminate redundant data for faster processing without compromising accuracy. Previous methods are often architecture-specific or necessitate re-training, restricting their applicability with frequent model updates. To solve this, we first introduce a novel property of lightweight ConvNets: their ability to identify key discriminative patch regions in images, irrespective of model's final accuracy or size. We demonstrate that fully-connected layers are the primary bottleneck for ConvNets performance, and their suppression with simple weight recalibration markedly enhances discriminative patch localization performance. Using this insight, we introduce PaPr, a method for substantially pruning redundant patches with minimal accuracy loss using lightweight ConvNets across a variety of deep learning architectures, including ViTs, ConvNets, and hybrid transformers, without any re-training. Moreover, the simple early-stage one-step patch pruning with PaPr enhances existing patch reduction methods. Through extensive testing on diverse architectures, PaPr achieves significantly higher accuracy over state-of-the-art patch reduction methods with similar FLOP count reduction. More specifically, PaPr reduces about 70% of redundant patches in videos with less than 0.8% drop in accuracy, and up to 3.7x FLOPs reduction, which is a 15% more reduction with 2.5% higher accuracy. Proposes PaPr, a training-free, one-step patch pruning method using lightweight ConvNets to accelerate inference in various deep learning models (ViTs, ConvNets, hybrid transformers). Addresses limitations of existing patch pruning techniques that require retraining, perform gradual reduction, and lack architectural generality. Leverages the inherent ability of lightweight ConvNets to identify discriminative regions by generating a Patch Significance Map (PSM) to guide patch pruning in larger models. Achieves significantly higher accuracy with lower computational cost compared to state-of-the-art patch reduction methods. Demonstrates robustness in patch localization across varying ConvNet proposal models and challenging image scenarios. Effectively reduces spatio-temporal redundancy in videos, leading to substantial FLOPs reduction with minimal accuracy loss. Current work focuses on discriminative tasks, future exploration in dense prediction tasks is promising. Further investigation into the impact of different upsampling methods on PSM generation and performance. patch pruning, vision transformers, convolutional neural networks, efficient inference, computer vision
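The one-step pruning idea can be sketched in a few lines of Python (the random feature map and the channel-mean scoring are assumptions standing in for the recalibrated, FC-suppressed ConvNet output): the lightweight ConvNet's spatial feature map is reduced to a Patch Significance Map on the ViT patch grid, and only the top-scoring patches are kept before running the large model.

```python
# Hedged sketch of PaPr-style one-step patch pruning.
import numpy as np

def prune_patches(conv_feat, patch_grid, keep_ratio=0.3):
    """conv_feat: (C, h, w) lightweight ConvNet feature map; patch_grid: ViT grid size P."""
    psm = conv_feat.mean(axis=0)                                  # (h, w) patch significance map
    # Nearest-neighbour resize of the PSM onto the P x P ViT patch grid.
    ys = np.arange(patch_grid) * psm.shape[0] // patch_grid
    xs = np.arange(patch_grid) * psm.shape[1] // patch_grid
    psm = psm[np.ix_(ys, xs)].ravel()                             # (P*P,)
    k = max(1, int(keep_ratio * psm.size))
    keep = np.argsort(psm)[::-1][:k]                              # indices of retained patches
    return np.sort(keep)                                          # feed only these tokens to the ViT

kept = prune_patches(np.random.rand(64, 7, 7), patch_grid=14, keep_ratio=0.3)
print(kept.size, kept[:10])
```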
2403.16016 Report Fill in the ____ (a Diffusion-based Image Inpainting Pipeline) Eyoel Gebre, Krishna Saxena, Timothy Tran Image inpainting is the process of taking an image and generating lost or intentionally occluded portions. Inpainting has countless applications including restoring previously damaged pictures, restoring the quality of images that have been degraded due to compression, and removing unwanted objects/text. Modern inpainting techniques have shown remarkable ability in generating sensible completions for images with mask occlusions. In our paper, an overview of the progress of inpainting techniques will be provided, along with identifying current leading approaches, focusing on their strengths and weaknesses. A critical gap in these existing models will be addressed, focusing on the ability to prompt and control what exactly is generated. We will additionally justify why we think this is the natural next progressive step that inpainting models must take, and provide multiple approaches to implementing this functionality. Finally, we will evaluate the results of our approaches by qualitatively checking whether they generate high-quality images that correctly inpaint regions with the objects that they are instructed to produce. This paper presents "Fill in the ____," a diffusion-based image inpainting pipeline that allows users to specify an object to be inserted into a scene using a target image. Existing inpainting models lack control over generated content, limiting their use in applications requiring specific object insertion. This work addresses this gap by enabling object-guided inpainting with diffusion models. The pipeline builds upon the RePaint algorithm, incorporating a target image and mask as inputs. It modifies the denoising process by combining information from the target image with the generated inpainting, resolving mask conflicts and ensuring seamless object integration. Several masking techniques and lambda scheduling are explored to enhance boundary realism and control the influence of the target image. The pipeline successfully inserts target objects into scenes with varying degrees of realism and faithfulness to the target, depending on chosen hyperparameters. Lambda scheduling, controlling the balance between the target image and the generated inpainting, proves crucial for achieving optimal results. Failure modes, such as high variance in generated content and biases from the DDPM training data, are identified. Current limitations include reliance on manual mask creation and potential biases from the DDPM training data. Future work involves automating mask generation, exploring alternative masking techniques, and refining lambda scheduling for enhanced adaptability. The ultimate goal is to develop a fully automated inpainting pipeline. image inpainting, diffusion models, generative ai, object insertion, repaint
2403.15789 Report In-Context Matting He Guo, Zixuan Ye, Zhiguo Cao, Hao Lu We introduce in-context matting, a novel task setting of image matting. Given a reference image of a certain foreground and guided priors such as points, scribbles, and masks, in-context matting enables automatic alpha estimation on a batch of target images of the same foreground category, without additional auxiliary input. This setting marries good performance in auxiliary input-based matting and ease of use in automatic matting, which finds a good trade-off between customization and automation. To overcome the key challenge of accurate foreground matching, we introduce IconMatting, an in-context matting model built upon a pre-trained text-to-image diffusion model. Conditioned on inter- and intra-similarity matching, IconMatting can make full use of reference context to generate accurate target alpha mattes. To benchmark the task, we also introduce a novel testing dataset ICM-57, covering 57 groups of real-world images. Quantitative and qualitative results on the ICM-57 testing set show that IconMatting rivals the accuracy of trimap-based matting while retaining the automation level akin to automatic matting. Code is available at https://github.com/tiny-smart/in-context-matting This paper introduces "in-context matting", a new image matting task that enables automatic alpha matte generation for a group of images with similar foregrounds using a single reference image and user-provided guidance (e.g., points, scribbles, masks) on that reference image. In-context matting bridges the gap between accuracy and efficiency, and between customization and automation, by combining the advantages of automatic matting (efficiency) and auxiliary input-based matting (customization and accuracy). The authors propose IconMatting, a model based on a pre-trained text-to-image diffusion model (Stable Diffusion) for in-context matting. IconMatting leverages inter-image similarity (matching between reference and target images) and intra-image similarity (self-attention within the target image) to accurately identify and extract the target foreground. IconMatting achieves comparable accuracy to trimap-based matting while maintaining the automation level of automatic matting. A novel testing dataset, ICM-57, is introduced for benchmarking in-context matting. Experiments demonstrate the effectiveness of IconMatting in handling various foreground categories and scenes. The performance of IconMatting improves with more reference inputs, but the gains diminish after a certain number. The current model is trained only on real-world datasets, and incorporating composited data could potentially further enhance performance. image matting, in-context learning, diffusion models, stable diffusion, semantic correspondence
2403.15698 Report SceneX: Procedural Controllable Large-scale Scene Generation via Large-language Models Mengqi Zhou, Jun Hou, Chuanchen Luo, Yuxi Wang, Zhaoxiang Zhang, Junran Peng Due to its great application potential, large-scale scene generation has drawn extensive attention in academia and industry. Recent research employs powerful generative models to create desired scenes and achieves promising results. However, most of these methods represent the scene using 3D primitives (e.g., point cloud or radiance field) incompatible with the industrial pipeline, which leads to a substantial gap between academic research and industrial deployment. Procedural Controllable Generation (PCG) is an efficient technique for creating scalable and high-quality assets, but it is unfriendly for ordinary users as it demands profound domain expertise. To address these issues, we resort to using the large language model (LLM) to drive the procedural modeling. In this paper, we introduce a large-scale scene generation framework, SceneX, which can automatically produce high-quality procedural models according to designers' textual descriptions. Specifically, the proposed method comprises two components, PCGBench and PCGPlanner. The former encompasses an extensive collection of accessible procedural assets and thousands of hand-crafted API documents. The latter aims to generate executable actions for Blender to produce controllable and precise 3D assets guided by the user's instructions. Our SceneX can generate a city spanning 2.5 km × 2.5 km with delicate layout and geometric structures, drastically reducing the time cost from several weeks for professional PCG engineers to just a few hours for an ordinary user. Extensive experiments demonstrate the capability of our method in controllable large-scale scene generation and editing, including asset placement and season translation. This paper introduces SceneX, a novel framework for generating large-scale 3D scenes from textual descriptions using Large Language Models (LLMs) and Procedural Content Generation (PCG). SceneX bridges the gap between academic research and industrial applications by generating scenes directly compatible with industrial pipelines, unlike methods relying on point clouds or radiance fields. SceneX uses PCGBench, a vast dataset of PCG assets and API documentation, and PCGPlanner, an LLM agent hierarchy for task planning, asset retrieval, and action execution in Blender. SceneX generates highly realistic and detailed large-scale scenes, including natural environments and cities, significantly faster than previous methods and human experts. The generated scenes exhibit high aesthetic quality, surpassing existing text-to-3D and Blender-driven generation methods in user and expert evaluations. SceneX enables controllable and personalized scene editing, allowing users to modify generated assets and scenes based on their instructions. SceneX's performance depends on the capabilities of the pre-trained LLM, potentially limiting its generalizability. The current version of PCGBench has a limited number of assets and APIs, which can restrict the diversity of generated scenes. large-scale scene generation, llm agents, pcg, blender, text-to-3d
2403.15679 Report DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes Hao Yan, Zhihui Ke, Xiaobo Zhou, Tie Qiu, Xidong Shi, Dadong Jiang Implicit neural representations for video (NeRV) have recently become a novel way for high-quality video representation. However, existing works employ a single network to represent the entire video, which implicitly confuses static and dynamic information. This leads to an inability to effectively compress the redundant static information and a lack of explicit modeling of globally temporal-coherent dynamic details. To solve the above problems, we propose DS-NeRV, which decomposes videos into sparse learnable static codes and dynamic codes without the need for explicit optical flow or residual supervision. By setting different sampling rates for the two codes and applying weighted sum and interpolation sampling methods, DS-NeRV efficiently utilizes redundant static information while maintaining high-frequency details. Additionally, we design a cross-channel attention-based (CCA) fusion module to efficiently fuse these two codes for frame decoding. Our approach achieves a high-quality reconstruction of 31.2 PSNR with only 0.35M parameters thanks to its separate static and dynamic code representation and outperforms existing NeRV methods in many downstream tasks. Our project website is at https://haoyan14.github.io/DS-NeRV. This paper presents DS-NeRV, a new video INR that decomposes videos into separate learnable static and dynamic codes, improving compression and quality without explicit optical flow or residual supervision. Existing NeRV methods struggle to efficiently compress videos due to mixing static and dynamic information, leading to difficulties in reducing redundancy and modeling temporal coherence. DS-NeRV aims to address these issues. DS-NeRV uses sparse learnable static codes with weighted sum sampling and dynamic codes with interpolation sampling to represent video content. It employs a cross-channel attention-based fusion module to combine these codes for frame reconstruction. DS-NeRV achieves state-of-the-art video reconstruction quality, outperforming previous NeRV methods on Bunny, UVG, and DAVIS datasets. The method demonstrates strong performance in downstream tasks like video interpolation and inpainting, highlighting its ability to capture temporal coherence. DS-NeRV exhibits efficient compression capabilities, achieving competitive results compared to traditional codecs like H.264 and HEVC. Determining the optimal lengths for static and dynamic codes currently requires manual adjustment for each video. Finding the best dimensions for static and dynamic codes involves a testing phase. video representation, implicit neural representations (inr), video compression, video inpainting, video interpolation
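A rough sketch of the two code-sampling schemes described for DS-NeRV. The distance-based softmax weighting and the tensor shapes are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: weighted-sum sampling over sparse static codes, and linear interpolation
# between dynamic codes, for a queried frame index. Weighting scheme is assumed.
import torch

def sample_static(static_codes, frame_idx, num_frames):
    # static_codes: (Ns, C). Weight every static code by its temporal distance
    # to the queried frame (softmax over negative distances).
    Ns = static_codes.shape[0]
    positions = torch.linspace(0, num_frames - 1, Ns)
    weights = torch.softmax(-(positions - frame_idx).abs(), dim=0)
    return (weights[:, None] * static_codes).sum(dim=0)

def sample_dynamic(dynamic_codes, frame_idx, num_frames):
    # dynamic_codes: (Nd, C), stored at a higher rate; interpolate two neighbours.
    Nd = dynamic_codes.shape[0]
    pos = frame_idx / max(num_frames - 1, 1) * (Nd - 1)
    lo, hi = int(pos), min(int(pos) + 1, Nd - 1)
    w = pos - lo
    return (1 - w) * dynamic_codes[lo] + w * dynamic_codes[hi]
```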
2403.15624 Report Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Previous approaches have adopted Neural Radiance Fields (NeRFs) to analyze 3D scenes. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is distilling pre-trained 2D semantics into 3D Gaussians. We design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, without the additional training required by NeRFs. We further build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference. We explore several applications of Semantic Gaussians: semantic segmentation on ScanNet-20, where our approach attains a 4.2% mIoU and 4.0% mAcc improvement over prior open-vocabulary scene understanding counterparts; object part segmentation, scene editing, and spatial-temporal segmentation with better qualitative results over 2D and 3D baselines, highlighting its versatility and effectiveness on supporting diverse downstream tasks. This paper proposes Semantic Gaussians, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting by distilling knowledge from pre-trained 2D encoders. Open-vocabulary 3D scene understanding is crucial for various real-world applications like robotics and augmented reality, enabling machines to interact effectively with diverse environments. The method projects semantic features from pre-trained 2D models (e.g., OpenSeg, CLIP) onto 3D Gaussian points. Additionally, a 3D semantic network (MinkowskiNet) is introduced to predict semantic components directly from raw 3D Gaussians. Semantic Gaussians outperforms OpenSeg on ScanNet-20 semantic segmentation, demonstrating effective multi-view information integration. It achieves high-quality part segmentation consistent across different views, outperforming OpenSeg and LERF. The method exhibits promising results in spatiotemporal tracking and language-guided editing. Scene understanding performance depends on the accuracy of 2D pre-trained models and the quality of 3D Gaussians. Future work includes exploring better 3D Gaussian representation and multi-modal pre-training. open-vocabulary scene understanding, 3d gaussian splatting, semantic segmentation, part segmentation, spatiotemporal tracking
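A hedged sketch of the training-free projection step described for Semantic Gaussians: 2D features from a pre-trained encoder are splatted onto Gaussian centers by projecting each center into every view and averaging the sampled features. Camera conventions and variable names are assumptions, not the authors' code.

```python
# Sketch: average per-pixel 2D features over all views in which each Gaussian
# center is visible, producing a per-Gaussian semantic component.
import torch

def project_features(centers, feats_2d, intrinsics, extrinsics):
    """centers: (N, 3) Gaussian means; feats_2d: list of (C, H, W) feature maps;
    intrinsics: (V, 3, 3); extrinsics: (V, 4, 4) world-to-camera matrices."""
    N, device = centers.shape[0], centers.device
    C = feats_2d[0].shape[0]
    acc = torch.zeros(N, C, device=device)
    cnt = torch.zeros(N, 1, device=device)
    homog = torch.cat([centers, torch.ones(N, 1, device=device)], dim=1)  # (N, 4)
    for v, feat in enumerate(feats_2d):
        cam = (extrinsics[v] @ homog.T).T[:, :3]           # points in camera frame
        valid = cam[:, 2] > 0                               # in front of the camera
        pix = (intrinsics[v] @ cam.T).T
        pix = pix[:, :2] / pix[:, 2:3]                      # perspective divide
        H, W = feat.shape[1:]
        inb = valid & (pix[:, 0] >= 0) & (pix[:, 0] < W) & (pix[:, 1] >= 0) & (pix[:, 1] < H)
        u = pix[inb, 0].long().clamp(0, W - 1)
        vv = pix[inb, 1].long().clamp(0, H - 1)
        acc[inb] += feat[:, vv, u].T                        # sample the feature map
        cnt[inb] += 1
    return acc / cnt.clamp(min=1)                           # (N, C) semantic component
```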
2403.15583 Report U-ARE-ME: Uncertainty-Aware Rotation Estimation in Manhattan Environments Aalok Patwardhan, Callum Rhodes, Gwangbin Bae, Andrew J. Davison Camera rotation estimation from a single image is a challenging task, often requiring depth data and/or camera intrinsics, which are generally not available for in-the-wild videos. Although external sensors such as inertial measurement units (IMUs) can help, they often suffer from drift and are not applicable in non-inertial reference frames. We present U-ARE-ME, an algorithm that estimates camera rotation along with uncertainty from uncalibrated RGB images. Using a Manhattan World assumption, our method leverages the per-pixel geometric priors encoded in single-image surface normal predictions and performs optimisation over the SO(3) manifold. Given a sequence of images, we can use the per-frame rotation estimates and their uncertainty to perform multi-frame optimisation, achieving robustness and temporal consistency. Our experiments demonstrate that U-ARE-ME performs comparably to RGB-D methods and is more robust than sparse feature-based SLAM methods. We encourage the reader to view the accompanying video at https://callum-rhodes.github.io/U-ARE-ME for a visual overview of our method. This paper presents U-ARE-ME, an algorithm that estimates camera rotation and uncertainty from uncalibrated RGB images using surface normal predictions and a Manhattan World assumption. Accurate and robust rotation estimation from monocular images is crucial for various applications, especially in-the-wild videos where depth data or camera intrinsics are often unavailable. Existing methods struggle with textureless environments, image degradation, or require calibrated cameras. The method leverages single-image surface normal predictions and optimizes camera rotation by aligning predicted normals to principal directions. It introduces an uncertainty-weighted cost function to handle unreliable predictions and performs multi-frame optimization using a factor graph for temporal consistency. U-ARE-ME achieves comparable accuracy to RGB-D methods and outperforms feature-based SLAM (ORB-SLAM) in challenging, real-world scenarios (ScanNet). The method is robust to image degradation and does not require camera intrinsics, making it suitable for in-the-wild videos. The estimated up-vector enables applications like ground segmentation, demonstrating the versatility of the approach. The accuracy depends on the quality of surface normal predictions, which can be affected by factors like object boundaries and small object size. The assumption of a Manhattan World may not hold true for all environments. rotation estimation, manhattan world, surface normals, uncertainty quantification, temporal consistency
2403.15530 Report Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting Zheng Zhang, Wenbo Hu, Yixing Lao, Tong He, Hengshuang Zhao 3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis results while advancing real-time rendering performance. However, it relies heavily on the quality of the initial point cloud, resulting in blurring and needle-like artifacts in areas with insufficient initializing points. This is mainly attributed to the point cloud growth condition in 3DGS that only considers the average gradient magnitude of points from observable views, thereby failing to grow for large Gaussians that are observable for many viewpoints while many of them are only covered in the boundaries. To this end, we propose a novel method, named Pixel-GS, to take into account the number of pixels covered by the Gaussian in each view during the computation of the growth condition. We regard the covered pixel numbers as the weights to dynamically average the gradients from different views, such that the growth of large Gaussians can be prompted. As a result, points within the areas with insufficient initializing points can be grown more effectively, leading to a more accurate and detailed reconstruction. In addition, we propose a simple yet effective strategy to scale the gradient field according to the distance to the camera, to suppress the growth of floaters near the camera. Extensive experiments both qualitatively and quantitatively demonstrate that our method achieves state-of-the-art rendering quality while maintaining real-time rendering speed, on the challenging Mip-NeRF 360 and Tanks & Temples datasets. Pixel-GS enhances 3D Gaussian Splatting by enabling effective point growth in areas with insufficient initial points, thereby reducing blurring and needle-like artifacts. The effectiveness of 3D Gaussian Splatting heavily relies on the quality of the initial point cloud. Inadequate initializing points lead to rendering artifacts. Pixel-GS introduces a pixel-aware gradient that considers the number of pixels covered by each Gaussian in each view during the point cloud growth condition calculation. Additionally, it scales the gradient field according to the distance to the camera to suppress floaters. Pixel-GS achieves state-of-the-art rendering quality on challenging datasets like Mip-NeRF 360 and Tanks & Temples. It significantly reduces blurring and needle-like artifacts in sparse regions. Pixel-GS demonstrates robustness to the sparsity of the initial point cloud. The increased number of points in Pixel-GS leads to slightly higher memory consumption compared to 3DGS. The strategy to address floaters is inspired by NeRF's rendering mechanism and may not generalize well to other rendering techniques. Future work could investigate optimizing the trade-off between point cloud density and rendering efficiency. view synthesis, point-based radiance field, real-time rendering, 3d gaussian splatting, adaptive density control
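The pixel-aware growth condition described for Pixel-GS can be illustrated as a weighted average of per-view gradients, with the pixel coverage of each Gaussian as the weight; the threshold value and tensor names below are assumptions.

```python
# Sketch of a pixel-aware densification criterion: large Gaussians that are seen
# from many views but only at their boundaries get a fairer vote than under a
# plain per-view mean.
import torch

def densify_mask(grad_norms, pixel_counts, threshold=0.0002):
    """grad_norms:   (N, V) view-space positional gradient magnitude per Gaussian/view.
    pixel_counts: (N, V) pixels covered by the Gaussian in that view (0 if unseen).
    Returns a boolean mask of Gaussians selected for split/clone."""
    weighted = (grad_norms * pixel_counts).sum(dim=1)
    denom = pixel_counts.sum(dim=1).clamp(min=1)
    pixel_aware_avg = weighted / denom
    return pixel_aware_avg > threshold
```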
2403.15389 Report DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data Hanrong Ye, Dan Xu Recently, there has been an increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data, where each training sample is only labeled for a subset of the tasks. The missing of task labels in training leads to low-quality and noisy predictions, as can be observed from state-of-the-art methods. To tackle this issue, we reformulate the partially-labeled multi-task dense prediction as a pixel-level denoising problem, and propose a novel multi-task denoising diffusion framework coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and generate rectified outputs for different tasks. To exploit multi-task consistency in denoising, we further introduce a Multi-Task Conditioning strategy, which can implicitly utilize the complementary nature of the tasks to help learn the unlabeled tasks, leading to an improvement in the denoising performance of the different tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps, and outperform the state-of-the-art methods on three challenging multi-task benchmarks, under two different partial-labeling evaluation settings. The code is available at https://prismformore.github.io/diffusionmtl/. This paper presents DiffusionMTL, a novel multi-task denoising diffusion framework designed to address noisy predictions in multi-task learning from partially annotated data. Annotating multi-task datasets at pixel level is expensive, and training with partially annotated data often results in noisy predictions. Existing methods, though improving label efficiency, still suffer from this issue. Hence, a new methodology is needed to rectify noisy predictions and enhance multi-task prediction quality. DiffusionMTL utilizes a two-step approach: (i) generating initial multi-task predictions with a shared encoder-decoder backbone and (ii) refining these predictions using a Multi-Task Denoising Diffusion Network (MTDNet). MTDNet employs two diffusion mechanisms: Prediction Diffusion (denoising in output space) and Feature Diffusion (refining in latent feature space). A Multi-Task Conditioning strategy is introduced to facilitate denoising and enable learning of unlabeled tasks via cross-task information sharing. DiffusionMTL demonstrates substantial performance improvements, outperforming competing methods on three benchmarks (PASCAL, NYUD, Cityscapes) under different partial-labeling settings. Ablation studies confirm the effectiveness of the denoising network, multi-task conditioning, and both diffusion mechanisms. Qualitative analysis showcases DiffusionMTL's ability to effectively denoise noisy predictions and generate cleaner, more accurate multi-task prediction maps. The current implementation primarily focuses on the one-label setting; exploring its generalization to more complex scenarios with varying label availability per task is a potential avenue. Further research on efficiently scaling DiffusionMTL to a larger number of tasks with diverse characteristics and computational demands is warranted. multi-task learning, denoising diffusion models, partially supervised learning, dense prediction, computer vision
2403.15383 Report ThemeStation: Generating Theme-Aware 3D Assets from Few Exemplars Zhenwei Wang, Tengfei Wang, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau Real-world applications often require a large gallery of 3D assets that share a consistent theme. While remarkable advances have been made in general 3D content creation from text or image, synthesizing customized 3D assets following the shared theme of input 3D exemplars remains an open and challenging problem. In this work, we present ThemeStation, a novel approach for theme-aware 3D-to-3D generation. ThemeStation synthesizes customized 3D assets based on given few exemplars with two goals: 1) unity for generating 3D assets that thematically align with the given exemplars and 2) diversity for generating 3D assets with a high degree of variations. To this end, we design a two-stage framework that draws a concept image first, followed by a reference-informed 3D modeling stage. We propose a novel dual score distillation (DSD) loss to jointly leverage priors from both the input exemplars and the synthesized concept image. Extensive experiments and user studies confirm that ThemeStation surpasses prior works in producing diverse theme-aware 3D models with impressive quality. ThemeStation also enables various applications such as controllable 3D-to-3D generation. ThemeStation, a novel two-stage framework for theme-aware 3D-to-3D generation. It synthesizes diverse 3D assets thematically consistent with a few input exemplars, balancing unity and diversity. Addresses limitations of text/image-based 3D generation (ambiguity, inconsistency) by leveraging richer information from 3D exemplars, enabling automatic synthesis of large, thematically consistent 3D asset galleries. 1. **Theme-driven concept image generation:** Fine-tunes a text-to-image diffusion model on exemplar renderings to generate diverse concept images. 2. **Reference-informed 3D asset modeling:** Uses concept images as rough guidance and refines them into 3D models via dual score distillation (DSD). DSD leverages concept prior (from concept images) at high noise levels for global layout and reference prior (from exemplars) at low noise levels for fine details. Outperforms state-of-the-art image-to-3D and 3D variation methods in generative diversity, quality, and multi-view semantic coherence. Generates compelling and diverse 3D models with finer details, even with a single exemplar. Enables applications like controllable 3D-to-3D generation. Current pipeline requires hours for optimization; future work can explore faster diffusion/rendering techniques. Reliance on a two-stage pipeline introduces potential for poor initialization; future work can explore feed-forward models. 3d generation, exemplar-based generation, diffusion models, dual score distillation, theme-aware generation
2403.15382 Report DragAPart: Learning a Part-Level Motion Prior for Articulated Objects Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi We introduce DragAPart, a method that, given an image and a set of drags as input, can generate a new image of the same object in a new state, compatible with the action of the drags. Differently from prior works that focused on repositioning objects, DragAPart predicts part-level interactions, such as opening and closing a drawer. We study this problem as a proxy for learning a generalist motion model, not restricted to a specific kinematic structure or object category. To this end, we start from a pre-trained image generator and fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce. Combined with a new encoding for the drags and dataset randomization, the new model generalizes well to real images and different categories. Compared to prior motion-controlled generators, we demonstrate much better part-level motion understanding. This paper introduces DragAPart, an interactive image generator that synthesizes images of objects in new states compatible with user-specified drags, focusing on part-level interactions like opening drawers instead of just repositioning objects. Current generative models struggle to capture nuanced part-level motion, often resorting to unrealistic object manipulation. DragAPart addresses this by learning a generalist motion model applicable to diverse objects and their articulations. The authors created Drag-a-Move, a synthetic dataset with drag annotations, by animating and rendering objects from GAPartNet. They then trained DragAPart, which uses a novel drag encoding mechanism, on this dataset, leveraging pre-trained diffusion models like Stable Diffusion and DiT. DragAPart significantly outperforms state-of-the-art methods in quantitative metrics like PSNR, SSIM, and LPIPS on both synthetic and real-world datasets. Qualitative comparisons demonstrate DragAPart's ability to generate realistic object articulations while preserving object identity and visual details. The learned motion model proves useful for downstream applications such as motion analysis for articulated objects and segmentation of moving parts. The model currently lacks explicit enforcement of consistency for generated images of the same object across different viewpoints and drag conditions. The authors trained separate models for everyday objects and humans, limiting its generalizability to all moving entities. generative models, motion synthesis, part-level interaction, drag-based control, synthetic data
2403.15378 Report Long-CLIP: Unlocking the Long-Text Capability of CLIP Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The length of the text token is restricted to 77, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns the CLIP latent space, making it readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities, including (1) a knowledge-preserved stretching of positional embedding and (2) a primary component matching of CLIP features. With leveraging just one million extra long text-image pairs, Long-CLIP has shown the superiority to CLIP for about 20% in long caption text-image retrieval and 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner. Introduces Long-CLIP, a plug-and-play alternative to CLIP that supports long-text input while retaining or surpassing CLIP's zero-shot generalizability. CLIP's limited text input length (77 tokens, effectively only 20) hinders its ability to handle detailed descriptions and capture fine-grained information, limiting its applications in image retrieval and text-to-image generation. Long-CLIP employs two novel strategies: 1) Knowledge-Preserved Stretching of positional embedding, preserving well-trained positions while interpolating others. 2) Primary Component Matching of CLIP features, aligning both fine-grained and coarse-grained image features with corresponding long and short captions. Long-CLIP achieves up to 25% improvement in recall rate for long-text image retrieval tasks. It shows a 6% improvement in recall rate for traditional short-text image retrieval tasks on COCO and Flickr30k. Long-CLIP maintains zero-shot classification performance and enables plug-and-play integration for enhanced text-to-image generation with detailed prompts. Long-CLIP still has an upper bound on input token length, though significantly extended. Future work includes exploring the impact of scaling up training data with long text-image pairs. multimodality, zero-shot image classification, text-image retrieval, text-to-image generation, clip
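A small sketch of what knowledge-preserved stretching of the positional embedding could look like: the first well-trained positions are kept fixed and only the remaining positions are interpolated to reach a longer context. The split index and target length are assumptions for illustration.

```python
# Sketch: keep the first `keep` CLIP positional embeddings as-is, linearly
# interpolate the rest up to a longer context length.
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb, keep=20, target_len=248):
    """pos_emb: (77, D) original CLIP text positional embedding."""
    kept = pos_emb[:keep]                                   # preserve trained positions
    rest = pos_emb[keep:].T.unsqueeze(0)                    # (1, D, 77 - keep)
    new_rest = F.interpolate(rest, size=target_len - keep,
                             mode="linear", align_corners=True)
    return torch.cat([kept, new_rest.squeeze(0).T], dim=0)  # (target_len, D)
```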
2403.15377 Report InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies the different self- or weakly-supervised learning frameworks of masked video token reconstruction, cross-modal contrastive learning, and next token prediction. Different training stages would guide our model to capture different levels of structure and semantic information through different pretext tasks. At the data level, we prioritize the spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. We scale both data and model size for our InternVideo2. Through extensive experiments, we validate our designs and demonstrate the state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason and comprehend long temporal contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo2/. Introduces InternVideo2, a video foundation model (ViFM) for action recognition, video-text tasks, and video-centric dialogue, achieving state-of-the-art performance on 65 out of 74 video/audio tasks. Transferable spatiotemporal representations are critical for diverse applications like video searching, robotics, and self-driving. Employs progressive training with three stages: masked video token reconstruction, video-audio-speech-language contrastive learning, and next token prediction with a large language model (LLM). Achieves new state-of-the-art results on Kinetics (92.1%/91.9%/85.9% on K400/600/700), SomethingSomething V2, Moments in Time, ActivityNet, and HACS. Outperforms previous state-of-the-art methods in zero-shot video retrieval across various benchmarks, demonstrating strong video-language alignment. Excels in video-centric dialogue and long video understanding, showing the ability to reason and comprehend long temporal contexts. Limitations in input resolution, sampling rate, and compressed tokens restrict the expression of rich video information. Scalability and computational feasibility considerations limit joint learning of all optimization objectives. video foundation model, multimodal learning, action recognition, video retrieval, video-centric dialogue
2403.15360 Report SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series Badri N. Patro, Vijay S. Agneeswaran Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, playing a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity concerning input sequence length. State Space Models (SSMs) like S4 and others (Hippo, Global Convolutions, liquid S4, LRU, Mega, and Mamba) have emerged to address the above issues to help handle longer sequence lengths. Mamba, while being the state-of-the-art SSM, has a stability issue when scaled to large networks for computer vision datasets. We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies across image and time-series benchmarks demonstrate that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers. Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet and transfer learning benchmarks such as Stanford Car and Flower, as well as task learning benchmarks and seven time series benchmark datasets. The project page is available at https://github.com/badripatro/Simba. This paper proposes SiMBA, a novel architecture for vision and multivariate time series modeling that leverages the strengths of Mamba (a state-of-the-art State Space Model) for sequence modeling and introduces EinFFT, a new technique for channel modeling. Existing State Space Models (SSMs) often struggle with information propagation in long sequences and lack efficient channel modeling techniques. SiMBA addresses these limitations, aiming to bridge the performance gap between SSMs and attention-based transformers. SiMBA utilizes the Mamba block for sequence modeling to handle long-range dependencies and introduces EinFFT, based on Einstein FFT and learnable layers, for efficient and stable channel modeling. SiMBA outperforms existing SSMs on ImageNet, achieving state-of-the-art performance for SSMs on this benchmark. The architecture demonstrates excellent generalization capabilities, achieving superior results on six standard time series datasets for multivariate forecasting. SiMBA shows competitive performance in transfer learning tasks on CIFAR, Stanford Car, and Flower datasets, as well as in downstream tasks like instance segmentation on MS COCO. While SiMBA closes the performance gap for small and base-sized models, a gap still exists with large-sized transformers, requiring further exploration in scaling SiMBA. The paper primarily focuses on vision and time series data, leaving potential applications in other domains like natural language processing for future investigation. state space models, transformers, channel modeling, sequence modeling, computer vision, time series forecasting
2403.15330 Report Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization Jimyeong Kim, Jungwon Park, Wonjong Rhee In text-to-image personalization, a timely and crucial challenge is the tendency of generated images overfitting to the biases present in the reference images. We initiate our study with a comprehensive categorization of the biases into background, nearby-object, tied-object, substance (in style re-contextualization), and pose biases. These biases manifest in the generated images due to their entanglement into the subject embedding. This undesired embedding entanglement not only results in the reflection of biases from the reference images into the generated images but also notably diminishes the alignment of the generated images with the given generation prompt. To address this challenge, we propose SID (Selectively Informative Description), a text description strategy that deviates from the prevalent approach of only characterizing the subject's class identification. SID is generated utilizing multimodal GPT-4 and can be seamlessly integrated into optimization-based models. We present comprehensive experimental results along with analyses of cross-attention maps, subject-alignment, non-subject-disentanglement, and text-alignment. This paper introduces SID (Selectively Informative Description) as a novel description format for personalized text-to-image diffusion models. SID aims to alleviate the problem of undesired embedding entanglement. Existing personalized text-to-image models often exhibit undesired entanglement, limiting their ability to generate images that faithfully represent the intended subject in diverse contexts. SID addresses this by providing more informative descriptions that explicitly differentiate the subject from its background and associated objects. The authors leverage the capabilities of instruction-following Vision-Language Models (VLMs) to automatically generate SIDs from reference images. These SIDs, incorporating details about both the subject and its surroundings, are then used to train personalized text-to-image diffusion models. SID significantly improves the performance of personalized text-to-image generation, particularly in scenarios with highly biased reference images. The method proves effective even with a single reference image, surpassing the performance of both encoder-based and fine-tuning-based personalization methods. Human evaluation confirms the superiority of SID-integrated models, showcasing significant improvements in text alignment, subject preservation, and background disentanglement. The study highlights the occasional imperfections in VLM-generated descriptions, which can sometimes lead to undesired artifacts in the generated images. The authors acknowledge the limitations of their evaluation measures in capturing style re-contextualization and plan to explore suitable measures for this aspect in future work. text-to-image generation, personalized image synthesis, diffusion models, vision-language models, embedding entanglement
2403.15309 Report Controlled Training Data Generation with Diffusion Models Teresa Yeo, Andrei Atanov, Harold Benoit, Aleksandr Alekseev, Ruchira Ray, Pooya Esmaeil Akhoondi, Amir Zamir In this work, we present a method to control a text-to-image generative model to produce training data specifically "useful" for supervised learning. Unlike previous works that employ an open-loop approach and pre-define prompts to generate new data using either a language model or human expertise, we develop an automated closed-loop system which involves two feedback mechanisms. The first mechanism uses feedback from a given supervised model and finds adversarial prompts that result in image generations that maximize the model loss. While these adversarial prompts result in diverse data informed by the model, they are not informed of the target distribution, which can be inefficient. Therefore, we introduce the second feedback mechanism that guides the generation process towards a certain target distribution. We call the method combining these two mechanisms Guided Adversarial Prompts. We perform our evaluations on different tasks, datasets and architectures, with different types of distribution shifts (spuriously correlated data, unseen domains) and demonstrate the efficiency of the proposed feedback mechanisms compared to open-loop approaches. This paper presents a novel closed-loop method for generating useful training data for supervised learning models. It employs two feedback mechanisms to control a text-to-image generative model, specifically finding prompts that are both adversarial to the model and relevant to a target distribution. This is important to address the limitations of static datasets and the need for adaptive, cost-efficient methods to improve model generalization under distribution shifts, especially in scenarios where real-world test conditions change over time. The method uses 1) Adversarial Prompt Optimization to identify prompts that maximize the loss of a given supervised model, reflecting its failure modes, and 2) Target Distribution Informed Generation, implemented with CLIP guidance, to guide the generation process towards a target distribution, leveraging textual descriptions or unlabeled image samples. Guided Adversarial Prompts (GAP) demonstrate higher data efficiency compared to open-loop and solely model/target-informed methods for image classification tasks, particularly on the Waterbirds and iWildCam datasets. Model-informed adversarial prompts significantly improve performance under distribution shifts for depth estimation tasks, outperforming baselines on Common Corruptions, 3D Common Corruptions, and cross-dataset shifts. The effectiveness of both model and target distribution feedback mechanisms is validated on different tasks (image classification, depth estimation), architectures (convolutional and transformer), and datasets exhibiting distribution shifts. The method is currently limited by the potential for label shift in certain scenarios, such as changes in label distribution due to domain shifts. The computational cost of backpropagation through the diffusion model's denoising process can be demanding, presenting a limitation for scalability. data augmentation, data generation, diffusion models, distribution shift, adversarial training
2403.15249 Report Spectral Motion Alignment for Video Motion Transfer using Diffusion Models Geon Yeong Park, Hyeonho Jeong, Sang Wan Lee, Jong Chul Ye The evolution of diffusion models has greatly impacted video generation and understanding. Particularly, text-to-video diffusion models (VDMs) have significantly facilitated the customization of input video with target appearance, motion, etc. Despite these advances, challenges persist in accurately distilling motion information from video frames. While existing works leverage the consecutive frame residual as the target motion vector, they inherently lack global motion context and are vulnerable to frame-wise distortions. To address this, we present Spectral Motion Alignment (SMA), a novel framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics, and mitigating spatial artifacts. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks. This paper presents Spectral Motion Alignment (SMA), a frequency-domain motion vector refinement and alignment framework for improved motion transfer in videos using diffusion models. Current video motion transfer methods, which rely on latent frame residuals as motion vectors, lack global motion context and are susceptible to frame-wise distortions, leading to inaccurate motion transfer. SMA utilizes Fourier and wavelet transforms to refine and align motion vectors. It uses a wavelet-based global motion alignment loss to capture whole-frame motion dynamics and a Fourier-based local motion refinement loss to mitigate spatial artifacts, prioritizing low-frequency components. SMA significantly improves motion accuracy in video motion transfer tasks, accurately distinguishing dynamic and static elements. SMA is computationally efficient and universally compatible with various video motion transfer frameworks, including text-to-video and text-to-image diffusion-based methods. Quantitative and qualitative evaluations demonstrate SMA's superiority over baselines like MotionDirector, VMC, Tune-A-Video, and ControlVideo across diverse motion patterns and subjects. The selection of wavelet levels and frequency weighting parameters in SMA currently relies on empirical observations. Future work includes exploring the application of SMA to more complex video editing tasks beyond motion transfer, such as motion interpolation and video generation. diffusion models, video motion transfer, wavelet transform, fourier transform, frequency-domain analysis
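A simplified, assumption-laden sketch of a frequency-domain motion-alignment loss in the spirit of SMA: consecutive-frame residuals (motion vectors) of the source and generated sequences are compared in the Fourier domain with extra weight on low frequencies. The single FFT term below stands in for the paper's combined wavelet and Fourier losses; the weighting is an assumption.

```python
# Sketch: Fourier-domain alignment of frame residuals with a low-frequency boost.
import torch

def spectral_motion_loss(src_frames, gen_frames, low_freq_weight=2.0):
    """src_frames, gen_frames: (T, C, H, W) latent frame sequences."""
    src_res = src_frames[1:] - src_frames[:-1]          # consecutive-frame residuals
    gen_res = gen_frames[1:] - gen_frames[:-1]
    S = torch.fft.fftshift(torch.fft.fft2(src_res), dim=(-2, -1))
    G = torch.fft.fftshift(torch.fft.fft2(gen_res), dim=(-2, -1))
    H, W = S.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    radius = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    weight = 1.0 + (low_freq_weight - 1.0) * (radius < min(H, W) / 4)  # boost low freqs
    return (weight * (S - G).abs() ** 2).mean()
```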
2403.15234 Report Shadow Generation for Composite Image Using Diffusion model Qingyang Liu, Junqi You, Jianting Wang, Xinhao Tao, Bo Zhang, Li Niu In the realm of image composition, generating realistic shadow for the inserted foreground remains a formidable challenge. Previous works have developed image-to-image translation models which are trained on paired training data. However, they are struggling to generate shadows with accurate shapes and intensities, hindered by data scarcity and inherent task complexity. In this paper, we resort to foundation model with rich prior knowledge of natural shadow images. Specifically, we first adapt ControlNet to our task and then propose intensity modulation modules to improve the shadow intensity. Moreover, we extend the small-scale DESOBA dataset to DESOBAv2 using a novel data acquisition pipeline. Experimental results on both DESOBA and DESOBAv2 datasets as well as real composite images demonstrate the superior capability of our model for shadow generation task. The dataset, code, and model are released at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2. This paper introduces DESOBAv2, a large-scale shadow generation dataset, and proposes SGDiffusion, a novel diffusion-based method for generating plausible shadows for composite foregrounds. Generating realistic shadows for inserted foregrounds in image composition is crucial for realism but challenging due to complex lighting and geometry. Existing methods struggle with data scarcity and generating accurate shadows. The authors first create DESOBAv2 by using object-shadow detection and inpainting to automatically generate composite images without foreground shadows and their corresponding ground-truth images with shadows. They then develop SGDiffusion, which adapts ControlNet by adding an intensity encoder to modulate shadow darkness based on background shadows. They also introduce weighted noise loss to focus on the shadow region and employ post-processing to refine the generated image. SGDiffusion outperforms previous state-of-the-art methods on both DESOBAv2 and real composite images, exhibiting superior performance in generating realistic shadows with accurate shapes, locations, and intensities. Ablation studies demonstrate the effectiveness of each component in SGDiffusion, including weighted noise loss, intensity modulation, and post-processing. Subjective evaluation using human raters further validates the superiority of SGDiffusion in producing realistic shadow effects. The reliance on object-shadow detection in dataset construction might introduce bias from the detector's limitations. Future work can explore incorporating object material and lighting information for more accurate shadow generation. shadow generation, image composition, diffusion models, dataset, deep learning
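The weighted noise loss described for SGDiffusion can be sketched as a reweighted epsilon-prediction objective that emphasizes the shadow region; the weight value is an assumption.

```python
# Sketch: standard diffusion noise-prediction loss with higher weight inside the
# (predicted or annotated) shadow region.
import torch

def weighted_noise_loss(eps_pred, eps, shadow_mask, shadow_weight=5.0):
    """eps_pred, eps: (B, C, H, W); shadow_mask: (B, 1, H, W), 1 inside shadow areas."""
    weight = 1.0 + (shadow_weight - 1.0) * shadow_mask
    return (weight * (eps_pred - eps) ** 2).mean()
```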
2403.15059 Report MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration Zhichao Wei, Qingkun Su, Long Qin, Weizhi Wang Recent advances in tuning-free personalized image generation based on diffusion models are impressive. However, to improve subject fidelity, existing methods either retrain the diffusion model or infuse it with dense visual embeddings, both of which suffer from poor generalization and efficiency. Also, these methods falter in multi-subject image generation due to the unconstrained cross-attention mechanism. In this paper, we propose MM-Diff, a unified and tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. Specifically, to simultaneously enhance text consistency and subject fidelity, MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. CLS embeddings are used on the one hand to augment the text embeddings, and on the other hand together with patch embeddings to derive a small number of detail-rich subject embeddings, both of which are efficiently integrated into the diffusion model through the well-designed multimodal cross-attention mechanism. Additionally, MM-Diff introduces cross-attention map constraints during the training phase, ensuring flexible multi-subject image sampling during inference without any predefined inputs (e.g., layout). Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods. MM-Diff is a tuning-free image personalization framework that enables fast generation of high-fidelity single and multi-subject images using vision-augmented text embeddings and detail-rich subject embeddings. Existing personalized image generation methods struggle with slow generation speed, poor generalization, and the attribute binding issue in multi-subject scenarios. MM-Diff leverages a vision encoder to extract subject features, employs a Subject Embedding Refiner to enhance these features, and integrates them into a diffusion model through LoRA layers. Cross-attention map constraints are introduced during training to address attribute binding in multi-subject generation. MM-Diff achieves superior subject fidelity compared to other state-of-the-art tuning-free methods on single-subject generation. It achieves high face similarity scores for both single and multi-subject portrait generation. The proposed cross-attention map constraints effectively mitigate attribute binding in multi-subject generation. The training dataset size is relatively limited compared to some top-tier methods. The dataset for general subject generation only contains one subject per image, limiting multi-subject generation capabilities. Future work could focus on using larger and more diverse datasets. image personalization, subject fidelity, multi-subject generation, diffusion models, cross-attention
2403.15019 Report BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation Jiahao Lu, Jiacheng Deng, Tianzhu Zhang 3D instance segmentation (3DIS) is a crucial task, but point-level annotations are tedious in fully supervised settings. Thus, using bounding boxes (bboxes) as annotations has shown great potential. The current mainstream approach is a two-step process, involving the generation of pseudo-labels from box annotations and the training of a 3DIS network with the pseudo-labels. However, due to the presence of intersections among bboxes, not every point has a determined instance label, especially in overlapping areas. To generate higher quality pseudo-labels and achieve more precise weakly supervised 3DIS results, we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation (BSNet), which devises a novel pseudo-labeler called Simulation-assisted Transformer. The labeler consists of two main components. The first is Simulation-assisted Mean Teacher, which introduces Mean Teacher for the first time in this task and constructs simulated samples to assist the labeler in acquiring prior knowledge about overlapping areas. To better model local-global structure, we also propose Local-Global Aware Attention as the decoder for teacher and student labelers. Extensive experiments conducted on the ScanNetV2 and S3DIS datasets verify the superiority of our designs. Code is available at https://github.com/peoplelu/BSNet. This paper proposes BSNet, a weakly supervised 3D instance segmentation method that uses bounding boxes as annotations. It features a novel pseudo-labeler called SAFormer. Point-level annotations for 3D instance segmentation are tedious. BSNet addresses this by using easier-to-obtain bounding box annotations while achieving high accuracy. BSNet generates pseudo-labels using SAFormer, which leverages a Simulation-assisted Mean Teacher (SMT) strategy and a Local-Global Aware Attention (LGA) decoder. SMT constructs simulated overlapping samples to train the labeler, while LGA effectively models local and global structures within the point cloud. BSNet outperforms previous box-supervised methods on ScanNetV2 and S3DIS benchmarks. The simulated samples and Mean Teacher strategy in SAFormer lead to higher-quality pseudo-labels and faster training. The LGA decoder effectively captures both local and global information, improving pseudo-label accuracy. The simulated overlapping samples may not perfectly represent all real-world scenarios. Future work could explore extending BSNet to other weakly supervised 3D vision tasks. 3d instance segmentation, weakly supervised learning, bounding box supervision, mean teacher, transformer
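A minimal sketch of the Mean Teacher component in SAFormer: the teacher labeler is an exponential moving average (EMA) of the student and supervises it on the simulated overlapping samples. The momentum value is an assumption.

```python
# Sketch: EMA teacher update, the core of any Mean Teacher scheme.
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

# Typical usage: teacher = copy.deepcopy(student); after every student optimizer
# step, call ema_update(teacher, student) and use the teacher's outputs as targets.
```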
2403.15009 Report TexRO: Generating Delicate Textures of 3D Models by Recursive Optimization Jinbo Wu, Xing Liu, Chenming Wu, Xiaobo Gao, Jialun Liu, Xinqi Liu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang This paper presents TexRO, a novel method for generating delicate textures of a known 3D mesh by optimizing its UV texture. The key contributions are two-fold. We propose an optimal viewpoint selection strategy, that finds the most miniature set of viewpoints covering all the faces of a mesh. Our viewpoint selection strategy guarantees the completeness of a generated result. We propose a recursive optimization pipeline that optimizes a UV texture at increasing resolutions, with an adaptive denoising method that re-uses existing textures for new texture generation. Through extensive experimentation, we demonstrate the superior performance of TexRO in terms of texture quality, detail preservation, visual consistency, and, notably runtime speed, outperforming other current methods. The broad applicability of TexRO is further confirmed through its successful use on diverse 3D models. TexRO is a novel method for generating delicate textures of a known 3D mesh by optimizing its UV texture using recursive optimization at increasing resolutions with an adaptive denoising strategy and optimal viewpoint selection. Controllable creation of detailed and delicate textures for 3D models remains challenging while existing methods suffer from limitations such as blurry results, lengthy optimization times, and the inability to maintain multi-view consistency. TexRO uses an optimal viewpoint selection strategy based on a heuristic greedy strategy to find the smallest set of views covering all faces of a mesh. Then, it recursively optimizes the UV texture at increasing resolutions in RGB space with an adaptive denoising strategy that re-uses existing textures to generate new textures by adaptively injecting noise. TexRO outperforms state-of-the-art methods in terms of texture quality, detail preservation, and visual consistency. TexRO achieves significantly faster texture generation, completing it in approximately 1 minute. Experiments on widely-used 3D datasets and user studies validate the effectiveness and efficiency of TexRO. TexRO requires water-tight input meshes, limiting its application to non-water-tight meshes. Texturing areas within meshes with complex topologies can be challenging for TexRO. texture generation, multi-view diffusion, recursive optimization, adaptive denoising, 3d model texturing
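The optimal viewpoint selection in TexRO is described as a heuristic greedy strategy; a plain greedy set-cover sketch over precomputed per-view face visibility (the visibility computation is assumed to come from a rasterizer) looks like this.

```python
# Sketch: greedily pick the candidate view that reveals the most still-uncovered
# mesh faces until every face is covered (or no candidate adds coverage).
def select_viewpoints(visible_faces_per_view, num_faces):
    """visible_faces_per_view: list of sets of face indices visible from each candidate view."""
    uncovered = set(range(num_faces))
    chosen = []
    while uncovered:
        best = max(range(len(visible_faces_per_view)),
                   key=lambda v: len(visible_faces_per_view[v] & uncovered))
        gain = visible_faces_per_view[best] & uncovered
        if not gain:            # remaining faces are not visible from any candidate
            break
        chosen.append(best)
        uncovered -= gain
    return chosen
```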
2403.14966 Report DreamFlow: High-Quality Text-to-3D Generation by Approximating Probability Flow Kyungmin Lee, Kihyuk Sohn, Jinwoo Shin Recent progress in text-to-3D generation has been achieved through the utilization of score distillation methods: they make use of the pre-trained text-to-image (T2I) diffusion models by distilling via the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance the text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to-3D optimization as a multi-view image-to-image translation problem, and propose a solution by approximating the probability flow. By leveraging the proposed novel optimization algorithm, we design DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality and high-resolution (i.e., 1024x1024) 3D contents. For example, we demonstrate that DreamFlow is 5 times faster than the existing state-of-the-art text-to-3D method, while producing more photorealistic 3D contents. Visit our project page (https://kyungmnlee.github.io/dreamflow.github.io/) for visualizations. This paper proposes DreamFlow, a text-to-3D generation method that leverages the generative process of text-to-image diffusion models by approximating the reverse generative probability flow, leading to faster optimization and high-quality results. Existing score distillation methods for text-to-3D generation suffer from high-variance gradients, requiring lengthy optimization and limiting scalability to high-resolution 3D content. The method interprets text-to-3D optimization as a multi-view image-to-image translation problem and uses a novel optimization algorithm based on approximate probability flow ODE (APFO) with a predetermined timestep schedule to transport multi-view images to the data distribution learned by a pre-trained diffusion model. A three-stage coarse-to-fine optimization framework generates NeRF, extracts and fine-tunes a 3D mesh, and refines the mesh with a high-resolution diffusion prior. DreamFlow produces more photorealistic 3D content compared to existing methods like DreamFusion, Magic3D, and ProlificDreamer, as demonstrated by human preference studies. DreamFlow achieves better CLIP R-precision scores than prior methods in both NeRF generation and 3D mesh fine-tuning. DreamFlow is significantly faster (5x) than ProlificDreamer in generating 3D content. The reliance on pre-trained diffusion priors without 3D understanding may limit results in some cases. Unwanted biases from the pre-trained diffusion model might be inherited. text-to-3d generation, diffusion models, probability flow ode, neural radiance fields, 3d mesh
2403.14944 Report CLIP-VQDiffusion : Langauge Free Training of Text To Image generation using CLIP and vector quantized diffusion model Seungdae Han, Joohee Kim There has been significant progress in text conditional image generation models. Recent advancements in this field depend not only on improvements in model structures, but also on vast quantities of text-image paired datasets. However, creating these kinds of datasets is very costly and requires a substantial amount of labor. Famous face datasets don't have corresponding text captions, making it difficult to develop text conditional image generation models on these datasets. Some research has focused on developing text to image generation models using only images without text captions. Here, we propose CLIP-VQDiffusion, which leverages the pretrained CLIP model to provide multimodal text-image representations and strong image generation capabilities. On the FFHQ dataset, our model outperformed previous state-of-the-art methods by 4.4% in clipscore and generated very realistic images even when the text was both in and out of distribution. The pretrained models and codes will soon be available at https://github.com/INFINIQ-AI1/CLIPVQDiffusion. This paper proposes CLIP-VQDiffusion, a novel text-conditional image generation model that leverages CLIP for multimodal representations and a vector quantized diffusion model for image generation, enabling language-free training using only image datasets. Creating large text-image paired datasets for training text-conditional image generation models is expensive and laborious. This approach addresses the challenge by utilizing CLIP's ability to connect visual and textual information without requiring paired data. The method involves pretraining a VQ-GAN to learn a codebook for image quantization. A diffusion model is then trained to denoise noisy latent codes conditioned on CLIP image embeddings. During inference, text prompts are transformed into CLIP text embeddings to guide image generation. CLIP-VQDiffusion outperforms previous state-of-the-art language-free methods on FFHQ by 4.4% in CLIP score. The model generates highly realistic and text-aligned images on both FFHQ and COCO datasets. Gaussian noise injection to CLIP image embeddings during training proves crucial for bridging the gap between image and text embeddings. The model exhibits a trade-off between image fidelity (FID) and diversity (IS) when varying guidance scale and truncation ratio. Further investigation into mitigating this trade-off and improving performance on datasets like COCO is needed. text-to-image generation, language-free training, clip, vq-diffusion, multimodal learning
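A hedged sketch of the language-free conditioning trick: Gaussian noise is injected into the normalized CLIP image embedding during training so that the CLIP text embedding can be swapped in at inference. The noise scale is an assumption.

```python
# Sketch: perturb and renormalize the CLIP image embedding used as the condition.
import torch
import torch.nn.functional as F

def perturb_clip_embedding(image_emb, noise_scale=0.75):
    """image_emb: (B, D) CLIP image embeddings used to condition the diffusion model."""
    emb = F.normalize(image_emb, dim=-1)
    emb = emb + noise_scale * torch.randn_like(emb)
    return F.normalize(emb, dim=-1)   # stays on the unit hypersphere, like text embeddings
```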
2403.14939 Report STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, Yao Yao Recent progress in pre-trained diffusion models and 3D generation have spurred interest in 4D content creation. However, achieving high-fidelity 4D generation with spatial-temporal consistency remains a challenge. In this work, we propose STAG4D, a novel framework that combines pre-trained diffusion models with dynamic 3D Gaussian splatting for high-fidelity 4D generation. Drawing inspiration from 3D generation techniques, we utilize a multi-view diffusion model to initialize multi-view images anchoring on the input video frames, where the video can be either real-world captured or generated by a video diffusion model. To ensure the temporal consistency of the multi-view sequence initialization, we introduce a simple yet effective fusion strategy to leverage the first frame as a temporal anchor in the self-attention computation. With the almost consistent multi-view sequences, we then apply the score distillation sampling to optimize the 4D Gaussian point cloud. The 4D Gaussian splatting is specially crafted for the generation task, where an adaptive densification strategy is proposed to mitigate the unstable Gaussian gradient for robust optimization. Notably, the proposed pipeline does not require any pre-training or fine-tuning of diffusion networks, offering a more accessible and practical solution for the 4D generation task. Extensive experiments demonstrate that our method outperforms prior 4D generation works in rendering quality, spatial-temporal consistency, and generation robustness, setting a new state-of-the-art for 4D generation from diverse inputs, including text, image, and video. This paper introduces STAG4D, a novel framework leveraging pre-trained diffusion models and dynamic 3D Gaussian splatting for high-fidelity 4D generation. High-quality 4D content generation is crucial for various applications, but existing methods face challenges in rendering quality, spatial-temporal consistency, and generation speed. The method uses a multi-view diffusion model to generate consistent multi-view image sequences. It then employs score distillation sampling to optimize a 4D Gaussian point cloud representation of the dynamic scene, aided by an adaptive densification strategy. STAG4D achieves state-of-the-art results in 4D generation from text, image, and video inputs, demonstrating superior rendering quality and spatial-temporal consistency. The method exhibits robustness and generalizability across diverse dynamic scenes. Adaptive densification based on Gaussian gradient distribution proves effective for robust 4D Gaussian optimization. The approach may be limited in handling complex, fast motions due to constraints of 4D Gaussian representation. Inherent video limitations, such as blurriness, can impact diffusion effectiveness and subsequent 4D optimization. 4d generation, 3d gaussian splatting, diffusion models, spatial-temporal consistency, adaptive densification
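A minimal, single-head sketch of the temporal-anchor fusion described above, in which the first frame's tokens are appended to the key/value set of every frame's self-attention; the shapes and single projection matrices are simplifications, not the paper's exact multi-view attention.

```python
# Hedged sketch: the first (anchor) frame's tokens join the key/value set so that
# appearance stays consistent across the generated multi-view sequence.
import torch
import torch.nn.functional as F

def anchored_self_attention(x_cur, x_anchor, w_q, w_k, w_v):
    # x_cur, x_anchor: (tokens, dim) features of the current frame and the first frame
    q = x_cur @ w_q
    kv_src = torch.cat([x_cur, x_anchor], dim=0)            # anchor tokens join K/V
    k, v = kv_src @ w_k, kv_src @ w_v
    attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```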
2403.14870 Report VidLA: Video-Language Alignment at Scale Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets which only contain short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks. This paper introduces VidLA, a new approach for video-language alignment that scales effectively by utilizing large language models (LLMs) for data curation and a novel hierarchical temporal attention mechanism. Video-language alignment is challenging due to the difficulty in gathering large, semantically aligned datasets and the complexity of capturing temporal dependencies in videos. Existing methods often struggle with these limitations, hindering performance. The authors curate YT-VidLA-800M, a massive video-text dataset, by using LLMs to generate captions and summarize texts for better visual grounding. They design a hierarchical temporal attention mechanism that models both local and global temporal relationships in videos, enabling the use of pretrained image-text encoders. VidLA surpasses state-of-the-art methods on multiple video-text retrieval benchmarks, particularly with longer videos. The hierarchical temporal attention mechanism effectively captures temporal dependencies at different scales, significantly improving performance. The data curation techniques, including caption generation and text summarization using LLMs, prove crucial for enhancing video-language alignment. The model's performance could be further enhanced by exploring more advanced LLM architectures and data curation strategies. Investigating the effectiveness of VidLA on a broader range of downstream tasks, beyond retrieval and classification, would provide a more comprehensive evaluation. video-language alignment, large language models, hierarchical temporal attention, video-text retrieval, data curation
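A rough sketch of the "data tokens at different temporal resolutions" idea summarized above: per-frame patch features are pooled over temporal windows of increasing size to produce hierarchy tokens that summarize short- and long-range context. The window sizes and mean pooling are assumptions, not VidLA's exact configuration.

```python
# Hedged sketch of hierarchical temporal tokens; assumes T is at least as large as
# the biggest window (frames beyond a multiple of the window are dropped).
import torch

def hierarchical_temporal_tokens(frame_feats, windows=(1, 4, 16)):
    # frame_feats: (T, N, D) = frames x patch tokens x feature dim
    T, N, D = frame_feats.shape
    tokens = []
    for w in windows:
        chunks = frame_feats[: (T // w) * w].reshape(T // w, w, N, D)
        tokens.append(chunks.mean(dim=(1, 2)))   # one summary token per temporal window
    return torch.cat(tokens, dim=0)              # hierarchy tokens at several time scales
```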
2403.14828 Report Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs. This paper introduces Textual-inverted Multimodal Garment Designer (Ti-MGD), a novel approach for multimodal-conditioned fashion image editing that leverages latent diffusion models to generate human-centric fashion images guided by text, human body poses, garment sketches, and fabric textures. This task is important as it enables fashion designers to empower the design of new fashion items, facilitating the exploration of the interplay between their sketches, available fabric textures, and diverse human body shapes. The authors propose a denoising network that takes multimodal prompts as input. They incorporate fabric textures by employing textual inversion techniques and designing a novel component to project texture images into the textual space of the diffusion model. Different cross-attention layers of the denoising network then attend to textual and texture information to incorporate different granularity conditioning details. The authors also define a semi-automatic framework for extending existing fashion datasets with multimodal data and introduce three novel evaluation metrics. Ti-MGD outperforms state-of-the-art competitors in terms of realism and coherence concerning the provided multimodal inputs. The authors demonstrate the ability of the model to handle multiple conditions in a distinct manner efficiently. The proposed approach enables fine-grained control over the generated images without adding denoising network parameters. The model may not fully capture body shape nuances solely from keypoints, necessitating exploration of dense or 3D pose representations. Texture conditioning might be limited when a sketch comprises distinct areas, prompting future research on spatial control for texture generation. fashion product design, latent diffusion models, textual inversion, generative ai, multimodal learning
2403.14781 Report Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, Siyu Zhu In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion guidance in current human generative techniques. The methodology utilizes the SMPL (Skinned Multi-Person Linear) model as the 3D human parametric model to establish a unified representation of body shape and pose. This facilitates the accurate capture of intricate human geometry and motion characteristics from source videos. Specifically, we incorporate rendered depth images, normal maps, and semantic maps obtained from SMPL sequences, alongside skeleton-based motion guidance, to enrich the conditions to the latent diffusion model with comprehensive 3D shape and detailed pose attributes. A multi-layer motion fusion module, integrating self-attention mechanisms, is employed to fuse the shape and motion latent representations in the spatial domain. By representing the 3D human parametric model as the motion guidance, we can perform parametric shape alignment of the human body between the reference image and the source video motion. Experimental evaluations conducted on benchmark datasets demonstrate the methodology's superior ability to generate high-quality human animations that accurately capture both pose and shape variations. Furthermore, our approach also exhibits superior generalization capabilities on the proposed wild dataset. Project page: https://fudan-generative-vision.github.io/champ. This paper proposes Champ, a novel approach for human image animation that leverages the SMPL 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion guidance. Current human image animation techniques using reference images and pose guidance often struggle with accurate pose alignment and motion guidance, especially when there are significant variations in body shape and intricate movements. Champ utilizes the SMPL model to establish a unified representation of body shape and pose, enabling parametric shape alignment. It renders depth, normal, and semantic maps from SMPL sequences, along with skeleton-based motion guidance, to enrich the conditions for the latent diffusion model. A multi-layer motion fusion module with self-attention mechanisms fuses shape and motion latent representations, guiding the generation of high-quality human animation videos. Champ outperforms state-of-the-art methods on benchmark datasets like TikTok, demonstrating superior performance in quantitative metrics and qualitative results. The method exhibits strong generalization capabilities, effectively animating images from unseen domains with variations in shape, pose, and appearance. Ablation studies confirm the contribution of each component, highlighting the importance of the SMPL model, multi-layer guidance, and self-attention mechanisms. The modeling capacity of the SMPL model for faces and hands is limited, requiring additional constraints for optimal animation in those areas. Solving SMPL and DWpose independently introduces a potential discrepancy in consistency, which could be addressed in future work. human image animation, latent diffusion model, 3d human parametric model, smpl, motion guidance
2403.14773 Report StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video generation (typically 16 or 24 frames), ending up with hard-cuts when naively extended to the case of long video synthesis. To overcome these limitations, we introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions. The key components are: (i) a short-term memory block called conditional attention module (CAM), which conditions the current generation on the features extracted from the previous chunk via an attentional mechanism, leading to consistent chunk transitions, (ii) a long-term memory block called appearance preservation module, which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene, and (iii) a randomized blending approach that enables applying a video enhancer autoregressively for infinitely long videos without inconsistencies between chunks. Experiments show that StreamingT2V generates videos with a high amount of motion. In contrast, all competing image-to-video methods are prone to video stagnation when applied naively in an autoregressive manner. Thus, with StreamingT2V we propose a high-quality, seamless text-to-long video generator that outperforms competitors in consistency and motion. Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V Introduces StreamingT2V, an autoregressive text-to-video diffusion model for generating long, consistent videos with rich motion dynamics. Existing text-to-video models struggle to create long, seamless videos and often suffer from stagnation or inconsistencies when extended temporally. Combines a short-term memory module (CAM) for smooth chunk transitions and a long-term memory module (APM) for preserving object/scene features across generations. It also employs a randomized blending approach for enhancing long videos without chunk inconsistencies. Generates consistent long videos with significantly higher motion amount compared to baselines. Successfully preserves object and scene details across long video generations, unlike many existing methods. Demonstrates superior performance in user studies regarding motion quality, text alignment, and overall video quality. The model relies on a pre-trained text-to-video model and its performance is limited by the base model's capabilities. Training data for long, high-quality videos is limited, potentially impacting the model's ability to generate diverse and complex scenes over extended periods. Future work will explore training with large-scale datasets. text-to-video generation, diffusion models, long video synthesis, temporal consistency, appearance preservation
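A structural sketch of the autoregressive scheme summarized above: each new chunk is conditioned on features of the previous chunk (short-term, CAM-like) and on anchor features of the first chunk (long-term, APM-like). The callables `generate_chunk` and `extract_features` are hypothetical placeholders, not the released API.

```python
# Hedged structural sketch of chunked long-video generation with short- and
# long-term memory conditioning.
def generate_long_video(text, n_chunks, generate_chunk, extract_features):
    first = generate_chunk(text, short_term=None, long_term=None)
    video, long_term = [first], extract_features(first)      # anchor scene/object features
    for _ in range(n_chunks - 1):
        short_term = extract_features(video[-1][-8:])         # last frames of previous chunk
        chunk = generate_chunk(text, short_term=short_term, long_term=long_term)
        video.append(chunk)
    return video
```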
2403.14760 Report Can 3D Vision-Language Models Truly Understand Natural Language? Weipeng Deng, Runyu Ding, Jihan Yang, Jiahui Liu, Yijiang Li, Xiaojuan Qi, Edith Ngai Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues for human interaction with embodied agents or robots using natural language. Despite this progress, we find a notable limitation: existing 3D-VL models exhibit sensitivity to the styles of language input, struggling to understand sentences with the same semantic meaning but written in different variants. This observation raises a critical question: Can 3D vision-language models truly understand natural language? To test the language understandability of 3D-VL models, we first propose a language robustness task for systematically assessing 3D-VL models across various tasks, benchmarking their performance when presented with different language style variants. Importantly, these variants are commonly encountered in applications requiring direct interaction with humans, such as embodied robotics, given the diversity and unpredictability of human language. We propose a 3D Language Robustness Dataset, designed based on the characteristics of human language, to facilitate the systematic study of robustness. Our comprehensive evaluation uncovers a significant drop in the performance of all existing models across various 3D-VL tasks. Even the state-of-the-art 3D-LLM fails to understand some variants of the same sentences. Further in-depth analysis suggests that the existing models have a fragile and biased fusion module, which stems from the low diversity of the existing dataset. Finally, we propose a training-free module driven by LLM, which improves language robustness. Datasets and code will be available at github. This paper introduces a new benchmark and dataset, called 3D Language Robustness (3D-LR), to evaluate how well 3D vision-language models understand natural language variations commonly used in human communication. Existing 3D-VL models often struggle to understand sentences with the same meaning but expressed in different styles, hindering their use in real-world applications like robotics. The authors define five key language characteristics (syntax, voice, modifier, accent, and tone) and use a large language model (LLM) to create paraphrased versions of sentences from existing 3D-VL datasets. They then evaluate various 3D-VL models on these paraphrased datasets. Existing 3D-VL models, even those using LLMs, show significant performance drops (up to 32%) when presented with sentences rephrased using common language variations. The fusion module, responsible for combining visual and language features, is identified as a major point of failure due to its bias towards training data style. A simple LLM-based pre-alignment module is proposed, which improves robustness without retraining and achieves performance comparable to models trained on double the data size. The 3D-LR dataset may not fully capture the entire spectrum of natural language variations. Future work should focus on more efficient data augmentation techniques and model architectures that generalize better to unseen language styles. 3d vision language, language robustness, open world understanding, natural language processing, embodied ai
2403.14623 Report Simplified Diffusion Schrödinger Bridge Zhicong Tang, Tiankai Hang, Shuyang Gu, Dong Chen, Baining Guo This paper introduces a novel theoretical simplification of the Diffusion Schrödinger Bridge (DSB) that facilitates its unification with Score-based Generative Models (SGMs), addressing the limitations of DSB in complex data generation and enabling faster convergence and enhanced performance. By employing SGMs as an initial solution for DSB, our approach capitalizes on the strengths of both frameworks, ensuring a more efficient training process and improving the performance of SGM. We also propose a reparameterization technique that, despite theoretical approximations, practically improves the network's fitting capabilities. Our extensive experimental evaluations confirm the effectiveness of the simplified DSB, demonstrating its significant improvements. We believe the contributions of this work pave the way for advanced generative modeling. The code is available at https://github.com/checkcrab/SDSB. This paper presents Simplified Diffusion Schrödinger Bridge (S-DSB), a novel theoretical simplification of Diffusion Schrödinger Bridge (DSB) that enables its unification with Score-based Generative Models (SGMs) and improves its training efficiency and performance. DSB holds theoretical advantages over SGMs for handling complex data and arbitrary distributions, but its slow convergence and training difficulties hinder practical application. This work bridges this gap and unlocks the potential of DSB in advanced generative modeling. The authors propose a simplified optimization objective for DSB, demonstrating its equivalence to the original formulation while requiring fewer computations. This allows using pretrained SGMs as initialization for DSB, leading to faster convergence. Further, a reparameterization technique inspired by SGMs significantly enhances the network's fitting capabilities. S-DSB, even with random initialization, matches vanilla DSB in performance but with faster training. Using pretrained SGMs as initialization significantly accelerates S-DSB convergence and improves generation quality. The proposed reparameterization method further boosts DSB's performance, surpassing vanilla DSB even with random initialization. The convergence of DSB, even with the proposed improvements, remains computationally intensive, limiting scalability to larger datasets. The reparameterization technique in R-DSB relies on specific assumptions, which might introduce errors in practical scenarios, necessitating further research on improved approximations and error analysis. diffusion schrödinger bridge, score-based generative models, generative modeling, reparameterization, optimal transport
2403.14621 Report GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, Gordon Wetzstein We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s. GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information to translate the input pixels into pixel-aligned Gaussians, which are unprojected to create a set of densely distributed 3D Gaussians representing a scene. Together, our transformer architecture and the use of 3D Gaussians unlock a scalable and efficient reconstruction framework. Extensive experimental results demonstrate the superiority of our method over alternatives regarding both reconstruction quality and efficiency. We also showcase the potential of GRM in generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models. Our project website is at: https://justimyhxu.github.io/projects/grm/. This paper introduces GRM, a novel large-scale, feed-forward 3D reconstruction model based on transformers and 3D Gaussian splatting for efficient and high-fidelity 3D object generation. Existing 3D generative models either suffer from slow optimization processes or inefficient triplane representations. This new method leverages the efficiency of 3D Gaussians and a novel transformer architecture to improve both speed and quality. GRM uses a transformer-based encoder-decoder network to predict pixel-aligned Gaussian attributes from multi-view images. It employs a novel transformer-based upsampler with windowed attention for efficient non-local information aggregation and detail reconstruction. These Gaussians are then splatted to render novel views. GRM significantly outperforms previous state-of-the-art methods in sparse-view 3D reconstruction tasks, achieving higher fidelity with fewer input views. Combined with multi-view diffusion models, GRM achieves state-of-the-art quality and speed for text-to-3D and single image-to-3D object generation. Ablation studies demonstrate the effectiveness of each component, including the transformer-based upsampler, pixel-aligned Gaussian representation, and scale activation function. The current model relies heavily on input view consistency and struggles with hallucination in unseen regions. The framework is limited to object-centric scenes due to the lack of large-scale 3D scene datasets. 3d reconstruction, 3d generation, gaussian splatting, transformers, sparse-view reconstruction
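A minimal sketch of what the "pixel-aligned Gaussians" in the entry above can look like in code: each input pixel predicts a depth and raw attributes, and the Gaussian center is obtained by unprojecting the pixel along its camera ray. The attribute layout and activations are assumptions for illustration, not GRM's exact parameterization.

```python
# Hedged sketch: unproject per-pixel predictions into a set of 3D Gaussians.
import torch

def unproject_pixel_gaussians(depth, ray_o, ray_d, raw_attrs):
    # depth: (H*W, 1); ray_o, ray_d: (H*W, 3) per-pixel ray origins/directions
    centers = ray_o + depth * ray_d                 # 3D means along the camera rays
    scales = torch.exp(raw_attrs[..., 0:3])         # positive scales
    opacity = torch.sigmoid(raw_attrs[..., 3:4])
    color = torch.sigmoid(raw_attrs[..., 4:7])
    return centers, scales, opacity, color
```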
2403.14617 Report Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion Xiang Fan, Anand Bhattad, Ranjay Krishna We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics. Videoshop is a training-free video editing method that allows users to make localized semantic edits by modifying the first frame using any image editing tool and propagating these changes to the rest of the video. Existing video editing methods lack the precision for localized edits, often relying on coarse textual instructions or requiring extensive fine-tuning. The method leverages the near-linear trajectory of video latents during denoising diffusion and introduces: (1) inversion with noise extrapolation for accurate latent reconstruction, and (2) latent normalization and rescaling for consistency and quality. Videoshop enables diverse localized edits like object addition/removal, color changes, semantic edits, and appearance adjustments, while preserving source video fidelity. Outperforms 6 baselines on 2 editing benchmarks, using 10 evaluation metrics, demonstrating superior edit fidelity, source faithfulness, and temporal consistency. User study confirms Videoshop's advantage in editing and video generation quality over text-based methods, with a 2.23x speedup compared to the average baseline. Limitations: Information loss during VAE encoding and potential temporal inconsistency in videos with large movements. Future work: Combining image editing with motion controls for more seamless results and extending the method for 3D mesh editing. video editing, diffusion models, semantic editing, noise extrapolation, latent space manipulation
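A minimal sketch of the latent normalization/rescaling step mentioned in the entry above, matching per-channel statistics of the edited first-frame latent to those of the inverted source latent so the denoising trajectory stays in-distribution; the exact statistics used in the paper may differ.

```python
# Hedged sketch of per-channel latent statistics matching.
import torch

def match_latent_stats(edited_latent, source_latent, eps=1e-6):
    # latents: (C, H, W); rescale each channel to the source mean/std
    mu_e = edited_latent.mean(dim=(1, 2), keepdim=True)
    sd_e = edited_latent.std(dim=(1, 2), keepdim=True)
    mu_s = source_latent.mean(dim=(1, 2), keepdim=True)
    sd_s = source_latent.std(dim=(1, 2), keepdim=True)
    return (edited_latent - mu_e) / (sd_e + eps) * sd_s + mu_s
```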
2403.14614 Report AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, Fahad Shahbaz Khan In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently, all-in-one algorithms have garnered significant attention by addressing different types of degradations within a single model without requiring prior information of the input degradation type. However, these methods purely operate in the spatial domain and do not delve into the distinct frequency variations inherent to different degradation types. To address this gap, we propose an adaptive all-in-one image restoration network based on frequency mining and modulation. Our approach is motivated by the observation that different degradation types impact the image content on different frequency subbands, thereby requiring different treatments for each restoration task. Specifically, we first mine low- and high-frequency information from the input features, guided by the adaptively decoupled spectra of the degraded image. The extracted features are then modulated by a bidirectional operator to facilitate interactions between different frequency components. Finally, the modulated features are merged into the original input for a progressively guided restoration. With this approach, the model achieves adaptive reconstruction by accentuating the informative frequency subbands according to different input degradations. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on different image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement. Our code is available at https://github.com/c-yn/AdaIR. An adaptive all-in-one image restoration framework, called AdaIR, is proposed, which leverages both spatial and frequency domain information to effectively decouple degradations from the desired clean image content. Existing deep learning-based image restoration methods lack generalizability beyond specific degradation types or require training separate models for each task, which is computationally expensive and impractical for deployment on resource-constrained devices. AdaIR is based on a Transformer-based encoder-decoder architecture with Adaptive Frequency Learning Blocks (AFLBs). Each AFLB uses a Frequency Mining Module (FMiM) to extract low- and high-frequency feature maps guided by the adaptively decoupled spectra of the degraded image, and a Frequency Modulation Module (FMoM) to calibrate these features by enabling information exchange across different frequency bands. AdaIR achieves state-of-the-art performance on several all-in-one image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement. Under a three-degradation all-in-one setting (dehazing, deraining, denoising), AdaIR outperforms the recent best method PromptIR by 0.63 dB PSNR. Under a five-degradation all-in-one setting, AdaIR achieves a 1.86 dB gain compared to the recent best method IDR, when averaged across five restoration tasks. The paper only evaluates the method on a limited set of degradation types. Further investigation is needed to explore the effectiveness of the method on real-world images with complex and mixed degradations. image restoration, all-in-one model, frequency analysis, deep learning, transformer
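A minimal sketch of frequency mining with a fixed radial cutoff in the Fourier domain; the paper's decoupling is adaptive and guided by the degraded input's spectrum, so the hard threshold below is only a stand-in for that behavior.

```python
# Hedged sketch: split a feature map into low- and high-frequency components with
# a radial mask in the Fourier domain.
import torch

def split_frequencies(feat, cutoff_ratio=0.25):
    # feat: (B, C, H, W)
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.arange(H, device=feat.device), torch.arange(W, device=feat.device), indexing="ij"
    )
    dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    low_mask = (dist <= cutoff_ratio * min(H, W) / 2).float()     # 1 inside the low band
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    high = torch.fft.ifft2(torch.fft.ifftshift(spec * (1 - low_mask), dim=(-2, -1))).real
    return low, high
```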
2403.14613 Report DreamReward: Text-to-3D Generation with Human Preference Junliang Ye, Fangfu Liu, Qixiu Li, Zhengyi Wang, Yikai Wang, Xinzhou Wang, Yueqi Duan, Jun Zhu 3D content creation from text prompts has shown remarkable success recently. However, current text-to-3D methods often generate 3D results that do not align well with human preferences. In this paper, we present a comprehensive framework, coined DreamReward, to learn and improve text-to-3D models from human preference feedback. To begin with, we collect 25k expert comparisons based on a systematic annotation pipeline including rating and ranking. Then, we build Reward3D -- the first general-purpose text-to-3D human preference reward model to effectively encode human preferences. Building upon the 3D reward model, we finally perform theoretical analysis and present the Reward3D Feedback Learning (DreamFL), a direct tuning algorithm to optimize the multi-view diffusion models with a redefined scorer. Grounded by theoretical proof and extensive experiment comparisons, our DreamReward successfully generates high-fidelity and 3D consistent results with significant boosts in prompt alignment with human intention. Our results demonstrate the great potential for learning from human feedback to improve text-to-3D models. Presents DreamReward, a novel text-to-3D generation framework that leverages human preference feedback (RLHF) to improve the alignment of generated 3D assets with human intentions. Existing text-to-3D methods struggle to generate content that aligns well with human preferences, often resulting in outputs that lack in quality, text-3D alignment, and multi-view consistency. 1. Collects and annotates a diverse 3D dataset with human preference feedback, focusing on text-3D alignment, overall quality, and multi-view consistency. 2. Trains Reward3D, a 3D-aware scoring model, to effectively evaluate the quality of generated 3D content. 3. Introduces DreamFL (Reward3D Feedback Learning), an optimization algorithm that incorporates the Reward3D model to guide the training of multi-view diffusion models towards generating high-quality and human-preferred 3D assets. DreamReward successfully generates 3D assets exhibiting superior text alignment, overall quality, and multi-view consistency compared to existing state-of-the-art methods. Reward3D demonstrates strong alignment with human preferences, making it a promising automatic evaluation metric for text-to-3D generation. DreamFL effectively utilizes the guidance of Reward3D to optimize 3D models, leading to significant improvements in generation quality and human preference alignment. The diversity of generated 3D assets is limited by the size of the annotated dataset. Future work includes expanding the annotated dataset and incorporating more camera and orientation information into the Reward3D architecture. 3d generation, rlhf, human preference, text-to-3d, reward model
2403.14610 Report T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang We present T-Rex2, a highly practical model for open-set object detection. Previous open-set object detection methods relying on text prompts effectively encapsulate the abstract concept of common objects, but struggle with rare or complex object representation due to data scarcity and descriptive limitations. Conversely, visual prompts excel in depicting novel objects through concrete visual examples, but fall short in conveying the abstract concept of objects as effectively as text prompts. Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning. T-Rex2 accepts inputs in diverse formats, including text prompts, visual prompts, and the combination of both, so that it can handle different scenarios by switching between the two prompt modalities. Comprehensive experiments demonstrate that T-Rex2 exhibits remarkable zero-shot object detection capabilities across a wide spectrum of scenarios. We show that text prompts and visual prompts can benefit from each other within the synergy, which is essential to cover massive and complicated real-world scenarios and pave the way towards generic object detection. Model API is now available at \url{https://github.com/IDEA-Research/T-Rex}. The paper introduces T-Rex2, a novel open-set object detection model that unifies text and visual prompts within a single framework, enabling both generic and interactive object detection. Open-set object detection, crucial for real-world applications, requires identifying objects beyond pre-defined categories. This work addresses limitations of existing methods relying solely on text or visual prompts by combining their strengths. T-Rex2 utilizes a DETR-like architecture with parallel encoders for text and visual prompts. It employs contrastive learning to align both modalities, enabling them to benefit from each other's strengths. T-Rex2 demonstrates strong zero-shot object detection capabilities, achieving state-of-the-art performance on COCO, LVIS, ODinW, and Roboflow100 benchmarks. The study reveals a complementary relationship between text and visual prompts, with text prompts excelling in common objects and visual prompts proving more effective for rare or hard-to-describe objects. The model's interactive capabilities are highlighted through its impressive performance in few-shot object counting tasks, showing its potential in applications like automatic annotation. While showing promising results, the integration of text and visual prompts requires further refinement, particularly in scenarios with common objects where performance slightly dips. The current method requires up to 16 visual examples for reliable detection, necessitating further research to enhance the efficiency of visual prompts with fewer examples. open-set object detection, text prompts, visual prompts, contrastive learning, interactive object detection
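A minimal sketch of aligning text-prompt and visual-prompt class embeddings with a symmetric InfoNCE-style contrastive loss, which is the general mechanism the entry above describes; the batching scheme and temperature are assumptions rather than the paper's training recipe.

```python
# Hedged sketch: pull together text and visual-exemplar embeddings of the same class.
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(text_emb, visual_emb, temperature=0.07):
    # text_emb, visual_emb: (num_classes, D); row i of both describes the same class
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = t @ v.t() / temperature
    targets = torch.arange(t.shape[0], device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```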
2403.14608 Report Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. Especially, the expansive scale and computational demands pose considerable challenges when customizing them for particular downstream tasks, particularly over the hardware platforms constrained by computational capabilities. Parameter Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapting large models to various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large language models with high parameter counts, as fine-tuning these models from scratch can be computationally expensive and resource-intensive, posing considerable challenges in the supporting system platform design. In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to the algorithmic perspective, we overview various real-world system designs to investigate the implementation costs associated with different PEFT algorithms. This survey serves as an indispensable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications. This paper presents a comprehensive survey of Parameter-Efficient Fine-Tuning (PEFT) methods for large models, encompassing algorithmic designs, computational efficiency considerations, applications, and system implementation challenges. The large size of modern models makes full fine-tuning computationally expensive and resource-intensive. PEFT offers a practical solution by adapting pre-trained models to specific tasks with minimal parameter adjustments, thereby reducing storage, memory, and computation costs. The paper categorizes PEFT algorithms into four types: additive, selective, reparameterized, and hybrid. It discusses their mechanisms, advantages, limitations, and notable variations. Additionally, it explores strategies for enhancing PEFT efficiency, such as pruning, quantization, and memory optimization techniques. The effectiveness of various PEFT methods can differ significantly across different tasks. Pruning and quantization can substantially enhance the efficiency of PEFT methods. Memory-efficient PEFT methods are crucial for reducing the memory overhead during training. Lack of a unified benchmark for fair comparison of PEFT approaches. Need for improved training efficiency and simplified hyperparameter tuning in PEFT methods. large language models, parameter-efficient fine-tuning, pruning, quantization, memory optimization
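As a concrete instance of the "reparameterized" PEFT family such surveys cover, a generic LoRA-style adapter is sketched below: the frozen base weight is augmented with a trainable low-rank update, so only r*(d_in+d_out) parameters are trained. This is a textbook illustration, not code from the survey.

```python
# Hedged sketch of a LoRA-style reparameterized adapter around a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())
```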
2403.14554 Report Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering Antoine Guédon, Vincent Lepetit We propose Gaussian Frosting, a novel mesh-based representation for high-quality rendering and editing of complex 3D effects in real-time. Our approach builds on the recent 3D Gaussian Splatting framework, which optimizes a set of 3D Gaussians to approximate a radiance field from images. We propose first extracting a base mesh from Gaussians during optimization, then building and refining an adaptive layer of Gaussians with a variable thickness around the mesh to better capture the fine details and volumetric effects near the surface, such as hair or grass. We call this layer Gaussian Frosting, as it resembles a coating of frosting on a cake. The fuzzier the material, the thicker the frosting. We also introduce a parameterization of the Gaussians to enforce them to stay inside the frosting layer and automatically adjust their parameters when deforming, rescaling, editing or animating the mesh. Our representation allows for efficient rendering using Gaussian splatting, as well as editing and animation by modifying the base mesh. We demonstrate the effectiveness of our method on various synthetic and real scenes, and show that it outperforms existing surface-based approaches. We will release our code and a web-based viewer as additional contributions. Our project page is the following: https://anttwo.github.io/frosting/ Gaussian Frosting, a novel mesh-based representation for high-quality rendering and editing of complex 3D effects in real-time, building upon the 3D Gaussian Splatting framework. Enables both efficient rendering using Gaussian splatting and easy editing and animation by modifying the base mesh, surpassing previous methods in quality and/or efficiency. Extracts a base mesh from optimized Gaussians, builds an adaptive layer of Gaussians (Frosting) with variable thickness around the mesh based on Gaussian density, and parameterizes Gaussians to stay within the layer during mesh deformation. Outperforms existing surface-based and even some non-editable volumetric methods in rendering quality on challenging datasets like Shelly and Mip-NeRF 360. Allows for efficient real-time rendering and editing due to its hybrid representation. Enables seamless animation by automatically adjusting Gaussian parameters based on mesh deformation. Current implementation uses a simple, piecewise linear deformation model. Models are larger than vanilla Gaussian Splatting due to the inclusion of barycentric coordinates and mesh vertices. gaussian splatting, mesh, differentiable rendering, 3d reconstruction, image-based rendering
2403.14530 Report HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression Yihang Chen, Qianyi Wu, Jianfei Cai, Mehrtash Harandi, Weiyao Lin 3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To address this, we make use of the relations between the unorganized anchors and the structured hash grid, leveraging their mutual information for context modeling, and propose a Hash-grid Assisted Context (HAC) framework for highly compact 3DGS representation. Our approach introduces a binary hash grid to establish continuous spatial consistencies, allowing us to unveil the inherent spatial relations of anchors through a carefully designed context model. To facilitate entropy coding, we utilize Gaussian distributions to accurately estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Additionally, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Importantly, our work is the pioneer to explore context-based compression for 3DGS representation, resulting in a remarkable size reduction of over $75\times$ compared to vanilla 3DGS, while simultaneously improving fidelity, and achieving over $11\times$ size reduction over SOTA 3DGS compression approach Scaffold-GS. Our code is available here: https://github.com/YihangChen-ee/HAC This paper proposes HAC, a Hash-grid Assisted Context framework for highly compact 3D Gaussian Splatting (3DGS) representation by exploiting spatial consistencies among unorganized 3D Gaussians. 3DGS offers fast and high-fidelity novel view synthesis but requires substantial storage space for storing Gaussian attributes, necessitating effective compression techniques. HAC leverages a structured hash grid to model the context of anchor attributes in Scaffold-GS, predicting their value distributions for efficient entropy coding. It also incorporates an adaptive quantization module and a masking strategy for enhanced compression. HAC achieves a remarkable size reduction of over 75x compared to vanilla 3DGS while improving fidelity. It outperforms SOTA 3DGS compression approaches like Scaffold-GS by achieving over 11x size reduction. The proposed context modeling and adaptive components are shown to effectively improve rate-distortion performance. Integrating additional models in HAC increases training time compared to Scaffold-GS. Future work could explore faster entropy coding algorithms on CPU or GPU for reduced encoding/decoding time. 3d gaussian splatting, compression, context models, novel view synthesis, rate-distortion optimization
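A minimal sketch of the rate estimate implied by the entry above: with a context-predicted Gaussian for each quantized attribute, the coding cost is the negative log2 probability mass over that attribute's quantization bin. The unit bin width is an assumption for illustration.

```python
# Hedged sketch: bit-cost estimate for entropy coding quantized attributes under a
# Gaussian whose mean/scale come from the hash-grid context model.
import torch

def estimated_bits(attr_q, mu, sigma, bin_width=1.0):
    dist = torch.distributions.Normal(mu, sigma.clamp_min(1e-6))
    p = dist.cdf(attr_q + bin_width / 2) - dist.cdf(attr_q - bin_width / 2)
    return -torch.log2(p.clamp_min(1e-9)).sum()
```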
2403.14520 Report Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang In recent years, the application of multimodal large language models (MLLM) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are composed of the well-known Transformer network, which has a less efficient quadratic computation complexity. To improve the efficiency of such basic models, we propose Cobra, a linear computational complexity MLLM. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due to Cobra's linear sequential modeling. (2) Interestingly, the results of closed-set challenging prediction benchmarks show that Cobra performs well in overcoming visual illusions and spatial relationship judgments. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all codes of Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLM. Our project page is available at: https://sites.google.com/view/cobravlm. Cobra, a multimodal large language model (MLLM) with linear computational complexity, addressing the inefficiency of quadratic complexity in Transformer-based MLLMs. Existing MLLMs suffer from quadratic computational complexity due to the Transformer architecture, hindering their efficiency and practicality. Cobra integrates the efficient Mamba language model (linear complexity) with visual modality using DINOv2 and SigLIP as encoders and explores various modal fusion schemes. Cobra achieves competitive performance with state-of-the-art efficient MLLMs (LLaVA-Phi, TinyLLaVA, MobileVLM v2) with faster inference speed. Cobra excels in overcoming visual illusions and spatial relationship judgments in closed-set prediction benchmarks. Cobra exhibits comparable performance to the larger LLaVA model with only 43% of its parameters, highlighting its efficiency. Cobra shows weaker performance in text recognition tasks compared to some baselines. Cobra's recurrent dynamics require relatively high numerical precision, limiting memory reduction through quantization. multimodal large language model, mamba, state space model, computation efficiency, vision language model
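A structural sketch of the modal-fusion step summarized above: DINOv2 and SigLIP patch features are concatenated per patch, projected into the language model's embedding space, and prepended to the text tokens. The module names and two-layer MLP are placeholder choices, not Cobra's released architecture.

```python
# Hedged sketch of a vision-to-LLM projector for a Mamba-based multimodal model.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, dino_dim, siglip_dim, llm_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, dino_feats, siglip_feats, text_emb):
        # dino_feats, siglip_feats: (B, N, D_*) patch tokens; text_emb: (B, T, llm_dim)
        vis = self.mlp(torch.cat([dino_feats, siglip_feats], dim=-1))
        return torch.cat([vis, text_emb], dim=1)   # multimodal token sequence for the LLM
```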
2403.14487 Report DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, Shanghang Zhang Recently, how to achieve precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, which include several object layers and one incomplete background layer that necessitates reliable inpainting. To avoid extra tuning, we further explore the inner inpainting ability within the self-attention mechanism. We introduce a key-masking self-attention scheme that can propagate the surrounding context information into the masked region while mitigating its impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Due to the inherent modular advantages of such multi-layered representations, we can achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Last, we show that our approach is a unified framework that supports various accurate image editing tasks on more than six different editing tasks. This paper presents a training-free, forward-only, unified framework for accurate spatial-aware image editing tasks by decomposing the source image into multiple latent layers for independent manipulation and then fusing them into the target image. Existing text-to-image generation models struggle with precise spatial arrangements and previous image editing methods lack the flexibility for complex multi-object manipulation. This method aims to bridge this gap, offering more control and precision in image editing. The approach involves 1) Multi-layered latent decomposition: segmenting source image latent representations into object layers and a background layer, utilizing a key-masking self-attention scheme for accurate object removal and background inpainting. 2) Multi-layered latent fusion: pasting manipulated latent representations onto a canvas latent following user instructions or GPT-4V guidance, and refining the result with a harmonization process and an artifact suppression scheme. Outperforms state-of-the-art methods like Self-Guidance and DiffEditor in image quality and editing fidelity based on user studies. Achieves high-quality object removal comparable to specifically trained inpainting models without requiring finetuning. Successfully unifies various spatial-aware image editing tasks, including object removal, resizing, movement, flipping, camera panning, zooming out, and cross-image composition, demonstrating strong generalizability. The resolution difference between image and latent space can cause detail loss when resizing objects. Future work can explore further applications of the framework for more complex editing tasks. image editing, latent diffusion models, spatial-aware editing, multi-layered representation, gpt-4v
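A minimal single-head sketch of the key-masking self-attention described above: keys belonging to the removed region are masked out, so both inside and outside queries aggregate only surrounding context. This is a simplified stand-in for the paper's attention modification, not its exact hook.

```python
# Hedged sketch: exclude keys inside the removal mask during self-attention.
import torch
import torch.nn.functional as F

def key_masked_attention(q, k, v, hole_mask):
    # q, k, v: (N, D) tokens; hole_mask: (N,) bool, True for tokens inside the hole
    scores = q @ k.t() / k.shape[-1] ** 0.5
    scores = scores.masked_fill(hole_mask[None, :], float("-inf"))  # drop hole keys
    return F.softmax(scores, dim=-1) @ v
```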
2403.14468 Report AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks Max Ku, Cong Wei, Weiming Ren, Harry Yang, Wenhu Chen Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g. InstructPix2Pix, InstantID, etc) to modify the first frame, (2) utilizing an existing image-to-video generation model (e.g. I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tools to support an extensive array of video editing tasks. Beyond the traditional prompt-based editing methods, AnyV2V also can support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video models to perform DDIM inversion and intermediate feature injection to maintain the appearance and motion consistency with the source video. On the prompt-based editing, we show that AnyV2V can outperform the previous best approach by 35\% on prompt alignment, and 25\% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate the fast-evolving image editing methods. Such compatibility can help AnyV2V to increase its versatility to cater to diverse user demands. AnyV2V is a novel training-free, plug-and-play framework that simplifies video editing into two stages: (1) first-frame image editing using off-the-shelf models and (2) image-to-video generation via DDIM inversion and feature injection. Existing video editing methods are limited to specific editing types and often require retraining or complex feature extraction. AnyV2V addresses these limitations by enabling a wide range of editing tasks within a unified, efficient, and user-friendly framework. AnyV2V leverages pre-trained image editing and image-to-video generation models. It edits the first frame using an image editing model, then uses an I2V model to propagate the edit through the video while maintaining consistency with the source video's appearance and motion through feature injection. AnyV2V outperforms the previous best approach in prompt-based editing by 35% on prompt alignment and 25% on human preference. AnyV2V demonstrates compatibility with various image editing models, enabling diverse tasks such as style transfer, subject-driven editing, and identity manipulation. Ablation studies confirm the importance of DDIM inversion, spatial and temporal feature injection for maintaining consistency and structure in edited videos. The performance of AnyV2V depends on the accuracy of the initial first-frame edit, which can be limited by the capabilities of existing image editing models. AnyV2V's ability to handle fast or complex motion is constrained by the limitations of current I2V models. video editing, diffusion models, plug-and-play, image-to-video generation, ddim inversion
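A rough sketch of the feature-injection mechanism described above, using PyTorch forward hooks: features cached while denoising the inverted source latents replace the corresponding activations when denoising the edited video. Hook placement and layer choice are assumptions, not the paper's exact implementation.

```python
# Hedged sketch: cache source-pass features, then inject them during the edited pass.
import torch

cache = {}

def save_hook(name):
    def hook(module, inputs, output):
        cache[name] = output.detach()      # record source-video features at this layer
    return hook

def inject_hook(name):
    def hook(module, inputs, output):
        return cache.get(name, output)     # returning a value replaces the layer output
    return hook

# Usage sketch: register save_hook on selected U-Net blocks during the source-latent
# denoising pass, then swap to inject_hook for the edited-video pass at matching steps.
```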
2403.14376 Report InfNeRF: Towards Infinite Scale NeRF Rendering with O(log n) Space Complexity Jiabin Liang, Lanqing Zhang, Zhuoran Zhao, Xiangyu Xu The conventional mesh-based Level of Detail (LoD) technique, exemplified by applications such as Google Earth and many game engines, exhibits the capability to holistically represent a large scene even the Earth, and achieves rendering with a space complexity of O(log n). This constrained data requirement not only enhances rendering efficiency but also facilitates dynamic data fetching, thereby enabling a seamless 3D navigation experience for users. In this work, we extend this proven LoD technique to Neural Radiance Fields (NeRF) by introducing an octree structure to represent the scenes in different scales. This innovative approach provides a mathematically simple and elegant representation with a rendering space complexity of O(log n), aligned with the efficiency of mesh-based LoD techniques. We also present a novel training strategy that maintains a complexity of O(n). This strategy allows for parallel training with minimal overhead, ensuring the scalability and efficiency of our proposed method. Our contribution is not only in extending the capabilities of existing techniques but also in establishing a foundation for scalable and efficient large-scale scene representation using NeRF and octree structures. Presents InfNeRF, a novel Neural Radiance Field (NeRF) framework utilizing an octree structure for efficient large-scale scene representation and rendering. Addresses the limitations of existing large-scale NeRF methods in handling bird's-eye views and aliasing artifacts, aiming for scalable and memory-efficient rendering. Constructs an LoD octree where each node encapsulates a NeRF representing a specific region at a certain scale, enabling anti-aliasing rendering by querying appropriate nodes based on sampling point location and radius. Employs tree pruning for model sparsity and introduces a distributed training strategy for efficiency. Achieves a rendering memory complexity of O(log n), significantly reducing memory footprint compared to baseline methods. Demonstrates superior rendering quality with over 2.4dB improvement in PSNR due to inherent anti-aliasing properties. Presents an efficient and scalable distributed training strategy, reducing VRAM consumption and communication overhead. Reconstruction time and computational burden still need optimization compared to traditional photogrammetry methods. Exploring the fusion of octrees from diverse image sources and scales for reconstructing even larger scenes. neural radiance fields, large-scale scene reconstruction, level of detail, octree, anti-aliasing
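A minimal sketch of routing a sample to an octree level by comparing its footprint radius with the node size (coarser nodes for larger footprints), which is the mechanism behind the O(log n) working set claimed above; the exact matching rule is an assumption.

```python
# Hedged sketch: pick the deepest octree level whose cell is still larger than the
# sample's footprint radius (node size halves at each level).
import math

def select_level(sample_radius, root_size, max_depth):
    if sample_radius <= 0:
        return max_depth
    level = int(math.floor(math.log2(root_size / sample_radius)))
    return max(0, min(level, max_depth))
```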
2403.14291 Report Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models Pablo Marcos-Manchón, Roberto Alcover-Couso, Juan C. SanMiguel, Jose M. Martínez Diffusion models represent a new paradigm in text-to-image generation. Beyond generating high-quality images from text prompts, models such as Stable Diffusion have been successfully extended to the joint generation of semantic segmentation pseudo-masks. However, current extensions primarily rely on extracting attentions linked to prompt words used for image synthesis. This approach limits the generation of segmentation masks derived from word tokens not contained in the text prompt. In this work, we introduce Open-Vocabulary Attention Maps (OVAM)-a training-free method for text-to-image diffusion models that enables the generation of attention maps for any word. In addition, we propose a lightweight optimization process based on OVAM for finding tokens that generate accurate attention maps for an object class with a single annotation. We evaluate these tokens within existing state-of-the-art Stable Diffusion extensions. The best-performing model improves its mIoU from 52.1 to 86.6 for the synthetic images' pseudo-masks, demonstrating that our optimized tokens are an efficient way to improve the performance of existing methods without architectural changes or retraining. Introduces Open-Vocabulary Attention Maps (OVAM), a training-free method for text-to-image diffusion models enabling the generation of attention maps and semantic segmentation masks from open-vocabulary descriptions, independent of the image generation prompt. Existing methods for generating semantic segmentation masks from diffusion models are primarily limited by the tokens present in the text prompt used for image synthesis, restricting their flexibility and open-vocabulary capabilities. OVAM leverages cross-attention maps from diffusion models, using an independent 'attribution prompt' to generate attention maps for arbitrary words. It also introduces a token optimization process to learn accurate attention maps for specific object classes with just one annotation per class. OVAM with token optimization outperforms existing training-free methods and achieves comparable or superior results to methods requiring additional training data. Token optimization through OVAM significantly improves the performance of existing Stable Diffusion-based segmentation methods without requiring architectural changes or retraining. Synthetic data generated using OVAM with token optimization effectively trains semantic segmentation models, achieving competitive results on standard benchmarks. The current implementation of OVAM relies on a fixed threshold for binarizing attention maps, which could be further improved. Future work will explore extending OVAM to generate multi-class segmentation masks from a single attention map. semantic segmentation, diffusion models, open-vocabulary, stable diffusion, attention maps
2403.14270 Report Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection Tim Salzmann, Markus Ryll, Alex Bewley, Matthias Minderer Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples. This paper introduces Scene-Graph ViT, an efficient, end-to-end, open-vocabulary model for visual relationship detection using a Transformer-based encoder-only architecture and a novel Relationship Attention mechanism. VRD facilitates structured scene understanding, crucial for robotics, image retrieval, and grounding language models. Existing methods are complex and hinder end-to-end training. The model leverages a pretrained vision-language model, adds a Relationship Attention layer to extract object pairs likely to form a relationship, and is trained jointly on object and relationship datasets. Achieves state-of-the-art relationship detection performance on Visual Genome (29.5% mR@100) and GQA benchmarks. Demonstrates strong performance in open-vocabulary and zero-shot settings, benefiting from large-scale pretraining and multi-dataset training. Maintains real-time inference speeds comparable to pure object detectors due to efficient top-k selection in the Relationship Attention layer. Performance on specialized human-object interaction detection is on par with prior models, potentially limited by task-specific training data. Zero-shot generalization to unseen objects and predicates, a challenge for open-vocabulary VRD models, shows room for improvement. visual relationship detection, scene graph generation, vision transformer, open vocabulary, encoder-only architecture
2403.14244 Report Isotropic Gaussian Splatting for Real-Time Radiance Field Rendering Yuanhao Gong, Lantao Yu, Guanghui Yue The 3D Gaussian splatting method has drawn a lot of attention, thanks to its high performance in training and high quality of the rendered image. However, it uses anisotropic Gaussian kernels to represent the scene. Although such anisotropic kernels have advantages in representing the geometry, they lead to difficulties in terms of computation, such as splitting or merging two kernels. In this paper, we propose to use isotropic Gaussian kernels to avoid such difficulties in the computation, leading to a higher performance method. The experiments confirm that the proposed method is about 100X faster without losing the geometry representation accuracy. The proposed method can be applied in a wide range of applications where the radiance field is needed, such as 3D reconstruction, view synthesis, and dynamic object modeling. This paper proposes using scale-adaptive isotropic Gaussian kernels for signal representation, leading to a faster 3D Gaussian splatting method. While anisotropic Gaussian kernels are better at representing geometry in 3D Gaussian splatting, they lead to computational difficulties. Isotropic kernels offer a more efficient alternative. The method uses a two-stage approach: 1) initialization with a QuadTree/Octree structure to organize particles carrying color and opacity, and 2) optimization of a loss function that combines reconstruction error and SSIM. Isotropic Gaussian kernels can achieve high rendering quality with fewer artifacts. The proposed method is significantly faster (around 100 times) in the training process compared to using anisotropic kernels. The use of a tree structure for initialization enables efficient particle management. The paper focuses on 2D image experiments; further validation is needed for 3D scenarios. Exploring different optimization strategies beyond backpropagation and evolutionary algorithms could be beneficial. 3d gaussian splatting, isotropic gaussian kernels, radiance field, rendering, particle representation
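The computational argument is easiest to see on the kernel itself: an anisotropic Gaussian needs a rotation, per-axis scales, and a covariance inverse, while an isotropic kernel needs only a squared distance and one scale. A minimal sketch (not the paper's code; the unnormalized form and toy values are assumptions):

```python
import numpy as np

def anisotropic_gaussian(x, mu, R, scales):
    """Unnormalized anisotropic Gaussian with covariance R diag(s^2) R^T, as in 3DGS."""
    cov = R @ np.diag(scales**2) @ R.T
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def isotropic_gaussian(x, mu, s):
    """Unnormalized isotropic Gaussian: one scale s, covariance s^2 I.
    Only a squared Euclidean distance is needed -- no rotation or matrix inverse."""
    d = x - mu
    return np.exp(-0.5 * np.dot(d, d) / s**2)

x, mu = np.array([0.1, 0.2, 0.0]), np.zeros(3)
print(anisotropic_gaussian(x, mu, np.eye(3), np.array([0.3, 0.1, 0.05])))
print(isotropic_gaussian(x, mu, s=0.3))
```

Splitting or merging two isotropic kernels likewise reduces to operations on centers and a single radius, which is where the reported speedups come from.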
2403.14186 Report StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN Jongwoo Choi, Kwanggyoon Seo, Amirsaman Ashtari, Junyong Noh We propose a method that can generate cinemagraphs automatically from a still landscape image using a pre-trained StyleGAN. Inspired by the success of recent unconditional video generation, we leverage a powerful pre-trained image generator to synthesize high-quality cinemagraphs. Unlike previous approaches that mainly utilize the latent space of a pre-trained StyleGAN, our approach utilizes its deep feature space for both GAN inversion and cinemagraph generation. Specifically, we propose multi-scale deep feature warping (MSDFW), which warps the intermediate features of a pre-trained StyleGAN at different resolutions. By using MSDFW, the generated cinemagraphs are of high resolution and exhibit plausible looping animation. We demonstrate the superiority of our method through user studies and quantitative comparisons with state-of-the-art cinemagraph generation methods and a video generation method that uses a pre-trained StyleGAN. This paper introduces StyleCineGAN, a novel method for generating high-resolution (1024x1024) cinemagraphs from single landscape images using a pre-trained StyleGAN. Creating cinemagraphs is typically a manual, time-consuming process. Existing automatic methods are either limited in resolution, require reference videos, or necessitate extensive training of deep generative models. StyleCineGAN addresses these limitations. The method leverages the deep feature space of a pre-trained StyleGAN for GAN inversion and cinemagraph generation. It employs a multi-scale deep feature warping (MSDFW) technique, applying motion generated from the input image to the StyleGAN's intermediate features at different resolutions. This allows for plausible looping animations while preserving image quality and content. StyleCineGAN outperforms state-of-the-art cinemagraph generation methods in both qualitative and quantitative comparisons, demonstrating superior static consistency and motion quality. It also surpasses unconditional video generation methods using pre-trained StyleGANs in terms of content preservation, making it particularly suitable for cinemagraph creation. User studies confirm the effectiveness of StyleCineGAN, with participants rating its generated cinemagraphs significantly higher in overall quality. The automatic motion prediction can be ambiguous for certain images, requiring additional user guidance for accurate motion generation. Isolating the motion of thin structures within animated regions remains challenging due to the multi-scale nature of feature warping. cinemagraph generation, stylegan, deep feature warping, unconditional video generation, content preservation
2403.14166 Report Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians Guangchi Fang, Bing Wang In this study, we explore the challenge of efficiently representing scenes with a constrained number of Gaussians. Our analysis shifts from traditional graphics and 2D computer vision to the perspective of point clouds, highlighting the inefficient spatial distribution of Gaussian representation as a key limitation in model performance. To address this, we introduce strategies for densification including blur split and depth reinitialization, and simplification through intersection preserving and sampling. These techniques reorganize the spatial positions of the Gaussians, resulting in significant improvements across various datasets and benchmarks in terms of rendering quality, resource consumption, and storage compression. Our Mini-Splatting integrates seamlessly with the original rasterization pipeline, providing a strong baseline for future research in Gaussian-Splatting-based works. Code is available at https://github.com/fatPeter/mini-splatting. This paper presents Mini-Splatting, a novel method to efficiently represent scenes with a constrained number of Gaussians for 3D Gaussian Splatting (3DGS). 3DGS shows great potential in novel view synthesis; however, the large number of Gaussians used can lead to inefficiencies and limit rendering quality and speed. The authors analyze the spatial distribution of Gaussians and propose densification (blur split and depth reinitialization) and simplification (intersection preserving and sampling) strategies to reorganize Gaussians for a more efficient representation. Mini-Splatting-D achieves better rendering quality than the baseline 3DGS and even surpasses state-of-the-art neural rendering algorithm Zip-NeRF on some metrics. Mini-Splatting maintains comparable rendering quality to 3DGS while using significantly fewer Gaussians (7x fewer). Mini-Splatting demonstrates significant speed-up in both training and rendering with reduced memory usage. The depth-based reinitialization strategy may fail in areas without a well-defined depth value, such as the sky. Finding the minimal number of Gaussians while maintaining high quality rendering remains a challenge and could benefit from further investigation of uncertainty. gaussian splatting, point clouds, scene representation, densification, simplification
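The simplification-by-sampling idea can be illustrated with a tiny sketch: keep a fixed budget of Gaussians, drawn without replacement with probability proportional to an importance score. The score definition (accumulated blending contribution over training views) and budget below are assumptions for illustration, not the paper's exact criterion.

```python
import numpy as np

def sample_gaussians(importance, budget, seed=0):
    """Keep `budget` Gaussians, sampled without replacement with probability
    proportional to an importance score (assumed here to be each Gaussian's
    accumulated blending contribution over the training views)."""
    rng = np.random.default_rng(seed)
    p = importance / importance.sum()
    keep = rng.choice(len(importance), size=budget, replace=False, p=p)
    return np.sort(keep)

scores = np.random.default_rng(1).random(10_000)   # placeholder importance scores
kept = sample_gaussians(scores, budget=1_500)      # roughly 7x fewer Gaussians retained
```

Stochastic sampling (rather than a hard top-k cut) tends to preserve some spatial coverage in low-importance regions, which matches the paper's emphasis on reorganizing, not just shrinking, the point set.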
2403.14155 Report Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization Yeji Song, Jimyeong Kim, Wonhark Park, Wonsik Shin, Wonjong Rhee, Nojun Kwak In a surge of text-to-image (T2I) models and their customization methods that generate new images of a user-provided subject, current works focus on alleviating the costs incurred by a lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new, transient context. However, the existing methods often 1) are significantly affected by the input images, e.g., generating images with the same pose, and 2) exhibit deterioration in the subject's identity. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the textual embedding containing the desired pose information. To address this issue, we propose orthogonal visual embedding which effectively harmonizes with the given textual embedding. We also adopt the visual-only embedding and inject the subject's clear features utilizing a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while effectively maintaining the subject's identity. This paper introduces a novel method to address the challenges of pose variation and identity preservation in zero-shot text-to-image customization, aiming for more diverse and flexible subject-driven generation. Existing zero-shot customization methods, while effective in separating subject identity from background, struggle with disentangling pose from identity in visual embeddings, leading to pose bias and identity loss when modifying subject poses. The proposed method employs two key techniques: (1) Contextual Embedding Orchestration: orthogonalizes the visual embedding to the textual embedding subspace, reducing interference and enabling pose variation guided by text prompts. (2) Self-attention Swap: integrates clean identity information from a visual-only guided denoising process into the main generation process, preserving subject identity amidst pose modifications. The proposed method significantly improves text alignment and pose variation compared to baseline models, as demonstrated qualitatively and quantitatively on a newly introduced 'Deformable Subject Set' and the DreamBooth dataset. It effectively addresses both pose bias and identity loss, generating images that faithfully follow text prompts regarding pose while maintaining subject identity. A user study confirms the effectiveness, with the proposed method preferred for both text and image alignment compared to baselines. The method might struggle with handling multiple, potentially conflicting text prompts simultaneously due to the orthogonalization process. Future work could explore extending the method to address complex compositions involving multiple subjects and intricate interactions. text-to-image synthesis, zero-shot learning, subject-driven generation, pose variation, identity preservation
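The core of "orthogonalizing the visual embedding to the textual embedding subspace" is a standard linear-algebra projection. A minimal sketch, assuming the textual subspace is spanned by the prompt's token embeddings (the dimensions and QR-based basis are illustrative, not the paper's exact operator):

```python
import torch

def orthogonalize(visual_emb, text_embs):
    """Remove from the visual embedding its projection onto the span of the
    textual embeddings, so pose cues carried by the text are not overridden.
    visual_emb: (d,), text_embs: (k, d)."""
    Q, _ = torch.linalg.qr(text_embs.T)   # orthonormal basis of the text subspace, (d, k)
    proj = Q @ (Q.T @ visual_emb)         # component of visual_emb inside that subspace
    return visual_emb - proj              # component orthogonal to every text token

v = torch.randn(768)
t = torch.randn(4, 768)
v_orth = orthogonalize(v, t)
print(torch.allclose(t @ v_orth, torch.zeros(4), atol=1e-3))  # ~orthogonal to all text tokens
```

The orthogonalized visual embedding is then used for guidance alongside the text, while the untouched visual-only embedding feeds the self-attention swap that restores identity detail.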
2403.14148 Report Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, Anima Anandkumar Video diffusion models have recently made great progress in generation quality, but are still limited by the high memory and computational requirements. This is because current video diffusion models often attempt to process high-dimensional videos directly. To tackle this issue, we propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. The former represents the common content, and the latter represents the underlying motion in the video, respectively. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation here is the design of a compact latent space that can directly utilize a pretrained image diffusion model, which has not been done in previous latent video diffusion models. This leads to considerably better quality generation and reduced computational costs. For instance, CMD can sample a video 7.7x faster than prior approaches by generating a video of 512x1024 resolution and length 16 in 3.1 seconds. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4. This paper introduces CMD (Content-Motion Latent Diffusion Model), an efficient method for video generation that leverages pre-trained image diffusion models. Existing video diffusion models struggle with high computational costs and memory requirements due to processing high-dimensional videos directly. CMD addresses these limitations. CMD uses an autoencoder to compress videos into a content frame (similar to an image) and a low-dimensional motion latent representation. A pre-trained image diffusion model generates the content frame, and a lightweight diffusion model generates the motion latent representation. CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4. It generates a 512x1024 resolution video of 16 frames in 3.1 seconds, 7.7x faster than prior approaches. CMD demonstrates significant efficiency in terms of FLOPs and memory consumption during both training and sampling compared to other methods. The paper mainly focuses on generating videos of fixed length, limiting its applicability to longer videos. The quality of the autoencoder could be further improved, particularly for videos containing highly dynamic motion. video generation, diffusion models, latent space, computational efficiency, text-to-video generation
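The content-frame/motion-latent split can be sketched as a toy module: the content frame is an image-shaped temporal aggregation of the video, and the motion latent is a small per-frame vector. The learned temporal weights and the pooling-based motion head below are illustrative stand-ins, not CMD's actual encoder.

```python
import torch
import torch.nn as nn

class ContentMotionEncoder(nn.Module):
    """Toy decomposition in the spirit of CMD: a video (B, T, C, H, W) becomes one
    image-like content frame plus a low-dimensional per-frame motion latent."""
    def __init__(self, frames=16, channels=4, motion_dim=32):
        super().__init__()
        self.time_weights = nn.Parameter(torch.zeros(frames))   # learned temporal weights
        self.motion_head = nn.Linear(channels, motion_dim)       # per-frame motion latent

    def forward(self, video):                                    # (B, T, C, H, W)
        w = self.time_weights.softmax(dim=0).view(1, -1, 1, 1, 1)
        content = (w * video).sum(dim=1)                          # (B, C, H, W) content frame
        motion = self.motion_head(video.mean(dim=(3, 4)))         # (B, T, motion_dim)
        return content, motion

enc = ContentMotionEncoder()
content, motion = enc(torch.randn(2, 16, 4, 32, 32))
```

Because the content frame has the same shape as an image latent, a pretrained image diffusion model can be fine-tuned to generate it directly, while only the much smaller motion latent needs a new diffusion model.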
2403.14141 Report Empowering Segmentation Ability to Multi-modal Large Language Models Yuqi Yang, Peng-Tao Jiang, Jing Wang, Hao Zhang, Kai Zhao, Jinwei Chen, Bo Li Multi-modal large language models (MLLMs) can understand image-language prompts and demonstrate impressive reasoning ability. In this paper, we extend MLLMs' output by empowering MLLMs with the segmentation ability. The extended MLLMs can both output language responses to the image-language prompts and segment the regions that the complex question or query in the language prompts focuses on. To this end, the existing work, LISA, enlarges the original word embeddings with an additional segment token and fine-tunes dialogue generation and query-focused segmentation together, where the feature of the segment token is used to prompt the segment-anything model. Although they achieve superior segmentation performance, we observe that the dialogue ability decreases by a large margin compared to the original MLLMs. To maintain the original MLLMs' dialogue ability, we propose a novel MLLMs framework, coined as LLaVASeg, which leverages a chain-of-thought prompting strategy to instruct the MLLMs to segment the target region queried by the user. The MLLMs are first prompted to reason about the simple description of the target region from the complicated user query, then extract the visual attributes of the target region according to the understanding of MLLMs to the image. These visual attributes, such as color and relative locations, are utilized to prompt the downstream segmentation model. Experiments show that the proposed method keeps the original dialogue ability and equips the MLLMs' model with strong reasoning segmentation ability. The code is available at https://github.com/YuqiYang213/LLaVASeg. This paper proposes LLaVASeg, a novel framework that empowers Multi-modal Large Language Models (MLLMs) with segmentation abilities while preserving their conversational and reasoning skills, unlike previous fine-tuning approaches that often degrade these abilities. Extending MLLMs to possess segmentation capabilities similar to human perception can significantly enhance their understanding and interaction with visual information, allowing them to both comprehend complex queries and pinpoint relevant regions in images. LLaVASeg employs a chain-of-thought prompting strategy that guides MLLMs to generate image-specific textual attributes (e.g., color, shape, relative location) for the target region. These attributes are then used to prompt a multi-scale promptable segmentation model that segments the target. LLaVASeg achieves state-of-the-art segmentation performance on the ReasonSeg dataset, surpassing previous methods like LISA. The proposed chain-of-thought prompting strategy proves highly effective in extracting relevant visual attributes for segmentation. Unlike fine-tuning approaches, LLaVASeg maintains the original dialogue and reasoning capabilities of the MLLMs, as demonstrated by its superior performance on CIDEr and ROUGE-L metrics. The current framework only supports a single query per interaction, limiting its applicability to more complex scenarios. While LLaVASeg uses off-the-shelf MLLMs, its performance could be further enhanced by incorporating instruction tuning with high-quality chain-of-thought instruction pairs. multi-modal large language models, reasoning segmentation, chain-of-thought prompting, visual attributes, multi-scale prompting
2403.13951 Report ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-On Jeffrey Zhang, Kedan Li, Shao-Yu Chang, David Forsyth Virtual Try-on (VTON) involves generating images of a person wearing selected garments. Diffusion-based methods, in particular, can create high-quality images, but they struggle to maintain the identities of the input garments. We identified this problem stems from the specifics in the training formulation for diffusion. To address this, we propose a unique training scheme that limits the scope in which diffusion is trained. We use a control image that perfectly aligns with the target image during training. In turn, this accurately preserves garment details during inference. We demonstrate our method not only effectively conserves garment details but also allows for layering, styling, and shoe try-on. Our method runs multi-garment try-on in a single inference cycle and can support high-quality zoomed-in generations without training in higher resolutions. Finally, we show our method surpasses prior methods in accuracy and quality. This paper introduces ACDG-VTON, a novel virtual try-on method leveraging diffusion models while preserving garment details by aligning garment features during training and using a novel zoom-in generation process. Current virtual try-on systems struggle to balance garment accuracy, generation quality, and user controllability. This method offers a solution by improving accuracy and quality while allowing for multi-garment layering, styling variations, and shoe try-on. ACDG-VTON uses a warp-then-diffuse pipeline. It generates a control image with aligned garment features and employs a ControlNet architecture with a modified training process. For high-resolution zoom-in, it crops and upsamples specific regions, leveraging the diffusion model's ability to accurately copy details. ACDG-VTON accurately preserves garment details like logos, text, textures, and patterns, outperforming existing diffusion-based methods. User studies confirm that ACDG-VTON surpasses previous methods in accurately replicating garment details, both in full-body and zoomed-in views. The method improves the visual quality of generated images compared to GAN-based approaches while maintaining garment accuracy and user controllability, as demonstrated through qualitative examples and user studies. The method's accuracy depends on the performance of the pre-trained warper and layout generator. The system may struggle with garment types or poses not well-represented in the training dataset, such as garments with transparency or complex drape. virtual try-on, diffusion models, accuracy, controllability, image generation
2403.13826 Report Measuring Diversity in Co-creative Image Generation Francisco Ibarrola, Kazjon Grace Quality and diversity have been proposed as reasonable heuristics for assessing content generated by co-creative systems, but to date there has been little agreement around what constitutes the latter or how to measure it. Proposed approaches for assessing generative models in terms of diversity have limitations in that they compare the model's outputs to a ground truth that in the era of large pre-trained generative models might not be available, or entail an impractical number of computations. We propose an alternative based on entropy of neural network encodings for comparing diversity between sets of images that does not require ground-truth knowledge and is easy to compute. We also compare two pre-trained networks and show how the choice relates to the notion of diversity that we want to evaluate. We conclude with a discussion of the potential applications of these measures for ideation in interactive systems, model evaluation, and more broadly within computational creativity. This paper proposes novel, computationally inexpensive methods for estimating within-set diversity of images generated by text-to-image AI systems, using either Truncated Inception Entropy (TIE) or Truncated CLIP Entropy (TCE). Diversity in generated images is crucial for interactive AI systems to support creative exploration and problem reframing, but current methods lack practicality or require ground truth data, which is often unavailable for large pre-trained models. The methods involve calculating the entropy of the empirical distribution of a set of generated images in a latent space derived from pre-trained networks (InceptionV3 for TIE and CLIP for TCE). TIE and TCE successfully differentiate between sets of images generated with varying degrees of diversity in prompt and style. TCE, based on a model trained on both text and images, is more sensitive to semantic variations in images compared to TIE. Preliminary experiments suggest TCE could be applicable to assessing text diversity as well. The proposed measures require further validation through human perception studies to confirm their alignment with human judgment of diversity. Future work will explore the use of other pre-trained networks and layers, potentially leading to measures with different biases (e.g., more sensitive to visual textures). computational creativity, image generation, diversity measures, co-creative systems, generative ai
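A rough sketch of the entropy-of-encodings idea: fit a Gaussian to the set of image embeddings and report its differential entropy, truncating the covariance spectrum so a small sample in a high-dimensional feature space still yields a finite number. The truncation rule and the choice of k are assumptions for illustration; the paper's exact TIE/TCE definitions may differ.

```python
import numpy as np

def truncated_entropy(features, k=64):
    """Gaussian differential entropy of a set of image encodings (e.g., CLIP or
    InceptionV3 features), truncated to the top-k covariance eigenvalues."""
    feats = features - features.mean(axis=0)
    cov = feats.T @ feats / (len(feats) - 1)
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1][:k]
    eig = eig[eig > 1e-12]
    return 0.5 * np.sum(np.log(2 * np.pi * np.e * eig))

set_a = np.random.default_rng(0).normal(size=(100, 512))   # spread-out (diverse) embeddings
set_b = set_a * 0.1                                         # tightly clustered embeddings
print(truncated_entropy(set_a) > truncated_entropy(set_b))  # True: set_a scores as more diverse
```

Swapping the encoder (Inception vs. CLIP) changes which notion of diversity the score is sensitive to, which is the comparison the paper makes.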
2403.13807 Report Editing Massive Concepts in Text-to-Image Diffusion Models Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization for each individual concept with dual self-distillation from text alignment loss and diffusion noise prediction loss. The second stage conducts massive concept editing with multi-layer, closed form model editing. We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics. Extensive experiments conducted on our proposed benchmark and previous benchmarks demonstrate the superior scalability of EMCID for editing up to 1,000 concepts, providing a practical approach for fast adjustment and re-deployment of T2I diffusion models in real-world applications. This paper proposes EMCID, a two-stage method for editing a large number of concepts in text-to-image diffusion models. Text-to-image diffusion models can generate outdated, biased, or incorrect content. Editing concepts within these models offers a practical solution without requiring expensive retraining. EMCID uses dual self-distillation to optimize concept representations in the first stage. In the second stage, it uses a closed-form solution for multi-layer model editing, enabling large-scale concept updates. EMCID successfully edits up to 1,000 concepts while preserving the generation quality for non-edited concepts. A new comprehensive benchmark, ICEB, is introduced to evaluate large-scale concept editing in T2I models. EMCID outperforms previous methods in terms of scalability, generalization ability, and specificity, particularly for editing a large number of concepts. EMCID faces limitations in erasing NSFW content due to the complexity of visual concepts and potential associations. Future work can explore combining EMCID with methods targeting specific aspects like NSFW content removal for a more comprehensive solution. text-to-image generation, diffusion models, concept editing, model editing, benchmarking
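The second stage's "multi-layer, closed-form model editing" is in the family of ROME/MEMIT-style weight updates: solve a ridge-regularized least-squares problem so a linear layer maps edited keys to new target values while staying close to its behavior on preserved keys. The sketch below is that generic closed form under stated assumptions; EMCID's exact objective and how it distributes edits across layers may differ.

```python
import numpy as np

def closed_form_edit(W, K_edit, V_edit, C_preserve, lam=1e-2):
    """Closed-form update: make W map new keys K_edit (d_in, m) to targets V_edit
    (d_out, m) while preserving behavior summarized by C_preserve = K_p @ K_p.T."""
    residual = V_edit - W @ K_edit
    A = C_preserve + K_edit @ K_edit.T + lam * np.eye(W.shape[1])
    return W + residual @ K_edit.T @ np.linalg.inv(A)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))                      # toy projection weight
K_edit = rng.normal(size=(128, 10))                 # optimized concept keys (stage one output)
V_edit = rng.normal(size=(64, 10))                  # target values for the edited concepts
K_p = rng.normal(size=(128, 1000))                  # keys whose behavior should be preserved
W_new = closed_form_edit(W, K_edit, V_edit, K_p @ K_p.T / 1000)
```

Because the update is closed-form, editing hundreds of concepts amounts to assembling the key/value matrices and solving one linear system per edited layer, which is what makes the 1,000-concept scale practical.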
2403.13806 Report RadSplat: Radiance Field-Informed Gaussian Splatting for Robust Real-Time Rendering with 900+ FPS Michael Niemeyer, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Oechsle, Daniel Duckworth, Rama Gosula, Keisuke Tateno, John Bates, Dominik Kaeser, Federico Tombari Recent advances in view synthesis and real-time rendering have achieved photorealistic quality at impressive rendering speeds. While Radiance Field-based methods achieve state-of-the-art quality in challenging scenarios such as in-the-wild captures and large-scale scenes, they often suffer from excessively high compute requirements linked to volumetric rendering. Gaussian Splatting-based methods, on the other hand, rely on rasterization and naturally achieve real-time rendering but suffer from brittle optimization heuristics that underperform on more challenging scenes. In this work, we present RadSplat, a lightweight method for robust real-time rendering of complex scenes. Our main contributions are threefold. First, we use radiance fields as a prior and supervision signal for optimizing point-based scene representations, leading to improved quality and more robust optimization. Next, we develop a novel pruning technique reducing the overall point count while maintaining high quality, leading to smaller and more compact scene representations with faster inference speeds. Finally, we propose a novel test-time filtering approach that further accelerates rendering and allows to scale to larger, house-sized scenes. We find that our method enables state-of-the-art synthesis of complex captures at 900+ FPS. RadSplat, a method combining radiance fields and Gaussian Splatting for robust real-time rendering of complex scenes. Achieve real-time rendering of complex scenes with high quality, addressing limitations of both radiance field (computationally expensive) and Gaussian Splatting (brittle optimization) methods. 1. Train a radiance field (ZipNeRF) as a robust prior. 2. Initialize and supervise a point-based 3DGS representation using the radiance field. 3. Introduce a ray contribution-based pruning technique for compact scene representation. 4. Perform viewpoint-based visibility filtering to accelerate rendering. Achieves state-of-the-art view synthesis quality, outperforming previous real-time methods and even surpassing offline method ZipNeRF in some metrics. RadSplat renders at speeds exceeding 900 FPS, significantly faster than prior works. Demonstrates robustness in handling complex real-world captures with lighting and exposure variations. Training time is longer compared to single-representation models. A small quality gap to ZipNeRF remains on large-scale scenes. real-time rendering, gaussian splatting, neural fields, view synthesis, 3d reconstruction
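The contribution-based pruning step can be sketched as follows: a Gaussian's contribution on a ray is its alpha times the accumulated transmittance, and a Gaussian is kept only if its maximum contribution over the training rays exceeds a threshold. The dense per-ray matrix and the threshold value are simplifying assumptions for illustration.

```python
import numpy as np

def prune_by_contribution(alphas_per_ray, threshold=0.01):
    """alphas_per_ray: (num_rays, num_gaussians), opacities ordered front-to-back
    along each ray. Keep a Gaussian if its max blending weight over all rays
    exceeds `threshold` (threshold value assumed)."""
    trans = np.cumprod(1.0 - alphas_per_ray, axis=1)
    trans = np.concatenate([np.ones((len(alphas_per_ray), 1)), trans[:, :-1]], axis=1)
    contribution = alphas_per_ray * trans          # alpha_i * prod_{j<i}(1 - alpha_j)
    return contribution.max(axis=0) > threshold

alphas = np.random.default_rng(0).uniform(0, 0.2, size=(5000, 200))
mask = prune_by_contribution(alphas)
print(int(mask.sum()), "of", len(mask), "Gaussians kept")
```

Test-time visibility filtering applies the same ray-contribution reasoning per viewpoint cluster, so only the Gaussians that can actually contribute to a rendered view are rasterized.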
2403.13788 Report DepthFM: Fast Monocular Depth Estimation with Flow Matching Ming Gui, Johannes S. Fischer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer Monocular depth estimation is crucial for numerous downstream vision tasks and applications. Current discriminative approaches to this problem are limited due to blurry artifacts, while state-of-the-art generative methods suffer from slow sampling due to their SDE nature. Rather than starting from noise, we seek a direct mapping from input image to depth map. We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality. Our study demonstrates that a pre-trained image diffusion model can serve as an adequate prior for a flow matching depth model, allowing efficient training on only synthetic data to generalize to real images. We find that an auxiliary surface normals loss further improves the depth estimates. Due to the generative nature of our approach, our model reliably predicts the confidence of its depth estimates. On standard benchmarks of complex natural scenes, our lightweight approach exhibits state-of-the-art performance at favorable low computational cost despite only being trained on little synthetic data. Presents DepthFM, a flow matching model for fast monocular depth estimation achieving state-of-the-art results with low computational cost. Crucial for various vision tasks, existing discriminative methods produce blurry depth maps, and generative methods are slow. Leverages pre-trained image diffusion models as prior and employs data-dependent flow matching to learn a direct mapping from input image to depth, incorporating an auxiliary surface normals loss for enhanced geometric accuracy. Achieves state-of-the-art performance on standard benchmarks using only synthetic training data. Significantly faster than diffusion-based methods due to its one-step inference capability. Provides reliable confidence estimates, unlike discriminative approaches. Relies on accurate camera intrinsics for surface normal estimation. Limited exploration of different pre-trained diffusion models as priors. depth estimation, flow matching, generative model, zero-shot learning, confidence estimation
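The key training signal is a data-coupled flow matching loss: rather than flowing from Gaussian noise, the flow runs straight from the image latent to the depth latent, and the network regresses the constant velocity along that path. A minimal sketch of that objective (the dummy velocity network is a placeholder; DepthFM also conditions on the image and adds a surface-normals loss, omitted here):

```python
import torch

def flow_matching_loss(v_model, img_latent, depth_latent):
    """Straight-path flow matching between paired latents: x_t = (1-t)x0 + t*x1,
    target velocity is x1 - x0, and v_model(x_t, t) regresses it."""
    x0, x1 = img_latent, depth_latent
    t = torch.rand(x0.shape[0], 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1          # point on the straight trajectory
    target_v = x1 - x0                    # constant velocity along the path
    return ((v_model(x_t, t) - target_v) ** 2).mean()

v_model = lambda x, t: torch.zeros_like(x)   # dummy stand-in for the velocity UNet
loss = flow_matching_loss(v_model, torch.randn(4, 4, 32, 32), torch.randn(4, 4, 32, 32))
```

Because the learned trajectories are (approximately) straight, a single integration step at inference already gives a usable depth map, which is where the speed advantage over SDE-based generative depth estimators comes from.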
2403.13745 Report Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation Fu-Yun Wang, Xiaoshi Wu, Zhaoyang Huang, Xiaoyu Shi, Dazhong Shen, Guanglu Song, Yu Liu, Hongsheng Li Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA (Mastering Video Outpainting Through Input-Specific Adaptation), a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase involves conducting efficient and effective pseudo outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video, as well as bridging the gap between standard generative processes and outpainting. The subsequent phase, pattern-aware outpainting, is dedicated to the generalization of these learned patterns to generate outpainting outcomes. Additional strategies including spatial-aware insertion and noise travel are proposed to better leverage the diffusion model's generative prior and the acquired video patterns from source videos. Extensive evaluations underscore MOTIA's superiority, outperforming existing state-of-the-art methods in widely recognized benchmarks. Notably, these advancements are achieved without necessitating extensive, task-specific tuning. Introduces MOTIA, a diffusion-based video outpainting pipeline that leverages both intrinsic data-specific patterns of source videos and the image/video generative prior for effective outpainting. Video outpainting is crucial for adapting videos to various aspect ratios and screen sizes seamlessly while preserving temporal and spatial consistency, which is challenging for existing methods. Employs a two-stage process: 1) input-specific adaptation by conducting pseudo outpainting learning on the source video itself and 2) pattern-aware outpainting by combining learned patterns with diffusion models, incorporating spatial-aware insertion and noise regret strategies. Significantly outperforms state-of-the-art methods in quantitative metrics (PSNR, SSIM, LPIPS, FVD) on DAVIS and YouTube-VOS benchmarks. Demonstrates superior visual quality and realism in qualitative comparisons, effectively handling both foreground and background outpainting. Showcases flexibility in handling various mask types, video resolutions and lengths, and arbitrary styles, surpassing previous limitations. Struggles with outpainting videos containing limited source information. Future work could explore better utilization of temporal information for enhanced consistency. video outpainting, diffusion models, input-specific adaptation, pattern-aware outpainting, spatial-aware insertion
2403.13600 Report VL-Mamba: Exploring State Space Models for Multimodal Learning Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu Multimodal large language models (MLLMs) have attracted widespread interest and have rich applications. However, the inherent attention mechanism in its Transformer structure requires quadratic complexity and results in expensive computational overhead. Therefore, in this work, we propose VL-Mamba, a multimodal large language model based on state space models, which have been shown to have great potential for long-sequence modeling with fast inference and linear scaling in sequence length. Specifically, we first replace the transformer-based backbone language model such as LLama or Vicuna with the pre-trained Mamba language model. Then, we empirically explore how to effectively apply the 2D vision selective scan mechanism for multimodal learning and the combinations of different vision encoders and variants of pretrained Mamba language models. The extensive experiments on diverse multimodal benchmarks with competitive performance show the effectiveness of our proposed VL-Mamba and demonstrate the great potential of applying state space models for multimodal learning tasks. This paper introduces VL-Mamba, the first exploration of using the state space model 'Mamba' for multimodal learning tasks, aiming to leverage its efficiency for handling long sequences in vision and language understanding. Existing multimodal large language models (MLLMs) heavily rely on Transformers, which suffer from quadratic complexity in attention mechanisms, making them computationally expensive for long sequences. VL-Mamba addresses this limitation by employing the Mamba model known for its linear scaling in sequence length. VL-Mamba comprises a pre-trained Mamba language model, a vision encoder (Vision Transformer), and a novel MultiModal Connector (MMC). The MMC, incorporating a 2D vision selective scan mechanism, bridges the gap between non-causal image data and the causal modeling of SSMs. Two scan mechanisms, Bidirectional and Cross Scanning, are explored within the MMC. VL-Mamba achieves competitive performance on eight multimodal benchmarks, comparable to state-of-the-art MLLMs despite having fewer parameters and training data. The study shows that VL-Mamba outperforms some larger models, highlighting the efficiency of SSMs for multimodal learning. Ablation studies confirm the effectiveness of different components, including language model variants, vision encoders, MMC architectures, and scan mechanisms. The paper primarily focuses on the 2D selective scan mechanism in the MMC, leaving the exploration of higher-quality training data for future work. Future research could investigate incorporating the training data used by top-performing MLLMs to potentially enhance VL-Mamba's performance further. multimodal learning, large language models, state space models, vision and language, mamba
2403.13589 Report ReGround: Improving Textual and Spatial Grounding at No Cost Yuseung Lee, Minhyuk Sung When an image generation process is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do these elements work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention. We demonstrate that such bias can be significantly mitigated without sacrificing accuracy in either grounding by simply rewiring the network architecture, changing from sequential to parallel for gated self-attention and cross-attention. This surprisingly simple yet effective solution does not require any fine-tuning of the network but significantly reduces the trade-off between the two groundings. Our experiments demonstrate significant improvements from the original GLIGEN to the rewired version in the trade-off between textual grounding and spatial grounding. This paper introduces ReGround, a method to improve textual grounding in layout-guided image generation by rewiring the attention mechanism in GLIGEN from sequential to parallel. Existing methods like GLIGEN, while enabling spatial grounding with bounding boxes, often overlook textual details in prompts, leading to a trade-off between textual and spatial accuracy. The authors propose a simple rewiring of the network architecture in GLIGEN, changing the relationship between gated self-attention (spatial grounding) and cross-attention (textual grounding) from sequential to parallel. ReGround significantly reduces the trade-off between textual and spatial grounding, achieving higher CLIP scores (textual grounding) while maintaining comparable YOLO scores (spatial grounding) to GLIGEN. The improvement is consistent across different datasets, including MS-COCO and a newly introduced NSR-1K-GPT dataset. ReGround's effectiveness extends to other frameworks that use GLIGEN as a backbone, such as BoxDiff, demonstrating its broad applicability. The study primarily focuses on GLIGEN and its application with bounding box layouts, potentially limiting its generalizability to other spatial grounding techniques. Further investigation into the impact of rewiring on more complex and diverse layout representations could be beneficial. textual grounding, spatial grounding, image generation, diffusion models, network rewiring
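The rewiring itself is a one-line change to the residual flow: instead of cross-attention consuming the output of gated self-attention (sequential), both branches read the same input and their residuals are summed (parallel). The sketch below uses placeholder linear sub-layers standing in for GLIGEN's gated self-attention and cross-attention modules; it illustrates the wiring, not the actual attention math.

```python
import torch
import torch.nn as nn

class RewiredBlock(nn.Module):
    """ReGround-style rewiring: spatial grounding (gsa) and textual grounding (ca)
    see the same input, so neither branch's output is filtered through the other."""
    def __init__(self, dim=320):
        super().__init__()
        self.gsa = nn.Linear(dim, dim)  # stand-in for gated self-attention over [visual, box] tokens
        self.ca = nn.Linear(dim, dim)   # stand-in for cross-attention over text tokens

    def forward(self, x, parallel=True):
        if parallel:                     # rewired: x + GSA(x) + CA(x)
            return x + self.gsa(x) + self.ca(x)
        h = x + self.gsa(x)              # original GLIGEN: sequential flow
        return h + self.ca(h)            # CA only ever sees the GSA-modified features

block = RewiredBlock()
out = block(torch.randn(2, 64, 320))
```

Since no weights are added or changed, the pretrained GLIGEN checkpoint can be reused as-is, which is why the paper reports the improvement "at no cost."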
2403.13551 Report Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing Hangeol Chang, Jinho Chang, Jong Chul Ye Despite recent advancements in text-to-image diffusion models facilitating various image editing techniques, complex text prompts often lead to an oversight of some requests due to a bottleneck in processing text information. To tackle this challenge, we present Ground-A-Score, a simple yet powerful model-agnostic image editing method by incorporating grounding during score distillation. This approach ensures a precise reflection of intricate prompt requirements in the editing outcomes, taking into account the prior knowledge of the object locations within the image. Moreover, the selective application with a new penalty coefficient and contrastive loss helps to precisely target editing areas while preserving the integrity of the objects in the source image. Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts, ensuring high-quality outcomes that respect the original image attributes. Presents Ground-A-Score, a model-agnostic image editing method using grounding during score distillation for multi-attribute editing, improving accuracy and detail in complex prompts. Existing score distillation methods struggle to accurately reflect complex prompts with multiple editing requirements, often overlooking specific objects or compositions. Breaks down complex prompts into subtasks, calculates score gradients separately, aggregates them with grounding information, and introduces a null-text penalty to prevent object distortion during optimization. Successfully edits multiple image attributes according to complex prompts, outperforming baseline models in qualitative assessments. Quantitative analyses confirm Ground-A-Score achieves higher image quality (lower LPIPS) and better prompt adherence (higher masked CLIP score). User study confirms Ground-A-Score produces edits more aligned with user intent, preserving original features while ensuring high overall image quality. Reliance on pre-trained models (T2I diffusion, grounding, LLM) may inherit their limitations. Performance may vary across diverse image domains and with highly complex or ambiguous prompts. image editing, diffusion models, score distillation, multi-attribute editing, grounding
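The aggregation step can be sketched as a masked, weighted sum of per-subtask score gradients: each sub-prompt's gradient only touches the region its grounding mask covers. The per-subtask weights below stand in for the paper's penalty coefficients, and the half-image masks are toy placeholders.

```python
import numpy as np

def aggregate_grounded_grads(grads, masks, weights=None):
    """Combine per-subtask score-distillation gradients using their grounding masks,
    so each edit is confined to the region its sub-prompt refers to."""
    weights = np.ones(len(grads)) if weights is None else np.asarray(weights)
    total = np.zeros_like(grads[0])
    for g, m, w in zip(grads, masks, weights):
        total += w * m * g               # mask (1, H, W) broadcasts over channels
    return total

H = W = 64
grads = [np.random.default_rng(i).normal(size=(3, H, W)) for i in range(3)]
masks = [np.zeros((1, H, W)) for _ in range(3)]
masks[0][:, :32, :] = 1.0   # subtask 0 edits the top half
masks[1][:, 32:, :] = 1.0   # subtask 1 edits the bottom half
masks[2][:, :, :32] = 1.0   # subtask 2 edits the left half
combined = aggregate_grounded_grads(grads, masks)
```

The null-text penalty and contrastive loss described in the paper are then added on top of this aggregated gradient to keep untouched regions of the source image intact.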
2403.13535 Report IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, Ziyong Feng Leveraging Stable Diffusion for the generation of personalized portraits has emerged as a powerful and noteworthy tool, enabling users to create high-fidelity, custom character avatars based on their specific prompts. However, existing personalization methods face challenges, including test-time fine-tuning, the requirement of multiple input images, low preservation of identity, and limited diversity in generated outcomes. To overcome these challenges, we introduce IDAdapter, a tuning-free approach that enhances the diversity and identity preservation in personalized image generation from a single face image. IDAdapter integrates a personalized concept into the generation process through a combination of textual and visual injections and a face identity loss. During the training phase, we incorporate mixed features from multiple reference images of a specific identity to enrich identity-related content details, guiding the model to generate images with more diverse styles, expressions, and angles compared to previous works. Extensive evaluations demonstrate the effectiveness of our method, achieving both diversity and identity fidelity in generated images. This paper presents IDAdapter, a tuning-free method for personalizing text-to-image synthesis models using a single face image, achieving high diversity in generated images without test-time fine-tuning. Existing personalization methods struggle with challenges like test-time fine-tuning, needing multiple input images, low identity preservation, and limited output diversity. IDAdapter addresses these limitations by enabling diverse and high-fidelity image generation from a single face image. IDAdapter integrates mixed features from multiple reference images during training to enrich identity information, guiding the model to generate images with diverse styles, expressions, and angles. It employs textual and visual injections to incorporate a personalized concept and uses a face identity loss to preserve identity. IDAdapter outperforms existing methods in generating diverse and high-fidelity personalized images. It successfully decouples identity and non-identity features, allowing for variations in expression, pose, and style while maintaining facial fidelity. The use of mixed facial features from multiple reference images significantly improves diversity and identity preservation compared to using a single image. The model's performance might be influenced by the quality and diversity of the training dataset. Future work could explore extending the method to handle more complex personalization scenarios, such as full-body generation with diverse clothing and accessories. text-to-image synthesis, personalization, diffusion models, face generation, tuning-free
2403.13524 Report Compress3D: a Compressed Latent Space for 3D Generation from a Single Image Bowen Zhang, Tianyu Yang, Yu Li, Lei Zhang, Xi Zhao 3D generation has witnessed significant advancements, yet efficiently producing high-quality 3D assets from a single image remains challenging. In this paper, we present a triplane autoencoder, which encodes 3D models into a compact triplane latent space to effectively compress both the 3D geometry and texture information. Within the autoencoder framework, we introduce a 3D-aware cross-attention mechanism, which utilizes low-resolution latent representations to query features from a high-resolution 3D feature volume, thereby enhancing the representation capacity of the latent space. Subsequently, we train a diffusion model on this refined latent space. In contrast to solely relying on image embedding for 3D generation, our proposed method advocates for the simultaneous utilization of both image embedding and shape embedding as conditions. Specifically, the shape embedding is estimated via a diffusion prior model conditioned on the image embedding. Through comprehensive experiments, we demonstrate that our method outperforms state-of-the-art algorithms, achieving superior performance while requiring less training data and time. Our approach enables the generation of high-quality 3D assets in merely 7 seconds on a single A100 GPU. This paper introduces Compress3D, a novel two-stage diffusion model for generating high-quality 3D models from single images using a compressed latent space. Efficiently generating high-quality 3D models from single images is crucial for various applications, but remains challenging due to limitations in data size and computational efficiency. The method employs a triplane autoencoder with a 3D-aware cross-attention mechanism to compress 3D models into a compact latent space. It then utilizes a diffusion prior model to estimate shape embeddings from image embeddings, and a triplane diffusion model generates 3D models conditioned on both shape and image embeddings. Compress3D outperforms state-of-the-art methods in terms of FID and CLIP similarity, indicating superior generation quality. It requires significantly less training data and time compared to previous approaches. The method enables fast generation of high-quality 3D assets in approximately 7 seconds on a single A100 GPU. The model's performance might be further improved by exploring alternative 3D representations beyond FlexiCubes. Investigating the generalization ability of Compress3D on more diverse datasets with complex scenes and objects could be beneficial. 3d generation, diffusion model, triplane representation, latent space compression, shape embedding
2403.13447 Report HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to the trained model with static parameters) that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA. This paper introduces HyperLLaVA, an enhanced Multimodal Large Language Model (MLLM) that adaptively tunes both projector and LLM parameters using dynamic visual and language experts derived from HyperNetworks. Existing MLLMs often rely on static tuning, limiting their flexibility and performance across diverse multimodal tasks. HyperLLaVA addresses this limitation by dynamically adapting to visual and language inputs, resulting in superior performance. HyperLLaVA employs a two-stage training process: 1) Visual-language alignment: A visual expert dynamically adjusts the projector's output based on visual features. 2) Multimodal instruction tuning: A language expert dynamically models LLM layers guided by intermediate LLM outputs. HyperLLaVA significantly outperforms LLaVA and other state-of-the-art MLLMs on 11 out of 12 benchmarks, including VQA, image captioning, and visual reasoning tasks. The dynamic tuning approach in HyperLLaVA proves more effective than static tuning, demonstrating its ability to generate adaptive visual tokens and instruction-specific features. HyperLLaVA's language expert functions as a parameter-efficient fine-tuning method, achieving comparable performance to traditional methods while updating fewer parameters. The impact of varying the size of the visual and language experts on performance needs further investigation. Exploring the application of dynamic tuning to other MLLM architectures and pretraining objectives could be a promising future direction. multimodal large language model, hypernetwork, dynamic tuning, parameter-efficient fine-tuning, vision-language alignment
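A toy version of the "hypernetwork generates adaptive parameter shifts" idea applied to the vision-language projector: a small network reads a pooled visual feature and emits a low-rank weight delta added to the static projection. The dimensions, pooling, and low-rank parameterization are illustrative assumptions, not HyperLLaVA's actual expert design.

```python
import torch
import torch.nn as nn

class DynamicProjector(nn.Module):
    """Static projector plus a hypernetwork-generated, per-sample low-rank shift."""
    def __init__(self, vis_dim=1024, txt_dim=4096, rank=8):
        super().__init__()
        self.base = nn.Linear(vis_dim, txt_dim)
        self.hyper = nn.Linear(vis_dim, rank * (vis_dim + txt_dim))  # emits A and B factors
        self.rank, self.vis_dim, self.txt_dim = rank, vis_dim, txt_dim

    def forward(self, vis_tokens):                         # (B, N, vis_dim)
        guide = vis_tokens.mean(dim=1)                     # (B, vis_dim) visual guidance
        ab = self.hyper(guide)
        A = ab[:, : self.rank * self.vis_dim].view(-1, self.rank, self.vis_dim)
        B = ab[:, self.rank * self.vis_dim :].view(-1, self.txt_dim, self.rank)
        delta_w = torch.einsum('bdr,brv->bdv', B, A)       # (B, txt_dim, vis_dim) shift
        return self.base(vis_tokens) + torch.einsum('bnv,bdv->bnd', vis_tokens, delta_w)

proj = DynamicProjector()
tokens = proj(torch.randn(2, 16, 1024))                    # (2, 16, 4096) adapted visual tokens
```

The language expert follows the same pattern but conditions the generated shifts on intermediate LLM outputs, which is why it also behaves like a parameter-efficient fine-tuning module.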
2403.13438 Report See, Imagine, Plan: Discovering and Hallucinating Tasks from a Single Image Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, Andrew Markham Humans can not only recognize and understand the world in its current state but also envision future scenarios that extend beyond immediate perception. To resemble this profound human capacity, we introduce zero-shot task hallucination -- given a single RGB image of any scene comprising unknown environments and objects, our model can identify potential tasks and imagine their execution in a vivid narrative, realized as a video. We develop a modular pipeline that progressively enhances scene decomposition, comprehension, and reconstruction, incorporating VLM for dynamic interaction and 3D motion planning for object trajectories. Our model can discover diverse tasks, with the generated task videos demonstrating realistic and compelling visual outcomes that are understandable by both machines and humans. Project Page: https://dannymcy.github.io/zeroshot_task_hallucination/ This paper introduces 'zero-shot task hallucination,' enabling a model to identify potential tasks from a single RGB image of an unknown scene and generate a video visualizing the task execution. This work mimics the human ability to envision and plan future scenarios from visual perception, potentially leading to applications like robotic task discovery and interactive visual guidance. The paper proposes a modular pipeline combining VLM for task discovery, 2D/3D scene reconstruction, a novel axes-constrained 3D planning approach for object trajectory generation, and rendering for video creation. The model discovers diverse and contextually relevant tasks within various scenes. Generated videos demonstrate realistic object manipulation aligned with task descriptions. Human evaluation confirms the visual quality and interpretability of the generated task videos. The quality of generated videos can be influenced by the performance of individual components, such as segmentation or 3D reconstruction. The current approach primarily focuses on rigid object manipulation, with future work exploring deformable objects and more complex interactions. task hallucination, vision-language models, 3d scene reconstruction, motion planning, video generation
2403.13408 Report S2DM: Sector-Shaped Diffusion Models for Video Generation Haoran Lang, Yuxuan Ge, Zheng Tian Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining the consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align frames of videos with desired temporal features while preserving consistent semantic and stochastic features. In this work, we propose a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point. S2DM can generate a group of intrinsically related data sharing the same semantic and stochastic features while varying on temporal features with appropriate guided conditions. We apply S2DM to video generation tasks, and explore the use of optical flow as temporal conditions. Our experimental results show that S2DM outperforms many existing methods in the task of video generation without any temporal-feature modelling modules. For text-to-video generation tasks where temporal conditions are not explicitly given, we propose a two-stage generation strategy which can decouple the generation of temporal features from semantic-content features. We show that, without additional training, our model integrated with another temporal conditions generative model can still achieve comparable performance with existing works. Our results can be viewed at https://s2dm.github.io/S2DM/. This paper introduces S2DM, a novel Sector-Shaped Diffusion Model for generating videos with high consistency and coherence by modeling the generation process as a sector-shaped diffusion region. Generating consistent and continuous videos using diffusion models is challenging due to the difficulty in aligning video frames with desired temporal features while preserving semantic and stochastic features. S2DM employs a sector-shaped diffusion region formed by multiple ray-shaped reverse diffusion processes starting from the same noise point. Each process is guided by identical semantic conditions and varying temporal conditions to ensure consistency and temporal alignment. S2DM outperforms existing methods in optical flow-guided video generation tasks on MHAD and MUG datasets. A two-stage text-to-video generation strategy using S2DM achieves comparable results to state-of-the-art methods. Ablation studies confirm the effectiveness of the shared noise assumption in S2DM for maintaining video consistency. The current method of incorporating semantic and temporal conditions could be improved for better control. Exploring additional temporal conditions beyond optical flow would further demonstrate the generality of S2DM. video generation, diffusion models, optical flow, text-to-video, consistency
2403.13352 Report AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation Jingkun An, Yinghao Zhu, Zongjian Li, Haoran Feng, Bohua Chen, Yemin Shi, Chengwei Pan Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in prompt-following ability, image quality, and the lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLM) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and SDXL, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques. AGFSync is a novel framework that leverages AI-generated feedback and Direct Preference Optimization (DPO) to improve the quality of text-to-image generation. Existing methods for enhancing text-to-image generation often rely on expensive human-labeled data and may not fully capture the nuances of image quality across different aspects like style, coherence, and aesthetics. The framework uses LLMs to generate diverse textual prompts and corresponding question-answer pairs. Then, it uses a VQA model, CLIP score, and aesthetic scoring model to evaluate the generated images. Finally, it applies DPO to fine-tune the diffusion model based on the constructed preference pairs. Significantly improves image quality across different models and benchmarks, as demonstrated by higher VQA, CLIP, and aesthetic scores. Generates images that are more faithful to the input prompts and exhibit better coherence with real-world rules. Achieves a 100% data conversion efficiency compared to lower rates in methods like DreamSync. The performance of AGFSync is dependent on the capabilities and potential biases of the LLMs and aesthetic scoring models used. The introduction of random noise for image diversity might sometimes lead to reduced consistency between some images and their prompts. text-to-image generation, diffusion models, direct preference optimization, ai feedback, image quality evaluation
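A small illustrative sketch of how AI-feedback scores could be assembled into DPO preference pairs, roughly in the spirit of the pipeline described above. The score fields, weights, and best-vs-worst pairing rule are assumptions, not AGFSync's actual scoring code.

```python
# Illustrative sketch (not the AGFSync code): rank candidate images for each
# prompt by a weighted AI-feedback score and keep (chosen, rejected) pairs for DPO.
from dataclasses import dataclass

@dataclass
class Candidate:
    image_path: str
    vqa_score: float        # faithfulness from a VQA model (placeholder value)
    clip_score: float       # text-image similarity (placeholder value)
    aesthetic_score: float  # aesthetic predictor output (placeholder value)

def total_score(c: Candidate, w=(0.5, 0.3, 0.2)) -> float:
    return w[0] * c.vqa_score + w[1] * c.clip_score + w[2] * c.aesthetic_score

def build_preference_pairs(prompt: str, candidates: list) -> list:
    ranked = sorted(candidates, key=total_score, reverse=True)
    # simplest choice: best vs. worst candidate for this prompt
    return [{"prompt": prompt, "chosen": ranked[0].image_path,
             "rejected": ranked[-1].image_path}]

cands = [Candidate("a.png", 0.9, 0.31, 5.8), Candidate("b.png", 0.6, 0.28, 5.1),
         Candidate("c.png", 0.8, 0.33, 6.2)]
print(build_preference_pairs("a red bicycle leaning on a fence", cands))
```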
2403.13304 Report DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception Yibo Wang, Ruiyuan Gao, Kai Chen, Kaiqiang Zhou, Yingjie Cai, Lanqing Hong, Zhenguo Li, Lihui Jiang, Dit-Yan Yeung, Qiang Xu, Kai Zhang Current perceptive models heavily depend on resource-intensive datasets, prompting the need for innovative solutions. Leveraging recent advances in diffusion models, synthetic data, by constructing image inputs from various annotations, proves beneficial for downstream tasks. While prior methods have separately addressed generative and perceptive models, DetDiffusion, for the first time, harmonizes both, tackling the challenges in generating effective data for perceptive models. To enhance image generation with perceptive models, we introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability. To boost the performance of specific perceptive models, our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation. Experimental results from the object detection task highlight DetDiffusion's superior performance, establishing a new state-of-the-art in layout-guided generation. Furthermore, image syntheses from DetDiffusion can effectively augment training data, significantly enhancing downstream detection performance. This paper introduces DetDiffusion, a novel framework that leverages the synergy between generative and perceptive models to enhance controlled image generation and improve the performance of downstream perception tasks. Existing perceptive models rely heavily on large, labeled datasets, which are expensive to obtain. DetDiffusion addresses this by generating synthetic data tailored for perception tasks, potentially improving data efficiency and model performance. DetDiffusion integrates perception-aware attributes (P.A. Attr) extracted from a pre-trained detector and a perception-aware loss (P.A. loss) based on segmentation into a geometric-aware diffusion model. DetDiffusion achieves state-of-the-art performance in layout-guided image generation, surpassing previous methods in FID and YOLO score. Synthetic data generated by DetDiffusion effectively augments training data, leading to significant improvements in downstream object detection performance. The framework demonstrates control over the difficulty of generated images by manipulating the perception-aware attributes, enabling the generation of challenging examples for improved training. Currently, DetDiffusion primarily focuses on object detection tasks. Expanding its applicability to other perception tasks is a potential direction for future research. Further exploration of how to generate high-quality, human-aligned images while mitigating harmful or toxic content is crucial for practical applications. generative models, perceptive models, diffusion models, synthetic data generation, object detection
2403.13187 Report Evolutionary Optimization of Model Merging Recipes Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development. The paper introduces Evolutionary Model Merge, a novel approach using evolutionary algorithms to automatically discover optimal combinations of open-source foundation models, creating new models with enhanced capabilities without extensive training. Model merging, while promising for its cost-effectiveness, currently relies on human intuition and domain knowledge, limiting its potential. This paper presents an automated approach to overcome this limitation and democratize foundation model development. The method leverages evolutionary algorithms to optimize model merging in both parameter space (e.g., using DARE-TIES for weight merging) and data flow space (e.g., evolving the inference path through model layers), enabling exploration of a wider range of model combinations. Generated a Japanese LLM with Math reasoning abilities, achieving state-of-the-art performance on Japanese LLM benchmarks, surpassing even larger models. Created a culturally-aware Japanese VLM that excels in describing Japanese culture-specific content, outperforming existing Japanese VLMs. Demonstrated the effectiveness of combining parameter space and data flow space merging for enhanced model capabilities. The merged models inherit limitations of the source models, such as potential for logical inconsistencies. The study does not include instruction fine-tuning or alignment, which could lead to factually flawed outputs. evolutionary algorithms, model merging, foundation models, language models, vision-language models
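A toy sketch of evolutionary search over parameter-space merging, assuming a simple (1+lambda) strategy over per-group interpolation coefficients and a stand-in fitness function; the paper's actual recipe (e.g., DARE-TIES merging and data-flow-space search mentioned above) is richer than this.

```python
# Minimal sketch (not the paper's recipe): evolve per-group interpolation weights
# for merging two checkpoints in parameter space. The fitness here is a toy
# stand-in for a held-out task benchmark.
import random
import torch

def merge(state_a, state_b, alphas):
    keys = sorted(state_a)
    return {k: alphas[i % len(alphas)] * state_a[k]
               + (1 - alphas[i % len(alphas)]) * state_b[k]
            for i, k in enumerate(keys)}

def fitness(merged, target):                      # toy proxy for a benchmark score
    return -sum(torch.norm(merged[k] - target[k]).item() for k in merged)

def evolve(state_a, state_b, target, n_groups=4, gens=30, children=8, sigma=0.1):
    best = [0.5] * n_groups
    best_fit = fitness(merge(state_a, state_b, best), target)
    for _ in range(gens):
        for _ in range(children):
            cand = [min(1.0, max(0.0, a + random.gauss(0.0, sigma))) for a in best]
            f = fitness(merge(state_a, state_b, cand), target)
            if f > best_fit:
                best, best_fit = cand, f
    return best, best_fit

sa = {f"layer{i}.weight": torch.randn(16, 16) for i in range(4)}
sb = {f"layer{i}.weight": torch.randn(16, 16) for i in range(4)}
tgt = {k: 0.3 * sa[k] + 0.7 * sb[k] for k in sa}  # pretend benchmark optimum
print(evolve(sa, sb, tgt))
```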
2403.13163 Report DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu Blurry images may contain local and global non-uniform artifacts, which complicate the deblurring process and make it more challenging to achieve satisfactory results. Recently, Transformers have produced better deblurring outcomes than existing CNN architectures. However, the large model size and long inference time are still two bothersome issues which have not been fully explored. To this end, we propose DeblurDiNAT, a compact encoder-decoder Transformer which efficiently restores clean images from real-world blurry ones. We adopt an alternating dilation factor structure with the aim of global-local feature learning. Also, we observe that simply using self-attention layers in networks does not always produce good deblurred results. To solve this problem, we propose a channel modulation self-attention (CMSA) block, where a cross-channel learner (CCL) is utilized to capture channel relationships. In addition, we present a divide and multiply feed-forward network (DMFN) allowing fast feature propagation. Moreover, we design a lightweight gated feature fusion (LGFF) module, which performs controlled feature merging. Comprehensive experimental results show that the proposed model, named DeblurDiNAT, provides a favorable performance boost without introducing noticeable computational costs over the baseline, and achieves state-of-the-art (SOTA) performance on several image deblurring datasets. Compared to nearest competitors, our space-efficient and time-saving method demonstrates a stronger generalization ability with 3%-68% fewer parameters and produces deblurred images that are visually closer to the ground truth. This paper presents DeblurDiNAT, a lightweight and effective Transformer for image deblurring, which leverages dilated neighborhood attention and channel modulation to capture global-local blur information efficiently. Existing Transformer-based image deblurring methods often struggle to balance computational efficiency with deblurring accuracy, making it challenging to achieve optimal results without high computational costs. The proposed DeblurDiNAT utilizes an alternating dilation factor structure with dilated neighborhood attention for capturing both global and local blur patterns. It introduces a channel modulation self-attention block (CMSA) to capture cross-channel relationships effectively. Additionally, it employs a divide and multiply feed-forward network (DMFN) for fast feature propagation and a lightweight gated feature fusion (LGFF) module for efficient feature aggregation. DeblurDiNAT-L achieves state-of-the-art performance on GoPro and HIDE datasets while being significantly faster and requiring less memory than competitors. The proposed method demonstrates superior generalization ability, outperforming existing models on real-world datasets RealBlur-R and RealBlur-J. Ablation studies confirm the effectiveness of each proposed component (ADFS, CMSA, DMFN, LGFF) in improving deblurring performance and efficiency. The current implementation of DeblurDiNAT focuses on single-image deblurring. Exploring the potential of DeblurDiNAT for video deblurring could be a promising direction. image deblurring, transformer, dilated neighborhood attention, channel modulation, lightweight model
2403.13064 Report SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, Vasileios Balntas We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers & LLMs, and departs from more traditional methods which commonly describe scenes as meshes, voxel grids, point clouds or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Aria Synthetic Environments consisting of 100k high-quality in-door scenes, with photorealistic and ground-truth annotated renders of egocentric scene walkthroughs. Our method gives state-of-the art results in architectural layout estimation, and competitive results in 3D object detection. Lastly, we explore an advantage for SceneScript, which is the ability to readily adapt to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction. Introduces SceneScript, a method for reconstructing 3D scenes by predicting a sequence of structured language commands from egocentric videos. Provides a compact, editable, and interpretable scene representation that can be readily extended to new tasks, bridging the gap between 3D reconstruction and language models. Uses an encoder-decoder architecture with different encoder options (point cloud, posed images, combined) and a transformer decoder to predict a sequence of language commands describing walls, doors, windows, bounding boxes, and more. Achieves state-of-the-art architectural layout estimation on the proposed Aria Synthetic Environments (ASE) dataset. Shows competitive 3D object detection performance on ASE and ScanNet by simply adding a bounding box command. Demonstrates extensibility by incorporating commands for coarse 3D object parts, curved entities, entity compositions, and object states. Structured language commands are currently manually defined. Capturing fine-grained geometric details with high precision remains challenging due to the high-level nature of the commands. 3d reconstruction, scene representation, structured language, egocentric vision, synthetic datasets
2403.13044 Report Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos Hadi Alzayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, Michael Gharbi We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts. Yet, it adapts it to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: objects and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. We supervise our model to translate the warped image into the ground truth, starting from a pretrained diffusion model. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that by using simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user's input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects. This paper introduces Magic Fixup, a novel diffusion-based image editing method that allows users to create photorealistic edits through a simple 'cut-and-transform' interface. Existing image editing tools often require extensive manual work or struggle to preserve realism and faithfulness to user input. This method seeks to bridge this gap by combining intuitive user controls with the power of generative models. The approach leverages a dual diffusion model setup. A 'detail extractor' model processes the original image to capture fine-grained details, while a 'synthesizer' model generates the final output, guided by the user's coarse edit and the extracted details. The models are trained on a dataset of paired video frames, where the input frame is automatically warped to match the target frame using flow-based and piecewise affine motion models. Magic Fixup demonstrates superior performance in preserving object identity and generating realistic details compared to existing editing tools, as evidenced by qualitative results and a user study. The use of video data and the proposed motion models is crucial for training a model capable of realistic and faithful image recomposition and reposing. The cross-attention mechanism for detail transfer significantly improves the model's ability to harmonize edits and maintain realism. The model's ability to handle out-of-domain images (e.g., paintings) is limited by the video-based training data. The method inherits the limitations of the underlying diffusion models, particularly in areas like generating hands and faces. image editing, diffusion models, generative models, video data, user interface
2403.13043 Report When Do We Not Need Larger Vision Models? Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S^2), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation. Notably, S^2 achieves state-of-the-art performance in detailed understanding of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the conditions under which S^2 is a preferred scaling approach compared to scaling on model size. While larger models have the advantage of better generalization on hard examples, we show that features of larger vision models can be well approximated by those of multi-scale smaller models. This suggests most, if not all, of the representations learned by current large pre-trained models can also be obtained from multi-scale smaller models. Our results show that a multi-scale smaller model has comparable learning capacity to a larger model, and pre-training smaller models with S^2 can match or even exceed the advantage of larger models. We release a Python package that can apply S^2 on any vision model with one line of code: https://github.com/bfshi/scaling_on_scales. This paper challenges the assumption that larger vision models are always better, proposing "Scaling on Scales" (S^2) where a smaller model is run on multiple image scales instead of increasing model size. Scaling model size, while effective, is resource-intensive. S^2 offers a potentially more efficient way to achieve comparable or better visual understanding. The authors introduce "S^2-Wrapper," a mechanism to apply multi-scale processing to any pre-trained vision model without additional parameters. They compare S^2 with traditional model size scaling across tasks like image classification, segmentation, depth estimation, MLLM benchmarks, and robotic manipulation. Smaller models with S^2 often match or outperform larger models on various tasks, achieving state-of-the-art on MLLM visual detail understanding (V* benchmark). Larger models show advantage on hard examples, but their features can be largely approximated by those of multi-scale smaller models. Pre-training with S^2 further improves smaller models, suggesting comparable learning capacity to larger counterparts. The optimal balance between model size and image scales needs further exploration for different pre-trained models. Future work includes exploring scale-selective processing and parallel processing of single images with S^2. multi-scale representation learning, vision transformer, model scaling, multimodal learning, robotic manipulation
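A simplified sketch of the scaling-on-scales idea, assuming a frozen torchvision backbone: run the same model on several input scales, pool the resulting feature maps back to a common grid, and concatenate along channels. The paper's S^2-Wrapper handles larger scales by splitting them into sub-images of the original size; this sketch skips that detail and simply resizes whole images.

```python
# Simplified sketch of multi-scale feature extraction with a frozen backbone
# (not the released package): resize, extract, pool, concatenate.
import torch
import torch.nn.functional as F
import torchvision.models as models

backbone = models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()  # keep feature maps

@torch.no_grad()
def multiscale_features(img, scales=(1.0, 2.0), base_size=224):
    feats = []
    for s in scales:
        x = F.interpolate(img, size=int(base_size * s), mode="bilinear",
                          align_corners=False)
        f = backbone(x)                                # (B, C, h_s, w_s)
        f = F.adaptive_avg_pool2d(f, output_size=7)    # pool back to the base grid
        feats.append(f)
    return torch.cat(feats, dim=1)                     # (B, C * len(scales), 7, 7)

img = torch.randn(1, 3, 224, 224)
print(multiscale_features(img).shape)   # torch.Size([1, 4096, 7, 7])
```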
2403.12966 Report Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, Jiwen Lu In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications. However, it is challenging for the visual encoder in Large Vision-Language Models (LVLMs) to extract useful features tailored to questions that aid the language model's response. Furthermore, a common practice among existing LVLMs is to utilize lower-resolution images, which restricts the ability for visual recognition. Our work introduces the Chain-of-Spot (CoS) method, which we describe as Interactive Reasoning, a novel approach that enhances feature extraction by focusing on key regions of interest (ROI) within the image, corresponding to the posed questions or instructions. This technique allows LVLMs to access more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features. By integrating Chain-of-Spot with instruct-following LLaVA-1.5 models, the process of image reasoning consistently improves performance across a wide range of multimodal datasets and benchmarks without bells and whistles and achieves new state-of-the-art results. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content, paving the way for more sophisticated visual instruction-following applications. Code and models are available at https://github.com/dongyh20/Chain-of-Spot The paper introduces Chain-of-Spot (CoS), a novel interactive reasoning approach for large vision-language models (LVLMs) that improves visual understanding by guiding models to focus on key regions of interest (ROI) within an image. Existing LVLMs often struggle to extract useful features tailored to specific questions and are limited by the use of lower-resolution images. Chain-of-Spot addresses these issues by providing multi-granularity image features and enabling more focused analysis. CoS uses a relevance map between language tokens and image features to identify the ROI. During inference, the model first identifies the ROI and then uses both the global and cropped ROI features to generate the response. CoS significantly improves the performance of LLaVA-1.5 on various visual question answering and multimodal benchmarks. The method achieves state-of-the-art results on multiple datasets, including VQAv2, GQA, VizWiz, SEEDBench, MMBench, and MM-Vet. Analysis shows that CoS effectively guides the model's focus to relevant image regions, improving reasoning and accuracy. One limitation is the potential for insufficient training data to adequately guide ROI identification. Future work could explore expanding the training dataset and investigating the ethical implications of enhanced LVLMs. large vision-language models, interactive reasoning, chain-of-spot, region of interest, multimodal learning
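A hypothetical helper illustrating the ROI step: convert a token-to-patch relevance map into a bounding-box crop so that the model can attend to a zoomed-in view alongside the global image. The function name, thresholding rule, and sizes are assumptions, not the released Chain-of-Spot code.

```python
# Illustrative helper (not the released implementation): turn a relevance map over
# the patch grid into a cropped, resized region-of-interest view.
import torch
import torch.nn.functional as F

def crop_roi(image, relevance, keep=0.25, out_size=336):
    """image: (3, H, W); relevance: (h, w) relevance of patches to the question."""
    H, W = image.shape[1:]
    rel = F.interpolate(relevance[None, None], size=(H, W), mode="bilinear",
                        align_corners=False)[0, 0]
    thresh = torch.quantile(rel.flatten(), 1.0 - keep)   # keep the top-`keep` region
    ys, xs = torch.nonzero(rel >= thresh, as_tuple=True)
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    crop = image[:, y0:y1, x0:x1]
    return F.interpolate(crop[None], size=out_size, mode="bilinear",
                         align_corners=False)[0]

img = torch.rand(3, 336, 336)
rel = torch.rand(24, 24)   # e.g., averaged attention from answer tokens to patches
roi = crop_roi(img, rel)   # (3, 336, 336) zoomed view fed back with the global image
```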
2403.12965 Report Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment Mengting Chen, Xi Chen, Zhonghua Zhai, Chen Ju, Xuewen Hong, Jinsong Lan, Shuai Xiao This paper introduces a novel framework for virtual try-on, termed Wear-Any-Way. Different from previous methods, Wear-Any-Way is a customizable solution. Besides generating high-fidelity results, our method supports users to precisely manipulate the wearing style. To achieve this goal, we first construct a strong pipeline for standard virtual try-on, supporting single/multiple garment try-on and model-to-model settings in complicated scenarios. To make it manipulable, we propose sparse correspondence alignment which involves point-based control to guide the generation for specific locations. With this design, Wear-Any-Way gets state-of-the-art performance for the standard setting and provides a novel interaction form for customizing the wearing style. For instance, it supports users to drag the sleeve to make it rolled up, drag the coat to make it open, and utilize clicks to control the style of tuck, etc. Wear-Any-Way enables more liberated and flexible expressions of the attires, holding profound implications in the fashion industry. This paper presents Wear-Any-Way, a novel framework for virtual try-on that not only generates high-fidelity results but also allows users to customize wearing styles. Existing virtual try-on methods often lack detail fidelity and controllability over garment wearing style, limiting their application in fashion. The proposed approach leverages a dual-branch diffusion model with a reference U-Net for detail preservation and a sparse correspondence alignment module for point-based manipulation. Wear-Any-Way achieves state-of-the-art performance on standard virtual try-on benchmarks, outperforming existing methods in fidelity and detail. The method supports flexible customization, enabling users to control garment features like sleeve rolls, coat openness, and tuck styles through click-and-drag interactions. A novel point-pair collection pipeline based on a Siamese diffusion model is proposed to effectively learn garment-person correspondence. The method might generate artifacts for fine details like hands, especially in lower resolutions. Future work includes exploring higher resolution models and addressing the challenge of generating complex garment interactions (e.g., multiple layers of clothing). virtual try-on, customizable generation, diffusion model, point-based control, sparse correspondence alignment
2403.12963 Report FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, Hongsheng Li In this study, we delve into the generation of high-resolution images from pre-trained diffusion models, addressing persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their trained resolutions. To address this issue, we introduce an innovative, training-free approach FouriScale from the perspective of frequency domain analysis. We replace the original convolutional layers in pre-trained diffusion models by incorporating a dilation technique along with a low-pass operation, intending to achieve structural consistency and scale consistency across resolutions, respectively. Further enhanced by a padding-then-crop strategy, our method can flexibly handle text-to-image generation of various aspect ratios. By using the FouriScale as guidance, our method successfully balances the structural integrity and fidelity of generated images, achieving an astonishing capacity of arbitrary-size, high-resolution, and high-quality generation. With its simplicity and compatibility, our method can provide valuable insights for future explorations into the synthesis of ultra-high-resolution images. The code will be released at https://github.com/LeonHLJ/FouriScale. This paper introduces FouriScale, a training-free method to generate high-resolution images from pre-trained diffusion models by addressing the issue of repetitive patterns and structural distortions often seen in upscaling. Existing diffusion models are typically trained at limited resolutions, and applying them to higher resolutions often leads to undesirable artifacts and inconsistencies. FouriScale offers a way to overcome these limitations without needing to retrain the model. FouriScale analyzes the problem in the frequency domain and introduces two key operations: 1) dilated convolution to maintain structural consistency, and 2) low-pass filtering to ensure scale consistency across resolutions. A padding-then-crop strategy is used for arbitrary aspect ratios, and a guidance mechanism further improves image quality. FouriScale outperforms existing training-free methods in quantitative metrics like FID and KID, showing better image quality and diversity at higher resolutions. The method effectively reduces repetitive patterns and preserves structural details even with significant upscaling factors (up to 16x). FouriScale is shown to be compatible with various pre-trained models like SD 1.5, SD 2.1, and SDXL, and can be integrated with techniques like LoRA. While effective at high resolutions, FouriScale still faces challenges with ultra-high resolutions (e.g., 4096x4096) where artifacts might occur. The current implementation primarily focuses on convolutional layers, limiting its application to purely transformer-based diffusion models. diffusion model, training-free, high-resolution synthesis, frequency domain analysis, text-to-image generation
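A minimal sketch of the frequency-domain ingredient, assuming a simple circular low-pass mask applied to feature maps via torch.fft; the cutoff choice and mask shape are illustrative and do not reproduce the official FouriScale filters or its dilated-convolution replacement.

```python
# Minimal sketch (not the official FouriScale code): remove high frequencies from a
# feature map so that an upscaled input keeps a spectrum closer to training resolution.
import torch

def low_pass(feat, cutoff=0.25):
    """feat: (B, C, H, W); keep frequencies within `cutoff` of the Nyquist radius."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    fy = torch.linspace(-0.5, 0.5, H).view(H, 1)
    fx = torch.linspace(-0.5, 0.5, W).view(1, W)
    mask = ((fy ** 2 + fx ** 2).sqrt() <= cutoff * 0.5).to(feat.dtype)
    spec = spec * mask
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

x = torch.randn(1, 4, 128, 128)
y = low_pass(x)   # same shape, high-frequency content suppressed
```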
2403.12962 Report FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential application in video domains. Zero-shot methods seek to extend image diffusion models to videos without necessitating model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However, the soft constraint imposed on determining where to attend to valid features can sometimes be insufficient, resulting in temporal inconsistency. In this paper, we introduce FRESCO, intra-frame correspondence alongside inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance, our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video, significantly improving the visual coherence of the resulting translated videos. Extensive experiments demonstrate the effectiveness of our proposed framework in producing high-quality, coherent videos, marking a notable improvement over existing zero-shot methods. This paper introduces FRESCO, a novel zero-shot diffusion framework that leverages both inter-frame and intra-frame correspondences for coherent and flexible video translation. Existing zero-shot video translation methods, while promising, struggle with temporal inconsistencies, particularly in scenarios with occlusion or rapid motion. This work addresses these limitations by introducing intra-frame spatial correspondence as a key constraint. FRESCO adapts a pre-trained image diffusion model for videos using two key mechanisms: 1) FRESCO-aware feature optimization, which directly optimizes decoder features to align with the spatial-temporal coherence of the input video. 2) FRESCO-guided attention, which incorporates spatial and temporal cues to guide the attention mechanism in the U-Net. FRESCO effectively addresses temporal inconsistencies observed in previous methods, producing significantly more coherent results. The framework's modular design allows for independent analysis of spatial and temporal adaptations, demonstrating their individual contributions to overall performance. FRESCO exhibits high compatibility with existing image diffusion techniques, enabling its application in other video editing tasks such as colorization. While effective, FRESCO's reliance on optical flow from the original video may limit its ability to handle large shape deformations. Future work could explore adaptive combinations with pixel-level alignment methods and incorporate learned motion priors for handling larger deformations. video translation, diffusion models, zero-shot learning, temporal consistency, spatial correspondence
2403.12960 Report FaceXFormer: A Unified Transformer for Facial Analysis Kartik Narayan, Vibashan VS, Rama Chellappa, Vishal M. Patel In this work, we introduce FaceXformer, an end-to-end unified transformer model for a comprehensive range of facial analysis tasks such as face parsing, landmark detection, head pose estimation, attributes recognition, and estimation of age, gender, race, and landmarks visibility. Conventional methods in face analysis have often relied on task-specific designs and preprocessing techniques, which limit their approach to a unified architecture. Unlike these conventional methods, our FaceXformer leverages a transformer-based encoder-decoder architecture where each task is treated as a learnable token, enabling the integration of multiple tasks within a single framework. Moreover, we propose a parameter-efficient decoder, FaceX, which jointly processes face and task tokens, thereby learning generalized and robust face representations across different tasks. To the best of our knowledge, this is the first work to propose a single model capable of handling all these facial analysis tasks using transformers. We conducted a comprehensive analysis of effective backbones for unified face task processing and evaluated different task queries and the synergy between them. We conduct experiments against state-of-the-art specialized models and previous multi-task models in both intra-dataset and cross-dataset evaluations across multiple benchmarks. Additionally, our model effectively handles images "in-the-wild," demonstrating its robustness and generalizability across eight different tasks, all while maintaining the real-time performance of 37 FPS. This paper introduces FaceXformer, a unified transformer-based model for eight facial analysis tasks: face parsing, landmark detection, head pose estimation, attributes recognition, age estimation, gender estimation, race estimation, and landmarks visibility prediction. Existing facial analysis models are often task-specific, limiting their applicability to multiple tasks and hindering the development of a single unified model. A unified model offers several advantages: learning robust and generalized face representations, modeling intra-task relationships, and enhancing overall performance through task synergy. FaceXformer uses a transformer-based encoder-decoder architecture. It leverages multi-scale features from the input face image and fuses them into a unified representation. Each facial analysis task is treated as a unique, learnable token processed by a parameter-efficient decoder (FaceX) to interact with the unified face representation. Task-specific predictions are then generated from the refined task tokens. FaceXformer achieves state-of-the-art performance in face parsing and attributes recognition. It demonstrates competitive performance in landmark detection and head pose estimation compared to leading methods. The model effectively handles in-the-wild images, showing robustness and generalization across all eight tasks while maintaining real-time performance (37 FPS). While FaceXformer supports tokens for various tasks, it lacks full interactivity and promptability. It does not achieve state-of-the-art performance in tasks like landmark detection and head pose estimation due to not utilizing auxiliary information and advanced representations. facial analysis, transformer, multi-task learning, computer vision, deep learning
2403.12957 Report GVGEN: Text-to-3D Generation with Volumetric Representation Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, Tong He In recent years, 3D Gaussian splatting has emerged as a powerful technique for 3D reconstruction and generation, known for its fast and high-quality rendering capabilities. To address these shortcomings, this paper introduces a novel diffusion-based framework, GVGEN, designed to efficiently generate 3D Gaussian representations from text input. We propose two innovative techniques: (1) Structured Volumetric Representation. We first arrange disorganized 3D Gaussian points as a structured form GaussianVolume. This transformation allows the capture of intricate texture details within a volume composed of a fixed number of Gaussians. To better optimize the representation of these details, we propose a unique pruning and densifying method named the Candidate Pool Strategy, enhancing detail fidelity through selective optimization. (2) Coarse-to-fine Generation Pipeline. To simplify the generation of GaussianVolume and empower the model to generate instances with detailed 3D geometry, we propose a coarse-to-fine pipeline. It initially constructs a basic geometric structure, followed by the prediction of complete Gaussian attributes. Our framework, GVGEN, demonstrates superior performance in qualitative and quantitative assessments compared to existing 3D generation methods. Simultaneously, it maintains a fast generation speed (~7 seconds), effectively striking a balance between quality and efficiency. This paper proposes GVGEN, a novel diffusion-based framework for generating 3D Gaussian representations directly from text descriptions. Generating 3D models from text descriptions is important for various industries. Existing methods either lack diversity, require long inference times, or produce low-resolution assets. This work aims to overcome these limitations by directly generating 3D Gaussians from text. The proposed method utilizes a two-stage approach: 1) GaussianVolume Fitting: Organizes 3D Gaussian points into a structured volumetric form (GaussianVolume) using a novel Candidate Pool Strategy for pruning and densification. 2) Text-to-3D Generation: Employs a coarse-to-fine pipeline. First, a diffusion model generates a coarse geometry volume (Gaussian Distance Field). Then, a 3D U-Net predicts detailed Gaussian attributes based on the generated geometry and text input. GVGEN demonstrates superior performance in qualitative and quantitative assessments compared to existing 3D generation methods. The method achieves a fast generation speed (approximately 7 seconds). GVGEN effectively balances generation quality and efficiency. The performance of GVGEN is limited when presented with text inputs that significantly deviate from the training data domain. Scaling up the model to handle millions of objects for increased diversity presents a challenge due to the time-consuming nature of fitting GaussianVolume for each object. text-to-3d generation, 3d gaussian splatting, diffusion models, volumetric representation, deep learning
2403.12915 Report Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model Jiajie Yang We introduce the Pyramid Diffusion Model (PDM), a novel architecture designed for ultra-high-resolution image synthesis. PDM utilizes a pyramid latent representation, providing a broader design space that enables more flexible, structured, and efficient perceptual compression which enable AutoEncoder and Network of Diffusion to equip branches and deeper layers. To enhance PDM's capabilities for generative tasks, we propose the integration of Spatial-Channel Attention and Res-Skip Connection, along with the utilization of Spectral Norm and Decreasing Dropout Strategy for the Diffusion Network and AutoEncoder. In summary, PDM achieves the synthesis of images with a 2K resolution for the first time, demonstrated on two new datasets comprising images of sizes 2048x2048 pixels and 2048x1024 pixels respectively. We believe that this work offers an alternative approach to designing scalable image generative models, while also providing incremental reinforcement for existing frameworks. The paper introduces Pyramid Diffusion Model (PDM), a novel architecture for ultra-high-resolution image synthesis using a pyramid latent representation, enabling efficient perceptual compression and flexible design. Existing models struggle to synthesize ultra-high-resolution images due to limitations in latent representation and network design. PDM addresses these limitations to enable 2K resolution image generation. PDM replaces the single latent in LDMs with a pyramid latent structure, utilizes a Pyramid UNet with branches for each latent scale, and incorporates Spatial-Channel Attention, Res-Skip Connections, Spectral Norm, and a Decreasing Dropout Strategy. Achieved synthesis of 2K resolution images for the first time. Introduced two new datasets, SCAPES2K and PEOPLE2K, containing images with 2048x2048 and 2048x1024 pixels. Visualization of pyramid latent representations shows that different resolutions contribute to distinct image aspects (global concept, local concept, details). Limited evaluation of FID scores on benchmark datasets. Further research on Concept Aliasing and its impact on generative models. diffusion model, image synthesis, high-resolution images, pyramid latent representation, spatial-channel attention
2403.12906 Report TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation Yufei Liu, Junwei Zhu, Junshu Tang, Shijie Zhang, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yunsheng Wu, Dongjin Huang Texturing 3D humans with semantic UV maps remains a challenge due to the difficulty of acquiring reasonably unfolded UV. Despite recent text-to-3D advancements in supervising multi-view renderings using large text-to-image (T2I) models, issues persist with generation speed, text consistency, and texture quality, resulting in data scarcity among existing datasets. We present TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture generation model. Utilizing an efficient texture adaptation finetuning strategy, we adapt large T2I model to a semantic UV structure while preserving its original generalization capability. Leveraging a novel feature translator module, the trained model is capable of generating high-fidelity 3D human textures from either text or image within seconds. Furthermore, we introduce ArTicuLated humAn textureS (ATLAS), the largest high-resolution (1024 X 1024) 3D human texture dataset which contains 50k high-fidelity textures with text descriptions. TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture generation model for texturing 3D humans from text or image inputs. Existing methods for generating 3D human textures are limited by generation speed, consistency, and quality, leading to data scarcity in existing datasets. TexDreamer utilizes a two-step training strategy: 1) Text-to-UV (T2UV) adapts a large text-to-image model to a semantic UV structure with an efficient texture adaptation finetuning strategy, and 2) Image-to-UV (I2UV) translates image features to textual features using a novel feature translator module, enabling texture prediction from images in the T2UV's text feature space. The model is trained on a novel dataset called ATLAS, the largest high-resolution 3D human texture dataset. TexDreamer outperforms state-of-the-art methods in generating high-fidelity textures from both text and image inputs. The model demonstrates high text consistency, effectively capturing identity and clothing details from textual descriptions. TexDreamer enables efficient texture editing and integration with complex 3D human meshes. I2UV's performance on real-life cases may be limited due to its reliance on semantic features rather than precise 2D image segmentation. The realistic texture generation capability raises ethical concerns about potential misuse, such as creating deepfakes. human texture, multimodal, texture synthesis, text-to-3d, image-to-uv
2403.12803 Report DreamDA: Generative Data Augmentation with Diffusion Models Yunxiang Fu, Chaoqi Chen, Yu Qiao, Yizhou Yu The acquisition of large-scale, high-quality data is a resource-intensive and time-consuming endeavor. Compared to conventional Data Augmentation (DA) techniques (e.g. cropping and rotation), exploiting prevailing diffusion models for data generation has received scant attention in classification tasks. Existing generative DA methods either inadequately bridge the domain gap between real-world and synthesized images, or inherently suffer from a lack of diversity. To solve these issues, this paper proposes a new classification-oriented framework DreamDA, which enables data synthesis and label generation by way of diffusion models. DreamDA generates diverse samples that adhere to the original data distribution by considering training images in the original data as seeds and perturbing their reverse diffusion process. In addition, since the labels of the generated data may not align with the labels of their corresponding seed images, we introduce a self-training paradigm for generating pseudo labels and training classifiers using the synthesized data. Extensive experiments across four tasks and five datasets demonstrate consistent improvements over strong baselines, revealing the efficacy of DreamDA in synthesizing high-quality and diverse images with accurate labels. Our code will be available at https://github.com/yunxiangfu2001/DreamDA. This paper proposes DreamDA, a novel data augmentation framework that leverages pre-trained diffusion models to generate diverse images adhering to the real data distribution for improved image classification. High-quality, large-scale data collection is crucial for deep learning but costly. DreamDA addresses this by synthesizing diverse and reliable training data, enhancing model performance. DreamDA perturbs the reverse diffusion process of pre-trained diffusion models by injecting noise into the U-Net bottleneck. It introduces AMST, a self-training paradigm using multiple classifiers to generate reliable pseudo labels for synthesized data, improving label accuracy. DreamDA consistently outperforms conventional and diffusion-based data augmentation techniques, demonstrating superior performance on multiple datasets and tasks. DreamDA effectively mitigates the domain gap between synthetic and real data, achieving excellent FID and MMD scores. The paper provides extensive ablation studies, demonstrating the effectiveness of individual components, such as latent perturbation and AMST. The paper acknowledges the computational cost of data generation and suggests exploring faster sampling techniques in future work. The authors emphasize the need to carefully consider ethical implications when applying generative data augmentation in real-world scenarios. data augmentation, diffusion models, image classification, self-training, generative models
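A conceptual sketch of the bottleneck-perturbation idea using a forward hook on a toy denoiser; the toy network, hook placement, and noise scale are assumptions standing in for DreamDA's actual U-Net and diffusion sampler.

```python
# Conceptual sketch (not the DreamDA implementation): perturb bottleneck features of
# a denoising network via a forward hook so each seed yields diverse variants.
import torch
import torch.nn as nn

class ToyUNet(nn.Module):                      # stand-in for a diffusion U-Net
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(4, 64, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(64, 64, 3, padding=1)          # "bottleneck"
        self.up = nn.ConvTranspose2d(64, 4, 4, stride=2, padding=1)

    def forward(self, x):
        return self.up(self.mid(self.down(x)))

def add_bottleneck_noise(module, scale=0.1):
    def hook(_mod, _inp, out):
        return out + scale * torch.randn_like(out)   # inject Gaussian perturbation
    return module.register_forward_hook(hook)

unet = ToyUNet()
handle = add_bottleneck_noise(unet.mid, scale=0.1)   # active during reverse diffusion
z = torch.randn(1, 4, 64, 64)
variant = unet(z)                                    # each call gives a different sample
handle.remove()
```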
2403.12760 Report WaveFace: Authentic Face Restoration with Efficient Frequency Recovery Yunqi Miao, Jiankang Deng, Jungong Han Although diffusion models are rising as a powerful solution for blind face restoration, they are criticized for two problems: 1) slow training and inference speed, and 2) failure in preserving identity and recovering fine-grained facial details. In this work, we propose WaveFace to solve the problems in the frequency domain, where low- and high-frequency components decomposed by wavelet transformation are considered individually to maximize authenticity as well as efficiency. The diffusion model is applied to recover the low-frequency component only, which presents general information of the original image but 1/16 in size. To preserve the original identity, the generation is conditioned on the low-frequency component of low-quality images at each denoising step. Meanwhile, high-frequency components at multiple decomposition levels are handled by a unified network, which recovers complex facial details in a single step. Evaluations on four benchmark datasets show that: 1) WaveFace outperforms state-of-the-art methods in authenticity, especially in terms of identity preservation, and 2) authentic images are restored with the efficiency 10x faster than existing diffusion model-based BFR methods. This paper proposes WaveFace, an efficient blind face restoration approach that restores authentic images by recovering their frequency components individually. Existing diffusion models for BFR are computationally expensive and often fail to preserve identity and fine-grained facial details. This work addresses these limitations by operating in the frequency domain. The method uses Discrete Wavelet Transform (DWT) to decompose images. It then leverages a Low-frequency Conditional Denoising (LCD) module with a conditional diffusion model for the low-frequency component and a High-Frequency Recovery (HFR) module for high-frequency components at multiple levels. WaveFace outperforms state-of-the-art methods in authenticity, particularly in identity preservation. It achieves up to 10x faster restoration speeds compared to existing diffusion model-based BFR methods. The method effectively balances efficiency and restoration quality by carefully selecting the DWT decomposition level. There's a significant difference between simulated and real-world degradations, impacting performance on real images. Future work will focus on simulating more realistic degradations and exploring better evaluation metrics for BFR. blind face restoration, diffusion models, frequency domain, wavelet transform, identity preservation
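A minimal sketch of the wavelet split that motivates WaveFace, using PyWavelets on a single channel: the LL band (a quarter of the pixels per decomposition level, so 1/16 after two levels) would go to the conditional diffusion model, while the detail bands would go to the one-step high-frequency network. The restoration networks themselves are not shown.

```python
# Minimal sketch of the frequency decomposition (not the WaveFace model itself).
import numpy as np
import pywt

img = np.random.rand(512, 512).astype(np.float32)   # stand-in for one image channel

ll, (lh, hl, hh) = pywt.dwt2(img, "haar")            # each band is 256 x 256
# ll         -> restored by the conditional diffusion model (low-frequency content)
# lh, hl, hh -> restored by a single-pass high-frequency recovery network

restored = pywt.idwt2((ll, (lh, hl, hh)), "haar")    # reassemble the full-size image
assert restored.shape == img.shape
```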
2403.12722 Report HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting Hongyu Zhou, Jiahao Shao, Lu Xu, Dongfeng Bai, Weichao Qiu, Bingbing Liu, Yue Wang, Andreas Geiger, Yiyi Liao Holistic understanding of urban scenes based on RGB images is a challenging yet important problem. It encompasses understanding both the geometry and appearance to enable novel view synthesis, parsing semantic labels, and tracking moving objects. Despite considerable progress, existing approaches often focus on specific aspects of this task and require additional inputs such as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic urban scene understanding. Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians, where moving object poses are regularized via physical constraints. Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy, and reconstruct dynamic scenes, even in scenarios where 3D bounding box detections are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 demonstrate the effectiveness of our approach. Introduces HUGS, a novel pipeline leveraging 3D Gaussian Splatting for holistic urban scene understanding from posed RGB images. Enables holistic urban scene representation for applications like autonomous driving simulation, encompassing novel view synthesis, semantic parsing, and dynamic object tracking, without relying on expensive LiDAR or annotations. Decomposes scenes into static and dynamic 3D Gaussians, modeling moving objects' motion with a physically-constrained unicycle model. Jointly optimizes geometry, appearance, semantics, and motion using RGB images, noisy 2D semantic labels, and optical flow. Achieves state-of-the-art novel view synthesis on dynamic scenes, even with noisy 3D bounding box inputs. Enables high-quality novel view semantic synthesis, achieving comparable performance to state-of-the-art on KITTI-360. Allows for accurate 3D semantic reconstruction, outperforming Semantic Nerfacto in terms of geometric quality and semantic accuracy. Limited rotation capability for reconstructed dynamic objects. Lacks control over aspects like lighting editing. 3d scene understanding, gaussian splatting, novel view synthesis, semantic reconstruction, dynamic scenes
2403.12706 Report AnimateDiff-Lightning: Cross-Model Diffusion Distillation Shanchuan Lin, Xiao Yang We present AnimateDiff-Lightning for lightning-fast video generation. Our model uses progressive adversarial diffusion distillation to achieve new state-of-the-art in few-step video generation. We discuss our modifications to adapt it for the video modality. Furthermore, we propose to simultaneously distill the probability flow of multiple base diffusion models, resulting in a single distilled motion module with broader style compatibility. We are pleased to release our distilled AnimateDiff-Lightning model for the community's use. Presents AnimateDiff-Lightning, a lightning-fast video generation model using progressive adversarial diffusion distillation for few-step video generation, and introduces cross-model diffusion distillation to enhance the generalization ability of the distilled motion module across diverse stylized base models. Addresses the speed limitations of video generation models, particularly AnimateDiff, to make them more practical and widely adoptable by reducing the time and computational cost of the generation process. Adapts progressive adversarial diffusion distillation to the video modality by simultaneously distilling the probability flow of multiple base diffusion models (Stable Diffusion, RealisticVision, epiCRealism, ToonYou, IMP, Counterfeit) using a shared motion module, and employs a flow-conditional video discriminator to ensure sharp and flow-preserving predictions. Achieves better quality video generation in fewer inference steps compared to prior video distillation methods, particularly AnimateLCM. Demonstrates superior generalization ability to unseen stylized base models due to cross-model distillation. Retains compatibility with key AnimateDiff features, including Motion LoRAs, different aspect ratios, and video-to-video generation with ControlNet. Experiences heavy noise artifacts in 1-step generation and brightness flickers in 2-step generation due to limitations in the epsilon formulation. Shows a higher probability of generating bad cases when the aspect ratio deviates significantly from the square aspect ratio used during distillation training. video generation, diffusion models, model distillation, cross-model distillation, animatediff
2403.12658 Report Tuning-Free Image Customization with Image and Text Guidance Pengzhi Li, Qiang Nie, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, Feng Zheng Despite significant advancements in image customization with diffusion models, current methods still have several limitations: 1) unintended changes in non-target areas when regenerating the entire image; 2) guidance solely by a reference image or text descriptions; and 3) time-consuming fine-tuning, which limits their practical application. In response, we introduce a tuning-free framework for simultaneous text-image-guided image customization, enabling precise editing of specific image regions within seconds. Our approach preserves the semantic features of the reference image subject while allowing modification of detailed attributes based on text descriptions. To achieve this, we propose an innovative attention blending strategy that blends self-attention features in the UNet decoder during the denoising process. To our knowledge, this is the first tuning-free method that concurrently utilizes text and image guidance for image customization in specific regions. Our approach outperforms previous methods in both human and quantitative evaluations, providing an efficient solution for various practical applications, such as image synthesis, design, and creative photography. This paper proposes a novel tuning-free framework for image customization that utilizes both text and reference images to edit specific regions within an image. Current image customization methods have limitations such as unintended changes in non-target areas, reliance on a single guidance modality (text or image), and time-consuming fine-tuning. This work addresses these limitations by enabling precise region-based editing with dual guidance in a tuning-free manner. The method utilizes a three-stream denoising architecture with a self-attention blending strategy. It inverts a collage of the target region and reference subject to obtain latent codes. Then, it blends features from reconstruction, text-guided, and noise-injected streams during denoising to generate the customized image. The proposed method outperforms existing single-modality and two-step methods in both qualitative and quantitative comparisons. It achieves high fidelity to reference subjects while enabling text-driven attribute editing. User studies confirm the effectiveness of the approach, showing superior performance in fidelity, quality, and text alignment. The method faces challenges in editing scenes with significant perspective changes or non-rigid motion. Future work could explore incorporating perspective and motion guidance for more complex editing scenarios. image editing, image customization, diffusion model, text-image guidance, tuning-free
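The self-attention blending described here can be pictured as a masked mix of UNet decoder features from the three denoising streams; the sketch below is a schematic under that reading, with the blend weights and tensor layout as assumptions rather than the paper's exact design.

```python
import torch

def blend_self_attention(feat_recon, feat_ref, feat_text, region_mask, alpha=0.5):
    """Schematic blending of self-attention features from three streams:
    a reconstruction stream (preserves the untouched background), a
    reference/noise-injected stream (carries the subject identity), and a
    text-guided stream (applies the attribute edit).

    feat_*: (B, N, C) token features; region_mask: (B, N, 1) in [0, 1],
    1 inside the target region. alpha trades identity vs. text edit
    inside the region; the paper's actual weighting may differ.
    """
    inside = alpha * feat_ref + (1.0 - alpha) * feat_text
    return region_mask * inside + (1.0 - region_mask) * feat_recon

# Toy usage with 32x32 latent tokens.
b, n, c = 1, 1024, 320
mask = (torch.rand(b, n, 1) > 0.7).float()
out = blend_self_attention(torch.randn(b, n, c), torch.randn(b, n, c),
                           torch.randn(b, n, c), mask)
print(out.shape)
```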
2403.12585 Report LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing Yazeed Alharbi, Peter Wonka We present a novel, training-free approach for textual editing of real images using diffusion models. Unlike prior methods that rely on computationally expensive finetuning, our approach leverages LAtent SPatial Alignment (LASPA) to efficiently preserve image details. We demonstrate how the diffusion process is amenable to spatial guidance using a reference image, leading to semantically coherent edits. This eliminates the need for complex optimization and costly model finetuning, resulting in significantly faster editing compared to previous methods. Additionally, our method avoids the storage requirements associated with large finetuned models. These advantages make our approach particularly well-suited for editing on mobile devices and applications demanding rapid response times. While simple and fast, our method achieves 62-71% preference in a user-study and significantly better model-based editing strength and image preservation scores. This paper presents LASPA, a novel training-free method for single-image editing using text-to-image diffusion models that leverages latent spatial alignment for fast and efficient editing. Existing single-image editing methods using diffusion models are computationally expensive, requiring finetuning or complex optimization, making them impractical for real-time applications and resource-constrained devices. LASPA leverages the spatial latent of diffusion models by aligning it with the reference image features during the reverse diffusion process. This allows preserving image details while incorporating textual edits without modifying the model's parameters. LASPA achieves significantly faster editing speeds compared to previous methods (under 6 seconds). Qualitative and quantitative evaluations demonstrate superior image preservation and editing strength compared to state-of-the-art methods. The method is shown to be versatile and promising for various applications such as video editing, facial editing, and editing with faster diffusion models. LASPA can benefit from parameter tuning for specific edits and seed selection. Achieving large pose changes remains a challenge. text-to-image, diffusion models, single-image editing, latent spatial alignment, fast editing
2403.12550 Report RGBD GS-ICP SLAM Seongbo Ha, Jiung Yeon, Hyeonwoo Yu Simultaneous Localization and Mapping (SLAM) with dense representation plays a key role in robotics, Virtual Reality (VR), and Augmented Reality (AR) applications. Recent advancements in dense representation SLAM have highlighted the potential of leveraging neural scene representation and 3D Gaussian representation for high-fidelity spatial representation. In this paper, we propose a novel dense representation SLAM approach with a fusion of Generalized Iterative Closest Point (G-ICP) and 3D Gaussian Splatting (3DGS). In contrast to existing methods, we utilize a single Gaussian map for both tracking and mapping, resulting in mutual benefits. Through the exchange of covariances between tracking and mapping processes with scale alignment techniques, we minimize redundant computations and achieve an efficient system. Additionally, we enhance tracking accuracy and mapping quality through our keyframe selection methods. Experimental results demonstrate the effectiveness of our approach, showing an incredibly fast speed up to 107 FPS (for the entire system) and superior quality of the reconstructed map. This paper proposes RGBD GS-ICP SLAM, a novel real-time dense representation SLAM that integrates Generalized Iterative Closest Point (G-ICP) and 3D Gaussian Splatting (3DGS) for accurate and efficient tracking and mapping. Existing dense SLAM methods using neural scene representation or 3D Gaussian representation struggle to balance speed and accuracy, often relying on computationally expensive rendering or decoupled approaches. This paper addresses this limitation. The method leverages the shared representation of 3D Gaussians between G-ICP tracking and 3DGS mapping. It directly utilizes 3D information from G-ICP for tracking, eliminates redundant covariance computations, and introduces scale alignment techniques for smooth information transfer between the two processes. Additionally, it employs dynamic keyframe selection for both tracking and mapping to optimize performance. The method achieves state-of-the-art camera pose estimation accuracy on the Replica dataset, outperforming previous methods by over 50%. It demonstrates incredibly fast system speed, up to 107 FPS, while maintaining high-quality map reconstruction, significantly surpassing existing methods in speed. The paper provides comprehensive ablation studies, validating the contribution of each proposed component (scale regularization, scale alignment, keyframe selection, and local minima avoidance) to the overall performance. The method heavily relies on depth information, making it potentially susceptible to noise in real-world scenarios with low-quality depth sensors. Future work includes exploring the trade-off between speed and robustness by incorporating RGB information to enhance performance in challenging environments. slam, 3d gaussian splatting, g-icp, dense representation, real-time
2403.12532 Report UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, Lin Wang We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities -- images, text, audio, point cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat the image as the central modality and build an image-centered representation space; however, the space may be sub-optimal as it leads to an unbalanced representation space among all modalities. Moreover, the category names are directly used to extract text embeddings for the downstream tasks, making it hardly possible to represent the semantics of multi-modal data. The 'out-of-the-box' insight of our UniBind is to make the alignment center modality-agnostic and further learn a unified and balanced representation space, empowered by the large language models (LLMs). UniBind is superior in its flexible application to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding center on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding center via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior arts by an average of 6.36%. Finally, we achieve new state-of-the-art performance, e.g., a 6.75% gain on ImageNet, on the multi-modal fine-tuning setting while reducing 90% of the learnable parameters. Presents UniBind, a novel approach for multi-modal learning that uses LLM-augmented contrastive learning and modality-agnostic embedding centers to achieve a unified and balanced representation space. Existing methods often rely on image-centric representation spaces, leading to unbalanced performance across modalities. Additionally, using only category names as embedding centers fails to fully capture the semantic richness of multi-modal data. 1) Constructs a knowledge base of text descriptions using LLMs and multi-modal LLMs for each category and multi-modal data. 2) Adaptively builds class-wise embedding centers by selecting the most relevant text embeddings from the knowledge base. 3) Aligns multi-modal embeddings to these embedding centers via contrastive learning. Achieves significant performance improvements on zero-shot recognition tasks across seven modalities, averaging +6.27% gain in top-1 accuracy. Outperforms supervised methods on 10 out of 12 benchmarks for fine-tuning recognition, particularly excelling in datasets with many categories. Demonstrates substantial improvement in cross-modal retrieval tasks, with +17.96% gain on top-20 recall for event-to-image retrieval. The robustness of the LLM-augmented method requires further investigation and enhancement. Future work will explore leveraging LLMs to enhance the robustness of the modality-agnostic representation space. multi-modal learning, representation learning, contrastive learning, large language models, knowledge base
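Steps 2) and 3) above can be pictured as selecting relevant text embeddings to form a class center and then pulling modality embeddings toward it with a contrastive loss; the sketch below illustrates that reading with top-k averaging and an InfoNCE-style objective. Both choices, and all names, are assumptions rather than UniBind's released code.

```python
import torch
import torch.nn.functional as F

def class_center(text_embs, visual_emb, k=5):
    """Build a class-wise embedding center by averaging the k text
    embeddings (from an LLM-generated description bank) most similar to
    the encoded visual embedding. Shapes: text_embs (N, D), visual_emb (D,)."""
    sims = F.cosine_similarity(text_embs, visual_emb.unsqueeze(0), dim=-1)
    topk = sims.topk(k).indices
    return F.normalize(text_embs[topk].mean(dim=0), dim=-1)

def align_loss(modal_embs, centers, temperature=0.07):
    """InfoNCE-style loss pulling each modality embedding toward its
    class center (row i of modal_embs matches row i of centers)."""
    modal_embs = F.normalize(modal_embs, dim=-1)
    centers = F.normalize(centers, dim=-1)
    logits = modal_embs @ centers.t() / temperature
    targets = torch.arange(modal_embs.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings.
text_bank = torch.randn(100, 512)
center = class_center(text_bank, torch.randn(512))
loss = align_loss(torch.randn(8, 512), torch.randn(8, 512))
print(center.shape, loss.item())
```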
2403.12510 Report Generalized Consistency Trajectory Models for Image Manipulation Beomsu Kim, Jaemin Kim, Jeongsol Kim, Jong Chul Ye Diffusion-based generative models excel in unconditional generation, as well as on applied tasks such as image editing and restoration. The success of diffusion models lies in the iterative nature of diffusion: diffusion breaks down the complex process of mapping noise to data into a sequence of simple denoising tasks. Moreover, we are able to exert fine-grained control over the generation process by injecting guidance terms into each denoising step. However, the iterative process is also computationally intensive, often taking from tens up to thousands of function evaluations. Although consistency trajectory models (CTMs) enable traversal between any time points along the probability flow ODE (PFODE) and score inference with a single function evaluation, CTMs only allow translation from Gaussian noise to data. Thus, this work aims to unlock the full potential of CTMs by proposing generalized CTMs (GCTMs), which translate between arbitrary distributions via ODEs. We discuss the design space of GCTMs and demonstrate their efficacy in various image manipulation tasks such as image-to-image translation, restoration, and editing. Code: https://github.com/1202kbs/GCTM The paper proposes Generalized Consistency Trajectory Models (GCTMs), which extend Consistency Trajectory Models (CTMs) to enable one-step translation between arbitrary distributions via ODEs. Diffusion models, while powerful, are computationally intensive. CTMs offer fast sampling but are limited to Gaussian noise to data transformations. GCTMs overcome this by learning ODEs between any two distributions, enabling various image manipulation tasks efficiently. The paper leverages Flow Matching theory to generalize CTMs. It proposes a new parametrization for the FM ODE solution, enabling traversal between arbitrary distributions. GCTMs are trained by minimizing a combination of distillation and denoising score-matching losses. GCTMs with Optimal Transport coupling significantly accelerate training convergence in unconditional generation. In image-to-image translation, GCTMs achieve superior performance with NFE=1, outperforming SDE-based methods and GANs in terms of image quality and faithfulness. GCTMs excel in image restoration, surpassing DPS and CM in zero-shot settings, and achieving a good balance between perception and distortion metrics in supervised settings. GCTMs haven't yet reached state-of-the-art performance in unconditional generation. Further hyperparameter tuning, particularly inspired by iCMs, is suggested as future work to potentially boost performance. diffusion models, flow matching, consistency models, image manipulation, fast sampling
2403.12488 Report DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Jian Wu, Philip Torr We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new Chain-of-Thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measure standards (e.g., overlaying rulers and compasses), and infer from the contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan for progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on MS COCO Novel class set for open-vocabulary detection, +24.23% Acc on RefCOCO val set for zero-shot referring expression comprehension, +14.5% AP on D-cube describe object detection FULL setting. DetToolChain, a novel prompting paradigm using visual and reasoning prompts with a chain-of-thought approach, is proposed to unleash the zero-shot object detection ability of MLLMs. Existing methods for detection with MLLMs rely on finetuning, which is computationally expensive and infeasible for closed-source models. This work explores the potential of MLLMs as zero-shot detectors through prompting. The methodology involves: (1) Visual processing prompts (regional amplifier, spatial measurement standard, scene image parser) to pre-process images, (2) Detection reasoning prompts for result diagnosis and next prompt selection, and (3) A multimodal detection Chain-of-Thought (Det-CoT) to manage the detection process. DetToolChain significantly improves GPT-4V and Gemini performance on open-vocabulary detection, outperforming SOTA methods by a large margin (e.g., +21.5% AP50 on COCO Novel class set). It achieves state-of-the-art performance on described object detection (+14.5% AP on D-cube FULL set) and referring expression comprehension (+24.23% Acc on RefCOCO val set) tasks. Ablation studies demonstrate the effectiveness of individual visual prompting tools and highlight the superiority of Det-CoT over other CoT methods. The sequential processing of prompts in DetToolChain limits parallel computation, impacting efficiency. The framework's reliance on large-scale MLLMs and extensive message histories raises concerns about scalability and cost. multimodal large language model, prompting, object detection, chain-of-thought, zero-shot learning
2403.12431 Report Geometric Constraints in Deep Learning Frameworks: A Survey Vibhas K Vats, David J Crandall Stereophotogrammetry is an emerging technique of scene understanding. Its origins go back to at least the 1800s when people first started to investigate using photographs to measure the physical properties of the world. Since then, thousands of approaches have been explored. The classic geometric techniques of Shape from Stereo are built on using geometry to define constraints on scene and camera geometry and then solving the non-linear systems of equations. More recent work has taken an entirely different approach, using end-to-end deep learning without any attempt to explicitly model the geometry. In this survey, we explore the overlap between geometric-based and deep learning-based frameworks. We compare and contrast geometry enforcing constraints integrated into a deep learning framework for depth estimation or other closely related problems. We present a new taxonomy for prevalent geometry enforcing constraints used in modern deep learning frameworks. We also present insightful observations and potential future research directions. This paper surveys the use of geometric constraints in deep learning frameworks for depth estimation and related problems. It introduces a new taxonomy for these constraints and discusses their integration into various frameworks. While deep learning has advanced depth estimation, most methods rely heavily on supervised learning and large datasets. This paper explores how integrating geometric constraints can enhance structural consistency and reduce reliance on ground truth data. The paper reviews a range of geometric constraints, categorizing them and describing their mathematical formulations. It examines their application in different frameworks, including supervised, self-supervised, stereo, multi-view stereo, and monocular depth estimation. Explicitly modeling geometric constraints, along with supervision signals, enforces structural and occlusion reasoning and cross-view consistency. The integration of geometric constraints can potentially improve depth estimation accuracy, particularly in challenging scenarios like featureless regions or varying lighting conditions. The survey reveals a taxonomy of geometric constraints applicable to deep learning depth estimation, providing a valuable resource for researchers. The paper primarily focuses on summarizing existing work, with limited discussion on quantitative comparisons of different methods. Further research is needed to explore the optimal combination and integration of various geometric constraints for specific depth estimation tasks. depth estimation, geometric constraints, multi-view stereo, self-supervised learning, deep learning
2403.12409 Report ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, Ziwei Liu Generating high-quality 3D assets from a given image is highly desirable in various applications such as AR/VR. Recent advances in single-image 3D generation explore feed-forward models that learn to infer the 3D model of an object without optimization. Though promising results have been achieved in single object generation, these methods often struggle to model complex 3D assets that inherently contain multiple objects. In this work, we present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. 1) We first perform an in-depth analysis of this "multi-object gap" from both model and data perspectives. 2) Next, with reconstructed 3D models of different objects, we seek to adjust their sizes, rotation angles, and locations to create a 3D asset that matches the given image. 3) To automate this process, we apply spatially-aware score distillation sampling (SSDS) from pretrained diffusion models to guide the positioning of objects. Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling, and thus achieves more accurate results. Extensive experiments validate that ComboVerse achieves clear improvements over existing methods in generating compositional 3D assets. ComboVerse is a two-stage 3D generation framework that creates complex 3D assets by composing multiple objects, addressing the limitations of existing single-object models. Current single-image 3D generation methods struggle to model complex assets with multiple objects due to dataset bias and limitations in handling object interactions. 1. Single-object reconstruction: Objects in the input image are segmented, inpainted, and reconstructed individually. 2. Multi-object combination: Objects are automatically combined by optimizing their scale, rotation, and translation, guided by a spatially-aware score distillation sampling (SSDS) loss from pretrained diffusion models. Outperforms state-of-the-art methods in generating compositional 3D assets from single images. Effectively handles multiple objects, occlusion, and varying camera settings. Achieves better spatial object placement compared to standard SDS methods, as demonstrated by both qualitative and quantitative evaluations. Faces challenges in creating highly complex scenes with numerous objects. Relies on the quality of the backbone image-to-3D method used for single-object reconstruction. 3d generation, compositional generation, diffusion models, score distillation sampling, spatial awareness
2403.12365 Report GaussianFlow: Splatting Gaussian Dynamics for 4D Content Creation Quankai Gao, Qiangeng Xu, Zhe Cao, Ben Mildenhall, Wenchao Ma, Le Chen, Danhang Tang, Ulrich Neumann Creating 4D fields of Gaussian Splatting from images or videos is a challenging task due to its under-constrained nature. While the optimization can draw photometric reference from the input videos or be regulated by generative models, directly supervising Gaussian motions remains underexplored. In this paper, we introduce a novel concept, Gaussian flow, which connects the dynamics of 3D Gaussians and pixel velocities between consecutive frames. The Gaussian flow can be efficiently obtained by splatting Gaussian dynamics into the image space. This differentiable process enables direct dynamic supervision from optical flow. Our method significantly benefits 4D dynamic content generation and 4D novel view synthesis with Gaussian Splatting, especially for contents with rich motions that are hard for existing methods to handle. The common color drifting issue that happens in 4D generation is also resolved with improved Gaussian dynamics. Superior visual quality on extensive experiments demonstrates our method's effectiveness. Quantitative and qualitative evaluations show that our method achieves state-of-the-art results on both tasks of 4D generation and 4D novel view synthesis. Project page: https://zerg-overmind.github.io/GaussianFlow.github.io/ This paper introduces Gaussian flow, a differentiable method for directly supervising the dynamics of 3D Gaussians in 4D Gaussian Splatting using optical flow. Creating 4D Gaussian Splatting fields from images or videos is challenging due to under-constrained scene dynamics, especially from sparse-view or monocular videos. Existing methods lack direct supervision of Gaussian motions, leading to temporal inconsistencies and artifacts. Gaussian flow connects 3D Gaussian dynamics with 2D pixel velocities. It leverages the rendering process of 3D Gaussian Splatting to splat Gaussian dynamics onto the image plane, enabling direct supervision by matching Gaussian flow with pre-computed optical flow. Gaussian flow significantly improves 4D content generation and 4D novel view synthesis with Gaussian Splatting. The method excels at handling scenes with rich and fast motions, outperforming existing approaches. Color drifting artifacts common in 4D generation are resolved due to the improved accuracy of Gaussian dynamics. The current implementation focuses on short-term flow supervision between consecutive frames; exploring long-term supervision could further enhance temporal consistency. The paper primarily focuses on single-view supervision; future work could explore multi-view flow supervision. 4d generation, 4d novel view synthesis, 3d gaussian splatting, dynamic scene, optical flow
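Once the per-pixel Gaussian flow has been splatted, the dynamic supervision reduces to comparing it against precomputed optical flow; a minimal sketch of such a loss is shown below. The rendering of the Gaussian-flow map itself is assumed to exist upstream, and names are illustrative, not the paper's code.

```python
import torch

def gaussian_flow_loss(gaussian_flow, optical_flow, valid_mask=None):
    """L1 supervision between a splatted Gaussian-flow map and a
    precomputed optical-flow map, both (H, W, 2) in pixels.

    gaussian_flow is assumed to come from alpha-compositing each
    Gaussian's projected 2D displacement between consecutive frames,
    so the loss is differentiable w.r.t. the Gaussian parameters.
    """
    diff = (gaussian_flow - optical_flow).abs().sum(dim=-1)  # (H, W)
    if valid_mask is not None:
        diff = diff[valid_mask]
    return diff.mean()

# Toy usage with random maps.
gf = torch.randn(64, 64, 2, requires_grad=True)
of = torch.randn(64, 64, 2)
loss = gaussian_flow_loss(gf, of)
loss.backward()
print(loss.item())
```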
2403.12326 Report Removing Undesirable Concepts in Text-to-Image Generative Models with Learnable Prompts Anh Bui, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung Generative models have demonstrated remarkable potential in generating visually impressive content from textual descriptions. However, training these models on unfiltered internet data poses the risk of learning and subsequently propagating undesirable concepts, such as copyrighted or unethical content. In this paper, we propose a novel method to remove undesirable concepts from text-to-image generative models by incorporating a learnable prompt into the cross-attention module. This learnable prompt acts as additional memory to transfer the knowledge of undesirable concepts into it and reduce the dependency of these concepts on the model parameters and corresponding textual inputs. Because of this knowledge transfer into the prompt, erasing these undesirable concepts is more stable and has minimal negative impact on other concepts. We demonstrate the effectiveness of our method on the Stable Diffusion model, showcasing its superiority over state-of-the-art erasure methods in terms of removing undesirable content while preserving other unrelated elements. This paper introduces KPOP, a novel method using learnable parameter prompts in cross-attention layers to remove undesirable concepts from text-to-image generative models while minimizing impact on other concepts. Training on unfiltered data risks generative models learning and propagating undesirable, unethical or copyrighted content. Existing erasure methods often degrade model performance on related concepts. KPOP uses a two-step process: 1) **Knowledge Transfer**: Train the prompt to mimic generation of the undesirable concept. 2) **Knowledge Removal**: Fine-tune the model to erase the concept, using the prompt to regularize the process and minimize impact on other concepts. KPOP demonstrates superior performance in erasing object-related concepts while preserving unrelated ones compared to baselines. KPOP effectively mitigates NSFW content generation, achieving lower ratios of exposed body parts in images compared to baselines. KPOP successfully erases artistic style concepts according to CLIP alignment scores, outperforming baselines in erasing while comparably preserving content. Larger prompt sizes, while improving erasure, can negatively impact the model's ability to preserve unrelated concepts due to softmax normalization. Exploration of alternative prompting mechanisms, such as amortizing the prompt or injecting it before the text encoder, is left for future work. concept erasure, text-to-image generation, stable diffusion, cross-attention, prompt tuning
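A learnable prompt in a cross-attention module can be realized by appending trainable tokens to the text context before computing keys and values; the sketch below shows that mechanism in isolation. The single shared key/value projection and all hyperparameters are simplifying assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PromptedCrossAttention(nn.Module):
    """Cross-attention with extra learnable prompt tokens appended to the
    text context, acting as additional memory into which an erasure
    objective can push knowledge of undesirable concepts."""

    def __init__(self, dim=320, ctx_dim=768, n_prompt=8, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.to_kv = nn.Linear(ctx_dim, dim)
        self.prompt = nn.Parameter(torch.randn(1, n_prompt, ctx_dim) * 0.02)

    def forward(self, x, context):
        # x: (B, N, dim) image tokens; context: (B, T, ctx_dim) text tokens.
        ctx = torch.cat([context, self.prompt.expand(context.size(0), -1, -1)], dim=1)
        kv = self.to_kv(ctx)
        out, _ = self.attn(x, kv, kv)
        return out

layer = PromptedCrossAttention()
out = layer(torch.randn(2, 64, 320), torch.randn(2, 77, 768))
print(out.shape)  # (2, 64, 320)
```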
2403.12042 Report Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation Zixin Zhu, Xuelu Feng, Dongdong Chen, Junsong Yuan, Chunming Qiao, Gang Hua In this paper, we explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned from a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed pretrained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. Besides, instead of using the standard Gaussian noise, we propose to predict the video-specific noise with an extra noise prediction module, which can help preserve the feature fidelity and elevate segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pretrained with discriminative image/video pre-tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code will be available at https://github.com/buxiangzhiren/VD-IT This paper explores the potential of pre-trained text-to-video (T2V) diffusion models for video understanding, specifically for the task of Referring Video Object Segmentation (R-VOS). It introduces a novel framework, VD-IT, built upon a fixed pre-trained T2V model, incorporating text-guided image projection and video-specific noise prediction for enhanced feature extraction. The paper investigates whether the latent representations learned by generative T2V models, which excel in capturing temporal consistency, can benefit video understanding tasks like R-VOS. This exploration aims to advance the understanding and application of generative models in discriminative tasks. The VD-IT framework utilizes a pre-trained T2V model for feature extraction, employing two key innovations: (1) Text-Guided Image Projection, combining referring text and visual tokens as prompts to enhance feature richness and temporal consistency. (2) Video-Specific Noise Prediction, replacing standard Gaussian noise with predicted video-correlated noise to preserve feature fidelity. VD-IT achieves state-of-the-art results on four R-VOS benchmarks, demonstrating significant improvements over existing methods, particularly in maintaining temporal consistency. Analysis shows that visual features extracted using VD-IT exhibit better temporal semantic consistency and spatial smoothness compared to those from discriminatively fine-tuned video backbones. Experiments confirm that the use of referring text in feature extraction, coupled with video-specific noise prediction, significantly contributes to enhanced performance. The current implementation of VD-IT is limited by its computational cost, primarily due to the T2V diffusion model. The framework focuses on single-object R-VOS, requiring further exploration for multi-object scenarios. referring video object segmentation, text-to-video diffusion models, video understanding, temporal consistency, generative models for discriminative tasks
2403.12038 Report Zero-Shot Image Feature Consensus with Deep Functional Maps Xinle Cheng, Congyue Deng, Adam Harley, Yixin Zhu, Leonidas Guibas Correspondences emerge from large-scale vision models trained for generative and discriminative tasks. This has been revealed and benchmarked by computing correspondence maps between pairs of images, using nearest neighbors on the feature grids. Existing work has attempted to improve the quality of these correspondence maps by carefully mixing features from different sources, such as by combining the features of different layers or networks. We point out that a better correspondence strategy is available, which directly imposes structure on the correspondence field: the functional map. Wielding this simple mathematical tool, we lift the correspondence problem from the pixel space to the function space and directly optimize for mappings that are globally coherent. We demonstrate that our technique yields correspondences that are not only smoother but also more accurate, with the possibility of better reflecting the knowledge embedded in the large-scale vision models that we are studying. Our approach sets a new state-of-the-art on various dense correspondence tasks. We also demonstrate our effectiveness in keypoint correspondence and affordance map transfer. The paper presents a zero-shot framework for image correspondence that leverages functional maps to improve the coherence and accuracy of matches derived from pre-trained large-scale vision models. Existing methods based on nearest neighbor search in feature space often lack global structure awareness, leading to distortions and discontinuities in the correspondence maps. This paper addresses this limitation by representing correspondences as functional maps, which capture global deformations more effectively. The method utilizes two sets of features from pre-trained networks. It constructs a graph Laplacian from one set to define a function basis and optimizes a functional map on this basis using the second set as a regularizer. The optimization incorporates descriptor preservation, compactness, and bijectivity constraints. The framework outperforms previous zero-shot methods on dense correspondence benchmarks, demonstrating both improved accuracy and smoothness. It effectively fuses features from different networks and layers, outperforming simple concatenation approaches. The method shows promising results in applications like keypoint matching and affordance transfer. The current framework is better suited for object-centric images than complex scenes, as it relies on the manifold assumption. Future work could explore extending the method to handle complex scenes by incorporating segmentation or exploring matches between quotient spaces. functional map, zero-shot image matching, dense correspondence, emergent feature property, feature fusion
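The core optimization is a small linear solve for the functional map in the spectral bases; the sketch below shows a least-squares map with descriptor preservation and a diagonal Laplacian-commutativity-style regularizer. This approximates but does not reproduce the paper's full objective (the compactness and bijectivity terms are omitted), and all names are hypothetical.

```python
import torch

def solve_functional_map(A, B, evals_a, evals_b, lam=1e-2):
    """Least-squares functional map with a diagonal commutativity term.

    A: (k1, d) source descriptors projected onto the source basis.
    B: (k2, d) target descriptors projected onto the target basis.
    evals_a, evals_b: Laplacian eigenvalues of the two bases.
    Each row c_i of C minimizes
        ||c_i A - b_i||^2 + lam * sum_j (evals_b[i] - evals_a[j])^2 * c_ij^2,
    which has the closed-form normal equation solved below.
    """
    k1, k2 = A.shape[0], B.shape[0]
    C = torch.zeros(k2, k1)
    AAt = A @ A.t()                                   # (k1, k1)
    for i in range(k2):
        reg = torch.diag(lam * (evals_b[i] - evals_a) ** 2)
        rhs = A @ B[i]                                # (k1,)
        C[i] = torch.linalg.solve(AAt + reg, rhs)
    return C

# Toy usage with random descriptors and synthetic eigenvalues.
A = torch.randn(30, 128)
B = torch.randn(30, 128)
C = solve_functional_map(A, B, torch.linspace(0, 1, 30), torch.linspace(0, 1, 30))
print(C.shape)
```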
2403.12036 Report One-Step Image Translation with Text-to-Image Models Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, Jun-Yan Zhu In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo. This paper introduces a novel one-step image translation method using text-to-image diffusion models, achieving efficient adaptation to new tasks and domains through adversarial learning objectives. This approach addresses limitations of existing conditional diffusion models, namely slow inference speed and reliance on paired training data. The method leverages a pre-trained one-step diffusion model (SD-Turbo), adapting it via: 1) Direct conditioning input to the noise encoder, 2) Consolidating encoder, UNet, and decoder into a single trainable architecture with LoRA, 3) Incorporating skip connections for detail preservation. Outperforms GAN-based and diffusion-based methods in unpaired image translation tasks (e.g., day-night conversion, weather effects). Achieves comparable results to ControlNet in paired settings (e.g., Sketch2Photo, Edge2Image) with single-step inference. Enables diverse output generation by interpolating between noise maps and encoder outputs. Lacks control over guidance strength due to the absence of classifier-free guidance in the backbone model. Memory intensive training due to cycle-consistency loss and high-capacity generators. image translation, diffusion models, text-to-image synthesis, adversarial learning, one-step inference
2403.12035 Report CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility Bojia Zi, Shihao Zhao, Xianbiao Qi, Jianan Wang, Yukai Shi, Qianyu Chen, Bin Liang, Kam-Fai Wong, Lei Zhang Recent advancements in video generation have been remarkable, yet many existing methods struggle with issues of consistency and poor text-video alignment. Moreover, the field lacks effective techniques for text-guided video inpainting, a stark contrast to the well-explored domain of text-guided image inpainting. To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability and compatibility. Specifically, we introduce a simple but efficient motion capture module to preserve motion consistency, and design an instance-aware region selection instead of a random region selection to obtain better textual controllability, and utilize a novel strategy to inject some personalized models into our CoCoCo model and thus obtain better model compatibility. Extensive experiments show that our model can generate high-quality video clips. Meanwhile, our model shows better motion consistency, textual controllability and model compatibility. More details are shown in [cococozibojia.github.io](cococozibojia.github.io). This paper proposes CoCoCo, a novel text-guided video inpainting model that improves upon existing methods by enhancing consistency, controllability, and compatibility. Existing video generation methods struggle with maintaining consistency across frames, aligning generated content with text prompts, and integrating personalized text-to-image models. CoCoCo addresses these limitations to improve text-guided video inpainting. CoCoCo introduces a motion capture module with damped global attention and textual cross-attention, employs an instance-aware region selection strategy, and utilizes a task vector combination approach to adapt personalized text-to-image models. CoCoCo demonstrates superior background preservation and temporal consistency compared to baselines. The instance-aware region selection and textual cross-attention significantly improve text-alignment capabilities, as evidenced by CLIP score. The proposed method successfully integrates personalized text-to-image models, allowing for customized content generation within inpainted regions. The optimal parameters for integrating personalized models may vary depending on the specific models used. Further research can explore extending the compatibility to a wider range of pretrained models. video inpainting, text-guided synthesis, motion consistency, text-video alignment, personalized models
2403.12034 Report VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models Junlin Han, Filippos Kokkinos, Philip Torr This paper presents a novel paradigm for building scalable 3D generative models utilizing pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, texts, or videos, 3D data are not readily accessible and are difficult to acquire. This results in a significant disparity in scale compared to the vast quantities of other types of data. To address this issue, we propose using a video diffusion model, trained with extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds and achieves superior performance when compared to current SOTA feed-forward 3D generative models, with users preferring our results over 70% of the time. Presents VFusion3D, a novel paradigm for building scalable 3D generative models by leveraging pre-trained video diffusion models as 3D data generators. Addresses the obstacle of limited 3D data availability by utilizing the vast knowledge base of video diffusion models trained on extensive text, image, and video data. 1. Fine-tunes a video diffusion model (EMU Video) with rendered multi-view videos from a 3D dataset to generate 3D-consistent multi-view sequences. 2. Creates a large-scale synthetic multi-view dataset using text prompts and the fine-tuned EMU Video. 3. Trains a feed-forward 3D generative model (VFusion3D) using the synthetic dataset and fine-tunes it with the original 3D data. VFusion3D generates high-quality 3D assets from a single image in seconds. Outperforms state-of-the-art feed-forward 3D generative models in user studies and automated metrics. Demonstrates the scalability of learning 3D generative models from synthetic multi-view data generated by video diffusion models. Limited performance of the fine-tuned video diffusion model in generating multi-view sequences for certain object categories like vehicles and text. Future work includes exploring stronger video diffusion models, larger and more diverse 3D datasets, and advancements in feed-forward 3D generative model architectures. 3d generative models, video diffusion models, synthetic data generation, multi-view synthesis, large-scale training
2403.12032 Report Generic 3D Diffusion Adapter Using Controlled Multi-View Editing Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Jiayuan Gu, Gordon Wetzstein, Hao Su, Leonidas Guibas Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in either 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization. This paper introduces MVEdit, a generic framework for adapting pre-trained 2D image diffusion models to enable 3D-aware diffusion for high-quality textured mesh generation. Open-domain 3D object synthesis lags behind image synthesis due to limited data and high computational complexity. Existing multi-view diffusion methods often fall short in 3D consistency, visual quality, or efficiency. MVEdit employs a novel training-free 3D Adapter within an ancestral sampling process. This adapter fuses multi-view 2D images into a coherent 3D representation, using either NeRF or mesh, to control subsequent 2D denoising steps for 3D consistency without sacrificing image quality. MVEdit achieves state-of-the-art results in both image-to-3D and text-guided texture generation, outperforming previous methods in visual quality and efficiency. The 3D Adapter effectively resolves 3D inconsistencies in multi-view images, leading to more accurate and detailed 3D reconstructions. The authors also introduce StableSSDNeRF, a fast text-to-3D diffusion model fine-tuned from Stable Diffusion, which can be used to initialize MVEdit for efficient domain-specific generation. The 3D-to-3D editing pipeline can still suffer from the Janus problem, especially when the degree of editing is high. The off-the-shelf ControlNets used in the 3D Adapter may introduce minor inconsistencies or biases. diffusion models, 3d generation, texture synthesis, multi-view consistency, 3d editing
2403.12028 Report Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail Mingjin Chen, Junhao Chen, Xiaojun Ye, Huan-ang Gao, Xiaoxue Chen, Zhaoxin Fan, Hao Zhao 3D human body reconstruction has been a challenge in the field of computer vision. Previous methods are often time-consuming and difficult to capture the detailed appearance of the human body. In this paper, we propose a new method called Ultraman for fast reconstruction of textured 3D human models from a single image. Compared to existing techniques, Ultraman greatly improves the reconstruction speed and accuracy while preserving high-quality texture details. We present a set of new frameworks for human reconstruction consisting of three parts, geometric reconstruction, texture generation and texture mapping. Firstly, a mesh reconstruction framework is used, which accurately extracts 3D human shapes from a single image. At the same time, we propose a method to generate a multi-view consistent image of the human body based on a single image. This is finally combined with a novel texture mapping method to optimize texture details and ensure color consistency during reconstruction. Through extensive experiments and evaluations, we demonstrate the superior performance of Ultraman on various standard datasets. In addition, Ultraman outperforms state-of-the-art methods in terms of human rendering quality and speed. Upon acceptance of the article, we will make the code and data publicly available. Ultraman, a novel 3D human reconstruction framework that reconstructs high-quality body meshes with detailed textures from single front-view images. Existing methods are time-consuming and struggle to capture detailed appearance, especially for clothed humans. Ultraman addresses these limitations by achieving faster and more detailed reconstruction. The framework consists of three modules: 1) Mesh Reconstruction: Generates a 3D human mesh from the input image. 2) Multi-view Image Generation: Uses a diffusion-based model to synthesize consistent images from unobserved viewpoints guided by depth, text prompts, and the input image. 3) Texturing: Projects the generated multi-view images onto the mesh's texture space, ensuring consistency and smoothing seams. Ultraman reconstructs high-quality 3D human models with detailed textures in 20-30 minutes, outperforming state-of-the-art methods in terms of speed (93% faster) and visual quality. The multi-view image generation module, guided by VQA prompts and depth information, effectively synthesizes realistic textures for unseen areas, improving consistency between front and back views. Quantitative evaluations on standard datasets demonstrate Ultraman's superiority in capturing geometric details and generating high-fidelity textures. The current view selection strategy might not fully cover all details for complex poses. Exploring alternative texturing techniques to further enhance texture quality and reduce artifacts. 3d human reconstruction, single-image reconstruction, diffusion models, texture synthesis, multi-view consistency
2403.12019 Report LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, Chen Change Loy The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks. This paper introduces LN3Diff, a novel framework for fast and generic conditional 3D generation that utilizes a 3D-aware variational autoencoder (VAE) to encode images into a compact latent space for efficient 3D diffusion learning. Existing methods for 3D diffusion face challenges in scalability, efficiency, and generalizability due to reliance on high-dimensional neural fields and limitations in handling conditional generation. LN3Diff employs a 3D-aware VAE to compress input images into a lower-dimensional latent space. A transformer-based decoder then reconstructs high-capacity 3D neural fields from this latent space. A diffusion model is trained on this compact latent space, enabling efficient conditional 3D generation. LN3Diff achieves state-of-the-art 3D generation performance on ShapeNet, outperforming GAN-based and other 3D diffusion methods. It exhibits superior performance in monocular 3D reconstruction and conditional generation across ShapeNet, FFHQ, and Objaverse datasets. LN3Diff surpasses existing 3D diffusion approaches in inference speed, achieving 3x faster generation without per-instance optimization. The monocular encoder struggles with challenging 3D scenes, suggesting the need for a multi-view encoder. The reliance on volume rendering poses memory constraints; exploring more efficient 3D representations like 3DGS is a potential future direction. 3d generation, 3d reconstruction, latent diffusion model, neural rendering, variational autoencoder
2403.12015 Report Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, Robin Rombach Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD) aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting. This paper presents Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach for diffusion models that utilizes generative features from pretrained latent diffusion models, enabling high-resolution multi-aspect ratio image synthesis. Diffusion models, while powerful for image and video synthesis, suffer from slow inference speed. LADD addresses this by enabling fast, single-step inference while maintaining high image quality. LADD operates in latent space, unifying the discriminator and teacher model, and leverages synthetic data for training, simplifying the distillation process and enhancing performance. SD3-Turbo, a fast, distilled version of Stable Diffusion 3, achieves state-of-the-art text-to-image generation quality in just four sampling steps. LADD demonstrates stable scaling behavior, with larger student models significantly impacting performance. The versatility of LADD is demonstrated in image editing and inpainting tasks, achieving comparable results to the teacher model in a single step. While achieving fast inference, SD3-Turbo exhibits a slight reduction in prompt alignment compared to the teacher model. In image editing, the lack of adjustable image and text guidance strengths limits controllability. diffusion models, image synthesis, model distillation, adversarial training, latent space
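Adversarial distillation of this kind ultimately reduces to standard hinge GAN losses applied to logits computed from latent features; the sketch below shows only those loss terms. The discriminator heads on top of frozen teacher features, and the rest of the LADD training loop, are assumed and not shown, so this is a generic illustration rather than the released training code.

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    """Hinge discriminator loss on logits from latent features of real
    (teacher/data) latents vs. one-step student-generated latents."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    """Adversarial loss for the one-step student generator."""
    return -d_fake.mean()

# Toy usage: in practice the logits would come from lightweight heads on
# top of frozen teacher-model features of noised latents.
d_real = torch.randn(8, 1)
d_fake = torch.randn(8, 1)
print(d_hinge_loss(d_real, d_fake).item(), g_hinge_loss(d_fake).item())
```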
2403.12010 Report VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, Qixing Huang Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Unlike leveraging images from 2D diffusion models for training, we propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are more suitable for multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video data sets used to train these models are abundant and diverse, leading to a reduced train-finetuning domain gap. To enhance multi-view consistency, we introduce a 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to get an explicit global 3D model, and then adopts a sampling strategy that effectively involves images rendered from the global 3D model into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets represented by 3D Gaussians within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousand GPU hours) with comparable visual quality and consistency. By further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual effects. Our project page is aigc3d.github.io/VideoMV. This paper proposes VideoMV, a method for consistent dense multi-view image generation by fine-tuning pre-trained video generative models and introducing 3D-Aware Denoising Sampling. Creating multi-view consistent images is crucial for 3D content creation, but existing methods struggle with efficiency, consistency, or generalizability. This paper leverages the inherent temporal consistency in video generation models to improve upon these limitations. The method consists of three stages: 1) Fine-tuning a pre-trained video generative model on rendered multi-view images with camera pose conditioning. 2) Training a feed-forward network to reconstruct 3D models from noisy multi-view images. 3) Applying 3D-Aware Denoising Sampling which incorporates rendered views from the reconstructed 3D model into the denoising loop. VideoMV achieves state-of-the-art results on text-based and image-based multi-view generation benchmarks, outperforming existing methods in image quality, consistency, and efficiency. The method can generate 24 consistent views in just 5 seconds, enabling applications like dense view reconstruction and distillation-based 3D generation. Experiments demonstrate that VideoMV generalizes well to unseen prompts and web images. The reconstruction module currently uses a sparse view setup due to computational constraints, limiting its ability to fully leverage dense view information. Further exploration is needed to optimize the distillation sampling pipeline for dense views, potentially leading to even higher-quality 3D reconstructions. 
multi-view image generation, 3d-aware denoising, video generative models, 3d reconstruction, novel view synthesis
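A toy sketch of the 3D-Aware Denoising Sampling loop described in the VideoMV entry above, under strong simplifying assumptions: the denoiser, the feed-forward reconstruction-and-rendering module, the blend weight, and the noise schedule are all stand-ins, shown only to make the structure of the sampling loop concrete.

```python
# Hedged sketch: periodically blend views rendered from a global 3D estimate back
# into the multi-view denoising loop, then re-noise to the current timestep.
import torch

def reconstruct_and_render(views):
    # stand-in for the feed-forward 3D reconstruction + rendering module
    return views.mean(dim=0, keepdim=True).expand_as(views)

T, n_views, c, h, w = 50, 24, 4, 32, 32
x = torch.randn(n_views, c, h, w)        # noisy multi-view latents
alphas = torch.linspace(0.99, 0.90, T)   # toy noise schedule (assumption)

for t in range(T - 1, -1, -1):
    x0_pred = x - 0.1 * torch.randn_like(x)       # stand-in denoiser prediction of x0
    if t % 10 == 0:                               # periodically enforce 3D consistency
        rendered = reconstruct_and_render(x0_pred)
        x0_pred = 0.7 * x0_pred + 0.3 * rendered  # blend rendered views into the loop
    a = alphas[t]
    x = a.sqrt() * x0_pred + (1 - a).sqrt() * torch.randn_like(x)  # re-noise
print(x.shape)
```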
2403.12008 Report SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, Varun Jampani We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation propose techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affecting the performance of 3D object generation. In this work, we propose SV3D that adapts image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of the video models, while further adding explicit camera control for NVS. We also propose improved 3D optimization techniques to use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental results on multiple datasets with 2D and 3D metrics as well as user study demonstrate SV3D's state-of-the-art performance on NVS as well as 3D reconstruction compared to prior works. Presents Stable Video 3D (SV3D), a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object, enabling novel view synthesis (NVS) and 3D generation. Addresses limitations in existing 3D generation methods, which suffer from limited views or inconsistent NVS, by adapting a high-resolution, image-conditioned video diffusion model for multi-view consistency and generalization. Finetunes Stable Video Diffusion (SVD) to generate orbital videos conditioned on a single image and camera poses, utilizing static and dynamic orbits, triangular CFG scaling, and a two-stage 3D optimization process with a disentangled illumination model and masked score distillation sampling (SDS) loss. SV3D achieves state-of-the-art performance on NVS, demonstrating high multi-view consistency, generalization to real-world images, and camera pose controllability. The proposed 3D generation pipeline produces high-quality meshes with intricate geometric and texture details. Ablation studies confirm the benefits of progressive finetuning, dynamic orbits, disentangled illumination, and masked SDS loss. SV3D is currently limited to two degrees of freedom (elevation and azimuth) in camera control. The model exhibits inconsistency for mirror-like reflective surfaces, and the shading model doesn't account for such surfaces. novel view synthesis, 3d generation, video diffusion models, score distillation sampling, multi-view consistency
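The triangular CFG scaling mentioned in the SV3D entry above can be illustrated with a small helper that ramps the guidance scale up toward the back view of the orbit and back down; the linear ramp and the min/max values below are assumptions for illustration, not the paper's exact schedule.

```python
import numpy as np

def triangular_cfg_scale(num_frames: int, min_scale: float = 1.0, max_scale: float = 2.5):
    """Per-frame CFG scale that ramps up toward the back view and back down, so frames
    near the conditioning image stay close to it. Values are illustrative assumptions."""
    half = num_frames // 2
    up = np.linspace(min_scale, max_scale, half + 1)
    down = np.linspace(max_scale, min_scale, num_frames - half)
    return np.concatenate([up[:-1], down])

scales = triangular_cfg_scale(21)   # one scale per orbital frame
print(scales.round(2))
```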
2403.12002 Report DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing Hyeonho Jeong, Jinho Chang, Geon Yeong Park, Jong Chul Ye Text-driven diffusion-based video editing presents a unique challenge not encountered in image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match space-time self-similarities of the original video and the edited video during the score distillation. Thanks to the use of score distillation, our approach is model-agnostic, which can be applied for both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion. DreamMotion presents a novel approach for zero-shot video editing that leverages score distillation sampling from pre-trained text-to-video diffusion models to inject target appearances into videos while preserving the original structure and motion. Existing video editing methods struggle to balance introducing new content while maintaining realistic and temporally consistent motion. DreamMotion addresses this challenge by directly optimizing on real video data, bypassing the limitations of traditional denoising processes. DreamMotion utilizes Video Delta Denoising Score (V-DDS) gradients to gradually inject target appearances while employing a space-time self-similarity regularization technique. This regularization minimizes structural deviations by aligning spatial self-similarities and prevents temporal artifacts via temporal self-similarity matching. DreamMotion successfully injects target appearances while accurately preserving the structure and motion of the source video. The method is model-agnostic, demonstrating effectiveness in both cascaded and non-cascaded video diffusion frameworks. Quantitative and qualitative evaluations, including a user study, confirm that DreamMotion outperforms existing state-of-the-art approaches. DreamMotion is primarily designed for edits that preserve the overall structure of the original video, limiting its applicability in scenarios requiring significant structural alterations. Future work could explore extending the approach to incorporate more sophisticated masking techniques or investigate alternative self-similarity measures for enhanced performance. video editing, diffusion models, score distillation sampling, self-similarity, zero-shot learning
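A hedged sketch of the space-time self-similarity regularizer described in the DreamMotion entry above: cosine self-similarity matrices are computed per frame (spatial) and per patch across frames (temporal), and deviations between the source and edited feature sets are penalized. Feature extraction from the diffusion model is assumed and replaced by random tensors here.

```python
import torch
import torch.nn.functional as F

def self_similarity(feats):
    """feats: (N, C) token features -> (N, N) cosine self-similarity matrix."""
    f = F.normalize(feats, dim=-1)
    return f @ f.t()

def spacetime_ssim_loss(src, edit):
    """src/edit: (T, P, C) features (frames x patches x channels). Matches spatial
    self-similarity per frame and temporal self-similarity per patch. A toy stand-in
    for the paper's regularizer; the feature source is an assumption."""
    spatial = torch.stack([(self_similarity(s) - self_similarity(e)).pow(2).mean()
                           for s, e in zip(src, edit)]).mean()
    src_t, edit_t = src.transpose(0, 1), edit.transpose(0, 1)   # (P, T, C)
    temporal = torch.stack([(self_similarity(s) - self_similarity(e)).pow(2).mean()
                            for s, e in zip(src_t, edit_t)]).mean()
    return spatial + temporal

loss = spacetime_ssim_loss(torch.randn(8, 64, 32), torch.randn(8, 64, 32))
print(float(loss))
```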
2403.11999 Report HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs Ting Yao, Yehao Li, Yingwei Pan, Tao Mei The hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably suffers from heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT) that upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then utilizes more convolution operations over such low-resolution features. Experiments on both the recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost ($\sim$5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute 0.9% improvement over the 83.4% of iFormer-S with 224$\times$224 inputs. HIRI-ViT, a novel five-stage Vision Transformer backbone tailored for high-resolution inputs, decomposing typical CNN operations into two parallel branches to achieve cost-efficient scaling. Scaling up input resolution enhances model capacity but suffers from heavy computational cost in existing ViT backbones. A five-stage ViT structure with a two-branch design (high-resolution branch with fewer convolutions and low-resolution branch with more convolutions) is proposed, coupled with inverted residual downsampling and EMA distillation. HIRI-ViT achieves state-of-the-art performance on ImageNet-1K with high-resolution inputs, surpassing existing backbones under comparable computational costs. HIRI-ViT demonstrates superior generalizability in downstream tasks like object detection, instance segmentation, and semantic segmentation on COCO and ADE20K datasets. Ablation studies validate the effectiveness of the proposed five-stage structure, two-branch design, and EMA distillation strategy. Limited improvement is observed with a six-stage structure. Scaling up Video Vision Transformers with high-resolution inputs remains a challenge. vision transformer, high-resolution inputs, cnn+vit hybrid backbone, image recognition, dense prediction tasks
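The two-branch decomposition described in the HIRI-ViT entry above can be summarized by a small block in which a light convolution runs on the high-resolution input while a heavier stack runs on a downsampled copy, with the two paths fused at the lower resolution; channel counts and kernel choices below are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    """Illustrative two-branch stem: a light path on high-resolution features and a
    heavier path on downsampled features, fused at the lower resolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.high = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)   # few ops at high res
        self.low = nn.Sequential(                                    # more ops at low res
            nn.AvgPool2d(2),
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.GELU(),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )

    def forward(self, x):
        return self.high(x) + self.low(x)

y = TwoBranchBlock(3, 64)(torch.randn(1, 3, 448, 448))
print(y.shape)  # torch.Size([1, 64, 224, 224])
```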
2403.11990 Report GetMesh: A Controllable Model for High-quality Mesh Generation and Manipulation Zhaoyang Lyu, Ben Fei, Jinyi Wang, Xudong Xu, Ya Zhang, Weidong Yang, Bo Dai Mesh is a fundamental representation of 3D assets in various industrial applications, and is widely supported by professional software. However, due to its irregular structure, mesh creation and manipulation are often time-consuming and labor-intensive. In this paper, we propose a highly controllable generative model, GetMesh, for mesh generation and manipulation across different categories. By taking a varying number of points as the latent representation, and re-organizing them as a triplane representation, GetMesh generates meshes with rich and sharp details, outperforming both single-category and multi-category counterparts. Moreover, it also enables fine-grained control over the generation process that previous mesh generative models cannot achieve, where changing global/local mesh topologies, adding/removing mesh parts, and combining mesh parts across categories can be intuitively, efficiently, and robustly accomplished by adjusting the number, positions or features of latent points. Project page is https://getmesh.github.io. This paper introduces GetMesh, a novel controllable generative model for high-quality mesh generation and manipulation across different categories. Creating and editing meshes is currently time-consuming and labor-intensive due to their irregular structure. GetMesh addresses this by enabling intuitive and efficient generation and manipulation of meshes. GetMesh utilizes a varying number of points as the latent representation, re-organized as a triplane representation. Two diffusion models, one for point positions and another for features, learn the data distribution. A triplane-based decoder with a refinement module reconstructs high-quality meshes from the latent representation. GetMesh generates meshes with rich details, outperforming both single-category and multi-category counterparts. GetMesh allows intuitive control over mesh generation, enabling changes to topology, addition/removal of parts, and combination of parts across categories. GetMesh can be seamlessly combined with off-the-shelf material generation methods for textured mesh generation. Training GetMesh requires expensive ground-truth 3D data. GetMesh's scalability is validated only on the ShapeNet dataset. 3d generation, controllable generation, diffusion model, mesh generation, mesh manipulation
2403.11956 Report Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu With the rapid development of generative models, Artificial Intelligence-Generated Content (AIGC) has increased exponentially in daily life. Among them, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating videos of high perceptual quality, there is still a lack of methods to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from text-video alignment and video fidelity perspectives, then leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjective-aligned predictions, validating its effectiveness. The dataset and code will be released at https://github.com/QMME/T2VQA. This paper introduces T2VQA-DB, the largest subjective text-to-video dataset to date, and proposes T2VQA, a novel transformer-based model for subjective-aligned text-to-video quality assessment. Existing T2V datasets lack scale and comprehensive human annotations, while current metrics inadequately capture the nuances of human perception, particularly text-video alignment. T2VQA-DB is built with 10,000 videos from 9 T2V models and 1,000 prompts, annotated with MOS from 27 subjects. T2VQA leverages BLIP and Swin-T for text-video alignment and video fidelity feature extraction, fuses them with cross-attention, and employs an LLM for quality regression. T2VQA-DB surpasses existing T2V datasets in scale and annotation comprehensiveness. T2VQA outperforms existing T2V metrics and SOTA VQA models on T2VQA-DB, demonstrating its effectiveness. Qualitative analysis reveals T2VQA's superior ability to align with subjective human judgments on video quality. T2VQA-DB may not fully represent the capabilities of state-of-the-art models like Sora due to resolution and length limitations. Further cross-dataset validation is needed to confirm T2VQA's generalization to other T2V datasets. text-to-video dataset, video quality assessment, text-to-video generation, multi-modal learning, large language models
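A minimal sketch of the two-branch fusion described in the T2VQA entry above: alignment tokens attend to fidelity tokens via cross-attention before a quality score is regressed. The BLIP/Swin-T extractors and the LLM regressor are replaced with random tensors and a linear head purely for illustration.

```python
import torch
import torch.nn as nn

align_feats = torch.randn(1, 32, 256)      # stand-in for BLIP text-video alignment tokens
fidelity_feats = torch.randn(1, 49, 256)   # stand-in for Swin-T spatial fidelity tokens

# Cross-attention fusion: alignment tokens query the fidelity tokens.
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
fused, _ = cross_attn(query=align_feats, key=fidelity_feats, value=fidelity_feats)

score_head = nn.Linear(256, 1)             # stand-in for the LLM-based quality regressor
mos_pred = score_head(fused.mean(dim=1))   # pooled features -> predicted MOS
print(mos_pred.shape)                       # torch.Size([1, 1])
```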
2403.11929 Report LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, Hang Xu Despite the success of generating high-quality images given any text prompts by diffusion-based generative models, prior works directly generate the entire images, but cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and associated mask layers for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the specific-content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer. Introduces LayerDiff, a layer-collaborative diffusion model for text-guided, multi-layered, and composable image synthesis. Existing text-to-image models lack object-wise manipulation capability, limiting their use in applications like graphic design where layered compositions are crucial. LayerDiff employs layer-collaborative attention blocks for inter- and intra-layer information exchange, a layer-specific prompt enhancer to refine content generation using global textual cues, and a self-mask guidance sampling strategy for high-quality multi-layered images. LayerDiff generates high-fidelity multi-layered images with performance comparable to traditional whole-image generation methods. LayerDiff enables versatile control for various generative applications, including layer-wise composable image manipulation and style transfer. A new data construction pipeline generates high-quality, multi-layered composable images for training LayerDiff, integrating state-of-the-art techniques in image captioning, object localization, segmentation, and inpainting. Existing multi-layer training data generation pipelines are inefficient, limiting the ability to produce large-scale training data and impacting model performance. The model's performance on three and four-layered images is limited by the availability of training data. 
multi-layered composable image synthesis, layer-collaborative diffusion model, layer-specific image editing, text-to-image synthesis, controllable image generation
2403.11909 Report RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF Sibi Catley-Chandar, Richard Shaw, Gregory Slabaugh, Eduardo Perez-Pellitero Recent advances in neural rendering have enabled highly photorealistic 3D scene reconstruction and novel view synthesis. Despite this progress, current state-of-the-art methods struggle to reconstruct high frequency detail, due to factors such as a low-frequency bias of radiance fields and inaccurate camera calibration. One approach to mitigate this issue is to enhance images post-rendering. 2D enhancers can be pre-trained to recover some detail but are agnostic to scene geometry and do not easily generalize to new distributions of image degradation. Conversely, existing 3D enhancers are able to transfer detail from nearby training images in a generalizable manner, but suffer from inaccurate camera calibration and can propagate errors from the geometry into rendered images. We propose a neural rendering enhancer, RoGUENeRF, which exploits the best of both paradigms. Our method is pre-trained to learn a general enhancer while also leveraging information from nearby training images via robust 3D alignment and geometry-aware fusion. Our approach restores high-frequency textures while maintaining geometric consistency and is also robust to inaccurate camera calibration. We show that RoGUENeRF substantially enhances the rendering quality of a wide range of neural rendering baselines, e.g. improving the PSNR of MipNeRF360 by 0.63dB and Nerfacto by 1.34dB on the real world 360v2 dataset. This paper introduces RoGUENeRF, a geometry-consistent NeRF enhancer that improves the image quality of NeRF renderings while being robust to inaccurate camera calibration. Current NeRF models struggle to reconstruct high-frequency details due to factors like low-frequency bias and inaccurate camera calibration. Existing enhancement methods are either 2D (geometry-agnostic) or 3D (sensitive to calibration errors). RoGUENeRF leverages both paradigms for improved quality and robustness. RoGUENeRF uses a 3D+2D alignment with depth maps and camera poses, refined by an optical flow network. A geometry-aware attention module regulates misaligned regions. It's pre-trained on render-GT image pairs and fine-tuned on novel scenes. RoGUENeRF consistently improves PSNR, SSIM, and LPIPS across six NeRF baselines and three datasets (LLFF, DTU, 360v2). It shows significant qualitative improvements, especially in high-frequency regions like foliage and text. It exhibits robustness to inaccurate camera calibration, outperforming other methods in noisy settings. A limitation is the storage requirement for training images, potentially prohibitive for large scenes. While faster than baselines, it doesn't yet achieve real-time inference. Future work includes exploring more efficient architectures and larger-scale pre-training datasets. neural rendering, nerf, image enhancement, 3d vision, robustness
2403.11887 Report SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules Xiangyu Chen, Jing Liu, Ye Wang, Pu Perry Wang, Matthew Brand, Guanghui Wang, Toshiaki Koike-Akino Low-rank adaptation (LoRA) and its variants are widely employed in fine-tuning large models, including large language models for natural language processing and diffusion models for computer vision. This paper proposes a generalized framework called SuperLoRA that unifies and extends different LoRA variants, which can be realized under different hyper-parameter settings. Introducing grouping, folding, shuffling, projecting, and tensor factoring, SuperLoRA offers high flexibility compared with other LoRA variants and demonstrates superior performance for transfer learning tasks especially in the extremely few-parameter regimes. This paper proposes SuperLoRA, a generalized framework unifying and extending LoRA variants for parameter-efficient fine-tuning of large models. SuperLoRA addresses limitations of existing LoRA methods by introducing grouping, folding, shuffling, and projection, enabling high flexibility and superior performance in transfer learning, especially with extremely few parameters. SuperLoRA concatenates weight updates across layers, divides them into groups, reshapes them into regular tensors, applies low-rank decomposition (LoRA, LoNKr, or LoRTA), and projects the results with a fixed mapping function (e.g., fastfood projection). SuperLoRA achieves 3-10x parameter efficiency compared to LoRA in image classification and generation tasks. Reshaping weight updates to regular tensors significantly improves performance, allowing higher rank usage with fewer parameters. Fixed random projection enables further parameter reduction while maintaining competitive accuracy. Exploring more efficient projection functions for extremely low-parameter regimes. Applying and evaluating SuperLoRA to various large models (e.g., LLMs) and transfer learning tasks. low-rank adaptation, parameter-efficient fine-tuning, transfer learning, tensor rank decomposition, efficient ai
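To make the grouping-and-folding idea in the SuperLoRA entry above concrete, the sketch below concatenates the weight updates of several hypothetical layers, splits them into groups, folds each group into a near-square matrix, and assigns one shared low-rank pair per group; group count, rank, and layer shapes are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

layer_shapes = [(128, 64), (128, 64), (256, 64)]      # hypothetical target weight shapes
total = sum(o * i for o, i in layer_shapes)
num_groups, rank = 2, 4
group_len = -(-total // num_groups)                   # ceil division
side = int(group_len ** 0.5) + 1                      # fold each group to ~square

# One low-rank pair (A, B) per group instead of one per layer.
A = nn.ParameterList([nn.Parameter(torch.randn(side, rank) * 0.01) for _ in range(num_groups)])
B = nn.ParameterList([nn.Parameter(torch.zeros(rank, side)) for _ in range(num_groups)])

# Expand group factors, flatten, and unfold back into per-layer weight updates.
flat = torch.cat([(a @ b).reshape(-1) for a, b in zip(A, B)])[:total]
updates, offset = [], 0
for (o, i) in layer_shapes:
    updates.append(flat[offset:offset + o * i].reshape(o, i))
    offset += o * i
print([u.shape for u in updates], "trainable params:", sum(p.numel() for p in list(A) + list(B)))
```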
2403.11882 Report ReGenNet: Towards Human Action-Reaction Synthesis Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, Wenjun Zeng Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper, we comprehensively analyze the asymmetric, dynamic, synchronous, and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with, we propose to annotate the actor-reactor order of the interaction sequences for the NTU120, InterHuman, and Chi3D datasets. Based on them, a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner, where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines, and can generalize to unseen actor motions and viewpoint changes. This paper introduces the first multi-setting human action-reaction synthesis benchmark and proposes ReGenNet, a diffusion-based model, to generate plausible and instant human reactions. Modeling human-human interactions, crucial for applications like AR/VR and gaming, is challenging due to its asymmetric, dynamic, synchronous, and detailed nature, which previous works have not addressed holistically. The authors annotate actor-reactor order in existing datasets (NTU120, Chi3D, InterHuman) and propose ReGenNet, a diffusion model with a Transformer decoder architecture. ReGenNet uses an explicit distance-based interaction loss to model the relative distances of interacted body poses, orientations, and translations. ReGenNet outperforms baselines in FID, demonstrating closer proximity to real human reaction distributions. The model shows strong generalization ability to unseen actor motions and viewpoint changes. ReGenNet is modular and can be customized for various settings like offline and intention-aware reaction generation. Current benchmark focuses on atomic action periods and can be extended to handle longer interactions with role transitions. Dataset quality can be improved with less noisy motion capture and more natural facial expressions. human action-reaction synthesis, human motion generation, diffusion models, transformer decoders, human-human interaction
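The explicit distance-based interaction loss described in the ReGenNet entry above can be illustrated as follows: in addition to a reconstruction term, errors in the relative actor-reactor joint distances are penalized. The exact weighting and distance formulation are assumptions rather than the paper's formula.

```python
import torch

def interaction_loss(actor_joints, pred_reactor, gt_reactor):
    """Tensors are (T, J, 3). Penalizes errors in relative actor-reactor joint
    distances on top of an ordinary reconstruction term (weights are illustrative)."""
    rel_pred = pred_reactor[:, :, None, :] - actor_joints[:, None, :, :]  # (T, J, J, 3)
    rel_gt = gt_reactor[:, :, None, :] - actor_joints[:, None, :, :]
    dist_term = (rel_pred.norm(dim=-1) - rel_gt.norm(dim=-1)).abs().mean()
    recon_term = (pred_reactor - gt_reactor).pow(2).mean()
    return recon_term + 0.5 * dist_term

T, J = 60, 24
loss = interaction_loss(torch.randn(T, J, 3), torch.randn(T, J, 3), torch.randn(T, J, 3))
print(float(loss))
```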
2403.11878 Report InTeX: Interactive Text-to-texture Synthesis via Unified Depth-aware Inpainting Jiaxiang Tang, Ruijie Lu, Xiaokang Chen, Xiang Wen, Gang Zeng, Ziwei Liu Text-to-texture synthesis has become a new frontier in 3D content creation thanks to the recent advances in text-to-image models. Existing methods primarily adopt a combination of pretrained depth-aware diffusion and inpainting models, yet they exhibit shortcomings such as 3D inconsistency and limited controllability. To address these challenges, we introduce InteX, a novel framework for interactive text-to-texture synthesis. 1) InteX includes a user-friendly interface that facilitates interaction and control throughout the synthesis process, enabling region-specific repainting and precise texture editing. 2) Additionally, we develop a unified depth-aware inpainting model that integrates depth information with inpainting cues, effectively mitigating 3D inconsistencies and improving generation speed. Through extensive experiments, our framework has proven to be both practical and effective in text-to-texture synthesis, paving the way for high-quality 3D content creation. Introduces InteX, an interactive text-to-texture synthesis framework using a unified depth-aware inpainting model. Addresses limitations in existing methods like 3D inconsistency, limited controllability, and lack of user interaction in texture synthesis. Trains a unified depth-aware inpainting prior model on 3D datasets and employs an iterative texture synthesis algorithm with a user-friendly GUI for interaction. Generates high-quality textures with enhanced detail and 3D consistency compared to previous methods. Enables interactive visualization, inpainting, and repainting of textures through a user-friendly GUI. Significantly faster (30 seconds per instance) than previous iterative inpainting methods. Single-view rendering can lead to 3D inconsistencies in the iterative inpainting process. Reliance on auto-generated UV maps when artist-created ones are unavailable can impact texture symmetry. text-to-texture synthesis, 3d content creation, diffusion models, depth-aware inpainting, interactive design
2403.11868 Report View-Consistent 3D Editing with Gaussian Splatting Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, Hanwang Zhang The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations. Currently, diffusion-based 2D editing models are harnessed to modify multi-view rendered images, which then guide the editing of 3DGS models. However, this approach faces a critical issue of multi-view inconsistency, where the guidance images exhibit significant discrepancies across views, leading to mode collapse and visual artifacts of 3DGS. To this end, we introduce View-consistent Editing (VcEdit), a novel framework that seamlessly incorporates 3DGS into image editing processes, ensuring multi-view consistency in edited guidance images and effectively mitigating mode collapse issues. VcEdit employs two innovative consistency modules: the Cross-attention Consistency Module and the Editing Consistency Module, both designed to reduce inconsistencies in edited images. By incorporating these consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency, facilitating high-quality 3DGS editing across a diverse range of scenes. Further code and video results are released at http://yuxuanw.me/vcedit/. Introduces View-consistent Editing (VcEdit), a framework for high-quality 3D Gaussian Splatting (3DGS) editing that ensures multi-view consistency in guidance images to address mode collapse issues. Image-guided 3DGS editing often suffers from multi-view inconsistency in edited guidance images, leading to mode collapse and visual artifacts. VcEdit incorporates two novel consistency modules: the Cross-attention Consistency Module (CCM) harmonizes attention maps across views, and the Editing Consistency Module (ECM) calibrates editing outputs using 3DGS. These modules operate within an iterative pattern to refine editing quality. Effectively addresses multi-view inconsistency in edited images, resulting in superior 3DGS editing quality. Outperforms state-of-the-art methods in both qualitative and quantitative evaluations, including CLIP similarity and user studies. Demonstrates strong adaptability in handling diverse scenes and prompts, ranging from facial details to large-scale scene modifications. Performance depends on the quality of 2D image editing models, which can sometimes struggle with complex prompts. Limitations in handling non-rigid editing scenarios with drastic shape changes due to high inconsistency in 2D editing outputs. 3d gaussian splatting, 3d editing, multi-view consistency, text-guided image editing, diffusion models
2403.11835 Report Agent3D-Zero: An Agent for Zero-shot 3D Understanding Sha Zhang, Di Huang, Jiajun Deng, Shixiang Tang, Wanli Ouyang, Tong He, Yanyong Zhang The ability to understand and reason the 3D real world is a crucial milestone towards artificial general intelligence. The current common practice is to finetune Large Language Models (LLMs) with 3D data and texts to enable 3D understanding. Despite their effectiveness, these approaches are inherently limited by the scale and diversity of the available 3D data. Alternatively, in this work, we introduce Agent3D-Zero, an innovative 3D-aware agent framework addressing the 3D scene understanding in a zero-shot manner. The essence of our approach centers on reconceptualizing the challenge of 3D scene perception as a process of understanding and synthesizing insights from multiple images, inspired by how our human beings attempt to understand 3D scenes. By consolidating this idea, we propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding. Specifically, given an input 3D scene, Agent3D-Zero first processes a bird's-eye view image with custom-designed visual prompts, then iteratively chooses the next viewpoints to observe and summarize the underlying knowledge. A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints and thus facilitate observing 3D scenes. Extensive experiments demonstrate the effectiveness of the proposed framework in understanding diverse and previously unseen 3D environments. This paper introduces Agent3D-Zero, an agent framework that leverages Vision-Language Models (VLMs) for zero-shot 3D scene understanding using only multi-view images, eliminating the need for explicit 3D data. Collecting and annotating 3D data is resource-intensive, limiting the scalability of existing 3D scene understanding methods that rely on 3D data. This work explores a zero-shot approach using VLMs to overcome this limitation. Agent3D-Zero employs an iterative viewpoint selection process guided by a novel visual prompting technique called Set-of-Line Prompting (SoLP) to enhance the VLM's understanding of spatial relationships within a scene. SoLP uses a bird's-eye view image with superimposed grid lines to aid in viewpoint selection. Agent3D-Zero outperforms previous state-of-the-art methods on the ScanQA dataset for 3D question answering, demonstrating its effectiveness in zero-shot 3D scene understanding. The method shows promising results in other 3D tasks such as task decomposition, 3D-assisted dialog, and 3D scene captioning, indicating its potential as a general framework for 3D scene analysis. Ablation studies confirm the importance of viewpoint selection and the effectiveness of SoLP in improving the model's performance on 3D understanding tasks. The current implementation of Agent3D-Zero exhibits limitations in precise and mathematical pose estimation due to constraints in the VLM's ability to interpret highly dense visual prompts. Future research will focus on enhancing the agent's navigation capabilities and extending its application to a wider array of real-world scenarios, further bridging the gap between language models and 3D scene understanding. 3d scene understanding, vision-language models, zero-shot learning, viewpoint selection, visual prompting
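A simple illustration of the Set-of-Line Prompting described in the Agent3D-Zero entry above: grid lines and cell labels are drawn on a bird's-eye-view image so the VLM can refer to cells when proposing the next viewpoint; grid size, colors, and labeling scheme are illustrative choices.

```python
from PIL import Image, ImageDraw

def set_of_line_prompt(bev: Image.Image, grid: int = 8) -> Image.Image:
    """Overlay labeled grid lines on a bird's-eye-view image (illustrative visual prompt)."""
    img = bev.copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for i in range(1, grid):
        draw.line([(i * w // grid, 0), (i * w // grid, h)], fill=(255, 0, 0), width=2)
        draw.line([(0, i * h // grid), (w, i * h // grid)], fill=(255, 0, 0), width=2)
    for r in range(grid):
        for c in range(grid):
            draw.text((c * w // grid + 4, r * h // grid + 4), f"{r},{c}", fill=(255, 255, 0))
    return img

prompted = set_of_line_prompt(Image.new("RGB", (512, 512), "gray"))
prompted.save("bev_solp.png")
```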
2403.11831 Report BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting Lingzhe Zhao, Peng Wang, Peidong Liu While neural rendering has demonstrated impressive capabilities in 3D scene reconstruction and novel view synthesis, it heavily relies on high-quality sharp images and accurate camera poses. Numerous approaches have been proposed to train Neural Radiance Fields (NeRF) with motion-blurred images, commonly encountered in real-world scenarios such as low-light or long-exposure conditions. However, the implicit representation of NeRF struggles to accurately recover intricate details from severely motion-blurred images and cannot achieve real-time rendering. In contrast, recent advancements in 3D Gaussian Splatting achieve high-quality 3D scene reconstruction and real-time rendering by explicitly optimizing point clouds as Gaussian spheres. In this paper, we introduce a novel approach, named BAD-Gaussians (Bundle Adjusted Deblur Gaussian Splatting), which leverages explicit Gaussian representation and handles severe motion-blurred images with inaccurate camera poses to achieve high-quality scene reconstruction. Our method models the physical image formation process of motion-blurred images and jointly learns the parameters of Gaussians while recovering camera motion trajectories during exposure time. In our experiments, we demonstrate that BAD-Gaussians not only achieves superior rendering quality compared to previous state-of-the-art deblur neural rendering methods on both synthetic and real datasets but also enables real-time rendering capabilities. Our project page and source code is available at https://lingzhezhao.github.io/BAD-Gaussians/ This paper introduces BAD-Gaussians, a novel method for reconstructing high-quality 3D scenes from motion-blurred images with inaccurate camera poses, leveraging the explicit representation of 3D Gaussian Splatting and achieving real-time rendering. Existing neural rendering methods, including NeRF and 3D Gaussian Splatting, struggle to handle motion-blurred images due to the violation of sharp image assumptions and difficulties in accurate camera pose estimation. This hinders their application in real-world scenarios with motion blur. BAD-Gaussians models the physical image formation process of motion blur and jointly optimizes Gaussian parameters and camera motion trajectories within exposure time. It represents camera trajectories using spline functions and synthesizes blurred images by averaging virtual sharp images rendered from interpolated camera poses along the trajectory. The optimization is achieved by minimizing the photometric error between synthesized and input blurred images. BAD-Gaussians outperforms previous state-of-the-art deblurring neural rendering methods on both synthetic and real datasets in terms of rendering quality. The method achieves real-time rendering capabilities, surpassing the limitations of implicit neural rendering techniques. BAD-Gaussians effectively recovers accurate camera poses from motion-blurred images, demonstrating robustness against pose inaccuracies. The performance of BAD-Gaussians can be affected by the accuracy of the initial camera poses and sparse point clouds obtained from COLMAP. The assumption of short exposure time may limit the generalizability of the method to scenarios with very long exposures. 3d gaussian splatting, deblurring, bundle adjustment, differentiable rendering, motion blur
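The physical blur model described in the BAD-Gaussians entry above, i.e., a blurred frame as the average of virtual sharp renders along the intra-exposure camera trajectory, can be sketched as below; linear pose interpolation stands in for the paper's spline trajectory, and the renderer is a placeholder.

```python
import torch

def synthesize_blur(render_fn, pose_start, pose_end, n_virtual: int = 9):
    """Average virtual sharp renders along the camera trajectory within exposure time.
    Linear interpolation between two poses is a toy stand-in for a spline; `render_fn`
    is an assumed differentiable renderer."""
    frames = []
    for tau in torch.linspace(0.0, 1.0, n_virtual):
        pose = (1 - tau) * pose_start + tau * pose_end
        frames.append(render_fn(pose))
    return torch.stack(frames).mean(dim=0)

render_fn = lambda pose: pose.sum() + torch.zeros(3, 64, 64)   # stand-in renderer
blurred = synthesize_blur(render_fn, torch.zeros(4, 4), torch.eye(4))
print(blurred.shape)   # torch.Size([3, 64, 64])
```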
2403.11796 Report OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation Haochen Jiang, Yueming Xu, Yihan Zeng, Hang Xu, Wei Zhang, Jianfeng Feng, Li Zhang 3D reconstruction has been widely used in autonomous navigation fields of mobile robotics. However, the former research can only provide the basic geometry structure without the capability of open-world scene understanding, limiting advanced tasks like human interaction and visual navigation. Moreover, traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a model for a single task with supervision. Thus, geometric reconstruction with zero-shot scene understanding i.e. Open vocabulary 3D Understanding and Reconstruction, is crucial for the future development of mobile robots. In this paper, we propose OpenOcc, a novel framework unifying the 3D scene reconstruction and open vocabulary understanding with neural radiance fields. We model the geometric structure of the scene with occupancy representation and distill the pre-trained open vocabulary model into a 3D language field via volume rendering for zero-shot inference. Furthermore, a novel semantic-aware confidence propagation (SCP) method has been proposed to relieve the issue of language field representation degeneracy caused by inconsistent measurements in distilled features. Experimental results show that our approach achieves competitive performance in 3D scene understanding tasks, especially for small and long-tail objects. This paper presents OpenOcc, a novel framework that unifies 3D scene reconstruction and open-vocabulary understanding using neural radiance fields, enabling zero-shot semantic segmentation. Existing 3D reconstruction methods often lack semantic understanding, while traditional 3D scene understanding approaches struggle with open-world scenarios and require extensive labeled data. This work addresses these limitations by integrating both aspects into a single framework. OpenOcc employs an occupancy representation for efficient geometric reconstruction and distills pre-trained open-vocabulary 2D segmentation features into a 3D language field. A novel semantic-aware confidence propagation (SCP) method mitigates inconsistencies in the language field arising from multi-view observations. OpenOcc achieves competitive performance on 3D semantic segmentation benchmarks, particularly for small and long-tail objects. The method demonstrates superior accuracy in reconstructing shapes and contours of objects compared to baseline methods. OpenOcc enables efficient open-vocabulary 3D understanding with reduced memory and computational requirements compared to traditional approaches. The reconstruction quality is limited by the quality of input depth data, which can be noisy or incomplete, especially on datasets like ScanNet. Future work could explore incorporating temporal information and object-level reasoning for improved scene understanding and dynamic scene reconstruction. 3d reconstruction, open vocabulary, semantic segmentation, neural radiance fields, robotic visual navigation
2403.11781 Report Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm Yi Wu, Ziqiang Li, Heliang Zheng, Chaoyue Wang, Bin Li Drawing on recent advancements in diffusion models for text-to-image generation, identity-preserved personalization has made significant progress in accurately capturing specific identities with just a single reference image. However, existing methods primarily integrate reference images within the text embedding space, leading to a complex entanglement of image and text information, which poses challenges for preserving both identity fidelity and semantic consistency. To tackle this challenge, we propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization. Specifically, we introduce identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information while deactivating the original text cross-attention module of the diffusion model. This ensures that the image stream faithfully represents the identity provided by the reference image while mitigating interference from textual input. Additionally, we introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams. This mechanism not only enhances the fidelity of identity and semantic consistency but also enables convenient control over the styles of the generated images. Extensive experimental results on both raw photo generation and style image generation demonstrate the superior performance of our proposed method. This paper introduces Infinite-ID, a novel identity-preserved personalization method for text-to-image generation that maintains high fidelity to a reference image while ensuring consistency with the text prompt. Existing methods struggle to balance identity fidelity and semantic consistency due to the entanglement of image and text information. Infinite-ID addresses this challenge to enable diverse applications like personalized AI portraits. The authors propose an ID-semantics decoupling paradigm. It uses identity-enhanced training with a dedicated image cross-attention module to capture identity information without text interference. A mixed attention mechanism then merges identity and text features during inference. An AdaIN-mean operation further refines style control. Infinite-ID outperforms state-of-the-art methods in preserving identity fidelity while maintaining semantic consistency. The method demonstrates robust performance across various image resolutions and enables the mixing of multiple identities. It excels in both raw photo generation and style image generation, showcasing its versatility and effectiveness. The method currently lacks multi-object personalization capabilities. Artifacts may arise when the face is small in the input image, highlighting a limitation inherited from the base diffusion model. text-to-image generation, identity-preserved personalization, stable diffusion, diffusion models, attention mechanisms
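The AdaIN-mean operation mentioned in the Infinite-ID entry above can be read as a mean-only feature alignment: the content stream is shifted so its per-channel mean matches the style stream while variance is left untouched. The (B, N, C) shapes and the point of application are assumptions.

```python
import torch

def adain_mean(content_feats: torch.Tensor, style_feats: torch.Tensor) -> torch.Tensor:
    """Mean-only AdaIN: re-center content features on the style stream's per-channel mean."""
    return (content_feats
            - content_feats.mean(dim=1, keepdim=True)
            + style_feats.mean(dim=1, keepdim=True))

out = adain_mean(torch.randn(1, 77, 768), torch.randn(1, 77, 768))
print(out.shape)   # torch.Size([1, 77, 768])
```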
2403.11703 Report LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD. This paper introduces LLaVA-UHD, a large multimodal model capable of efficiently processing images of any aspect ratio and high resolution. Current LMMs struggle with varied aspect ratios and high-resolution images, limiting their understanding of fine-grained details and increasing hallucination errors. This paper aims to address these limitations. LLaVA-UHD utilizes (1) an image modularization strategy to divide images into smaller variable-sized slices for efficient encoding, (2) a compression module to condense visual tokens, and (3) a spatial schema to organize slice tokens for LLM processing. LLaVA-UHD outperforms existing LMMs on 9 benchmarks, including those trained with significantly more data. Compared to the LLaVA-1.5 backbone, LLaVA-UHD achieves a 6.4 accuracy improvement on TextVQA and supports 6 times larger resolution images with less computation. The model demonstrates superior performance on images with extreme aspect ratios and excels in fine-grained recognition tasks. Current implementation is limited to a maximum resolution of 672x1008; future work will explore higher resolutions. Image slices are currently encoded independently; future research will focus on establishing connections between slices for enhanced global information interaction. large multimodal models, visual encoding, high-resolution image understanding, image modularization, llava-uhd
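A hedged sketch of the image modularization idea in the LLaVA-UHD entry above: given a native resolution, pick a slice grid whose aspect ratio best matches the image while staying within a slice budget. The scoring heuristic and budget below are assumptions, not the paper's exact rule.

```python
import math

def choose_slice_grid(width: int, height: int, slice_budget: int = 6):
    """Pick a (cols, rows) split whose aspect ratio best matches the native image,
    keeping cols * rows within the slice budget (illustrative heuristic)."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, slice_budget + 1):
        for rows in range(1, slice_budget + 1):
            if cols * rows > slice_budget:
                continue
            err = abs(math.log((cols / rows) / target))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

print(choose_slice_grid(672, 1088))   # tall image -> (2, 3): more rows than columns
```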
2403.11697 Report Urban Scene Diffusion through Semantic Occupancy Map Junge Zhang, Qihang Zhang, Li Zhang, Ramana Rao Kompella, Gaowen Liu, Bolei Zhou Generating unbounded 3D scenes is crucial for large-scale scene understanding and simulation. Urban scenes, unlike natural landscapes, consist of various complex man-made objects and structures such as roads, traffic signs, vehicles, and buildings. To create a realistic and detailed urban scene, it is crucial to accurately represent the geometry and semantics of the underlying objects, going beyond their visual appearance. In this work, we propose UrbanDiffusion, a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and generates an urban scene with geometry and semantics in the form of semantic occupancy map. Our model introduces a novel paradigm that learns the data distribution of scene-level structures within a latent space and further enables the expansion of the synthesized scene into an arbitrary scale. After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes given the BEV maps from the held-out set and also generalize to the synthesized maps from a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator as a prior. This paper proposes Urban Scene Diffusion through Semantic Occupancy Map (UrbanDiff), a novel 3D diffusion model for generating unbounded 3D urban scenes using semantic occupancy maps, conditioned on Bird's-Eye View (BEV) maps. Generating large-scale urban scenes with accurate geometry and semantics is crucial for applications like scene simulation and autonomous driving. Existing methods struggle to achieve this while preserving controllability and scalability. UrbanDiff employs a 3D VQVAE to encode semantic occupancy maps into a latent space, where a BEV-conditioned diffusion model learns the data distribution. A scene extension module enables the generation of large-scale scenes by aggregating single-frame outputs while maintaining temporal consistency. UrbanDiff generates diverse and realistic urban scenes from real-world and simulator-generated BEV maps. Quantitative evaluation demonstrates superior performance over baseline methods in terms of V-FID, MMD, and human evaluation. The generated scenes benefit downstream tasks like point cloud segmentation and can be used as a prior for scene image synthesis with promising results. The visual quality of synthesized scene images can be further improved. Future work will focus on incorporating object instance information for enhanced realism. 3d scene generation, diffusion models, semantic occupancy maps, "birds-eye view", urban scene synthesis
2403.11679 Report NEDS-SLAM: A Novel Neural Explicit Dense Semantic SLAM Framework using 3D Gaussian Splatting Yiming Ji, Yang Liu, Guanghu Xie, Boyu Ma, Zongwu Xie We propose NEDS-SLAM, an Explicit Dense semantic SLAM system based on 3D Gaussian representation, that enables robust 3D semantic mapping, accurate camera tracking, and high-quality rendering in real-time. In the system, we propose a Spatially Consistent Feature Fusion model to reduce the effect of erroneous estimates from pre-trained segmentation head on semantic reconstruction, achieving robust 3D semantic Gaussian mapping. Additionally, we employ a lightweight encoder-decoder to compress the high-dimensional semantic features into a compact 3D Gaussian representation, mitigating the burden of excessive memory consumption. Furthermore, we leverage the advantage of 3D Gaussian splatting, which enables efficient and differentiable novel view rendering, and propose a Virtual Camera View Pruning method to eliminate outlier GS points, thereby effectively enhancing the quality of scene representations. Our NEDS-SLAM method demonstrates competitive performance over existing dense semantic SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets, while also showing excellent capabilities in 3D dense semantic mapping. Proposes NEDS-SLAM, an explicit dense semantic SLAM system using 3D Gaussian Splatting for robust 3D semantic mapping, camera tracking, and real-time rendering. Addresses limitations in existing semantic SLAM methods that rely on accurate semantic pre-segmentation and suffer from inconsistent semantic feature estimation. Combines semantic and appearance features with a fusion module for spatial consistency, compresses semantic features with an encoder-decoder, and employs a virtual camera view pruning method to remove noisy Gaussians. Achieves competitive mapping and tracking accuracy compared to existing dense semantic SLAM methods on Replica and ScanNet datasets. Demonstrates robust semantic reconstruction by mitigating the impact of inconsistent semantic features from pre-trained models. Improves scene representation quality by effectively eliminating outlier Gaussian points through the virtual view pruning method. Virtual view pruning increases computational load and may affect real-time performance. Future work includes optimizing the virtual view method and extending semantic reconstruction to dynamic scenes. 3d gaussian splatting, dense semantic mapping, neural slam, 3d reconstruction, semantic feature fusion
2403.11627 Report LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Boxi Wu, Wei Liu Customization generation techniques have significantly advanced the synthesis of specific concepts across varied contexts. Multi-concept customization emerges as a challenging task within this domain. Existing approaches often rely on training a fusion matrix over multiple Low-Rank Adaptations (LoRAs) to merge various concepts into a single image. However, we identify that this straightforward method faces two major challenges: 1) concept confusion, which occurs when the model cannot preserve distinct individual characteristics, and 2) concept vanishing, where the model fails to generate the intended subjects. To address these issues, we introduce LoRA-Composer, a training-free framework designed for seamlessly integrating multiple LoRAs, thereby enhancing the harmony among different concepts within generated images. LoRA-Composer addresses concept vanishing through Concept Injection Constraints, enhancing concept visibility via an expanded cross-attention mechanism. To combat concept confusion, Concept Isolation Constraints are introduced, refining the self-attention computation. Furthermore, Latent Re-initialization is proposed to effectively stimulate concept-specific latents within designated regions. Our extensive testing showcases a notable enhancement in LoRA-Composer's performance compared to standard baselines, especially when eliminating image-based conditions like canny edges or pose estimations. Code is released at https://github.com/Young98CN/LoRA_Composer. LoRA-Composer, a training-free framework for multi-concept image customization by seamlessly integrating multiple pre-trained concepts encoded as LoRAs. Existing multi-concept customization methods face challenges like concept confusion and concept vanishing, particularly without relying on image-based conditions like sketches or poses. LoRA-Composer addresses these limitations, offering more flexibility and accuracy. LoRA-Composer introduces a novel LoRA-Composer Block within the Stable Diffusion U-Net. It employs Concept Injection Constraints with Region-Aware LoRA Injection and Concept Enhancement to mitigate concept vanishing. It utilizes Concept Isolation Constraints with a concept region mask and Region Perceptual Restriction to address concept confusion. Finally, Latent Re-initialization enhances layout generation by refining the latent space. Outperforms baselines in image similarity across anime and realistic styles, demonstrating effective concept representation. Exhibits robustness even without image-based conditions, unlike methods like Mix-of-Show. User study confirms preference for LoRA-Composer, especially for its text-to-image and image-to-image alignment accuracy. Concept boundaries can disappear when concepts are too close due to down-sampling. Foreground pixels might exceed layout boundaries due to Stable Diffusion's inherent design. Future work will focus on refining the attention mechanism and optimizing inference efficiency. multi-concept customization, lora integration, training-free, controllable generation, diffusion models
2403.11589 Report UV Gaussians: Joint Learning of Mesh Deformation and Gaussian Textures for Human Avatar Modeling Yujiao Jiang, Qingmin Liao, Xiaoyu Li, Li Ma, Qi Zhang, Chaopeng Zhang, Zongqing Lu, Ying Shan Reconstructing photo-realistic drivable human avatars from multi-view image sequences has been a popular and challenging topic in the field of computer vision and graphics. While existing NeRF-based methods can achieve high-quality novel view rendering of human models, both training and inference processes are time-consuming. Recent approaches have utilized 3D Gaussians to represent the human body, enabling faster training and rendering. However, they undermine the importance of the mesh guidance and directly predict Gaussians in 3D space with coarse mesh guidance. This hinders the learning procedure of the Gaussians and tends to produce blurry textures. Therefore, we propose UV Gaussians, which models the 3D human body by jointly learning mesh deformations and 2D UV-space Gaussian textures. We utilize the embedding of UV map to learn Gaussian textures in 2D space, leveraging the capabilities of powerful 2D networks to extract features. Additionally, through an independent Mesh network, we optimize pose-dependent geometric deformations, thereby guiding Gaussian rendering and significantly enhancing rendering quality. We collect and process a new dataset of human motion, which includes multi-view images, scanned models, parametric model registration, and corresponding texture maps. Experimental results demonstrate that our method achieves state-of-the-art synthesis of novel view and novel pose. The code and data will be made available on the homepage https://alex-jyj.github.io/UV-Gaussians/ once the paper is accepted. This paper introduces UV Gaussians, a novel method combining 3D Gaussian Splatting and mesh deformation to reconstruct photo-realistic and animatable human avatars from multi-view images. Existing NeRF-based methods for human avatar modeling are computationally expensive, while recent 3D Gaussian-based methods overlook the importance of accurate mesh guidance for high-quality rendering. UV Gaussians jointly learns pose-dependent mesh deformations using a Mesh U-Net and 2D UV-space Gaussian textures using a Gaussian U-Net. It then uses the refined mesh to guide the animation of 3D Gaussians for rendering. Achieves state-of-the-art performance in novel view synthesis, outperforming NeRF-based and other 3DGS-based methods. Exhibits superior quality in novel pose synthesis, accurately capturing clothing wrinkles and texture details. Demonstrates the effectiveness of mesh guidance and UV space representation for high-fidelity human avatar modeling. Reliance on scanned mesh data limits applicability to scenarios without such information. Limited evaluation on extremely loose clothing types like long skirts. human modeling, neural rendering, gaussian splatting, 3d avatars, mesh deformation
2403.11568 Report EffiVED: Efficient Video Editing via Text-instruction Diffusion Models Zhenghao Zhang, Zuozhuo Dai, Long Qin, Weizhi Wang Large-scale text-to-video models have shown remarkable abilities, but their direct application in video editing remains challenging due to limited available datasets. Current video editing methods commonly require per-video fine-tuning of diffusion models or specific inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that directly supports instruction-guided video editing. To achieve this, we present two efficient workflows to gather video editing pairs, utilizing augmentation and fundamental vision-language techniques. These workflows transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED. Experimental results reveal that EffiVED not only generates high-quality edited videos but also executes rapidly. Finally, we demonstrate that our data collection method significantly improves editing performance and can potentially tackle the scarcity of video editing data. The datasets will be made publicly available upon publication. This paper introduces EffiVED, an efficient diffusion-based model for instruction-guided video editing that does not require per-video fine-tuning. Current video editing methods are computationally expensive, often requiring per-video fine-tuning or inversion optimization. The authors propose two workflows to generate a video editing dataset from: 1) image editing datasets using data augmentation to simulate camera movements, and 2) open-world videos using LLM (ChatGPT) to generate editing instructions and CoDeF to create edited videos. EffiVED is trained on this dataset using a 3D U-Net architecture with decoupled classifier-free guidance for text and video conditions. EffiVED achieves comparable editing quality to state-of-the-art methods like CoDeF. EffiVED is significantly faster than previous methods, achieving a speedup of 6 to 28 times. The proposed data collection method effectively converts existing resources into a high-quality video editing dataset, addressing the data scarcity issue. The quality of generated videos can be further improved, especially for complex motion editing. The model's ability to generalize to unseen editing instructions and video domains needs further exploration. video editing, diffusion models, text-guided synthesis, data augmentation, large language models
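The "decoupled classifier-free guidance for text and video conditions" mentioned above can be illustrated with the standard two-condition guidance combination. This is a hedged sketch: the guidance scales, tensor shapes, and function name are made up and not EffiVED's exact implementation.

```python
# Decoupled classifier-free guidance over two conditions (source video + text instruction).
import torch

def decoupled_cfg(eps_uncond, eps_vid, eps_vid_txt, s_vid=1.5, s_txt=7.5):
    """Combine three U-Net noise predictions:
    eps_uncond  : neither condition
    eps_vid     : video condition only
    eps_vid_txt : video + text instruction
    """
    return (eps_uncond
            + s_vid * (eps_vid - eps_uncond)       # pull toward the source video
            + s_txt * (eps_vid_txt - eps_vid))     # then toward the edit instruction

# toy usage with random tensors standing in for U-Net outputs (B, C, F, H, W)
e0, e1, e2 = (torch.randn(1, 4, 8, 32, 32) for _ in range(3))
print(decoupled_cfg(e0, e1, e2).shape)             # torch.Size([1, 4, 8, 32, 32])
```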
2403.11535 Report EchoReel: Enhancing Action Generation of Existing Video Diffusion Models Jianzhi Liu, Junchen Zhu, Lianli Gao, Jingkuan Song Recent large-scale video datasets have facilitated the generation of diverse open-domain videos of Video Diffusion Models (VDMs). Nonetheless, the efficacy of VDMs in assimilating complex knowledge from these datasets remains constrained by their inherent scale, leading to suboptimal comprehension and synthesis of numerous actions. In this paper, we introduce EchoReel, a novel approach to augment the capability of VDMs in generating intricate actions by emulating motions from pre-existing videos, which are readily accessible from databases or online repositories. EchoReel seamlessly integrates with existing VDMs, enhancing their ability to produce realistic motions without compromising their fundamental capabilities. Specifically, the Action Prism (AP) is introduced to distill motion information from reference videos, which requires training on only a small dataset. Leveraging the knowledge from pre-trained VDMs, EchoReel incorporates new action features into VDMs through the additional layers, eliminating the need for any further fine-tuning of untrained actions. Extensive experiments demonstrate that EchoReel does not merely replicate the whole content from references, and it significantly improves the generation of realistic actions, even in situations where existing VDMs might directly fail. This paper introduces EchoReel, a novel framework that enhances the ability of existing Video Diffusion Models (VDMs) to generate complex human actions by leveraging readily available videos as references in an in-context learning approach. Existing VDMs struggle to learn and synthesize a wide range of actions due to limitations in model scale and data diversity. EchoReel addresses this by enabling VDMs to learn and imitate intricate actions from reference videos, even those not encountered during training. EchoReel consists of two main components: (1) Action Prism: Extracts motion-related features from reference videos using a transformer-based architecture with spatial and temporal self-attention and spatial cross-attention. (2) Action Integration: Integrates extracted motion features into the VDM through newly added temporal cross-attention layers, guiding action generation without altering pre-trained layers. EchoReel significantly improves action generation quality in pre-trained VDMs, as evidenced by substantial reductions in FVD scores and improvements in text-visual alignment and frame consistency. The framework generalizes well to multiple reference videos and shows promising results in image-to-video generation tasks. Ablation studies confirm the importance of each component and design choice within EchoReel, highlighting the effectiveness of the proposed action extraction and integration mechanisms. EchoReel currently faces limitations in improving the generation of objects involved in actions, particularly when the base VDM struggles with synthesizing those objects. Future work will focus on addressing this limitation by exploring methods to enhance the generation of both actions and related objects. video generation, in-context learning, diffusion model, action recognition, motion imitation
2403.11503 Report Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, Xin Tong We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retain their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing. This paper introduces a novel single-image 3D-aware editing method that leverages pre-trained diffusion models, enabling 3D object manipulations (e.g., rotation, translation) on open-domain images without requiring specialized training datasets. Existing 3D-aware editing techniques often rely on synthetic datasets, limiting their effectiveness on real-world images with diverse styles and layouts. This method addresses this limitation by utilizing the powerful generalization capabilities of large-scale, pre-trained image diffusion models. The method employs an iterative algorithm with three phases: (1) View synthesis using depth-based warping and layered diffusion inpainting, (2) Undistortion to correct geometric imperfections using diffusion models as geometry critics, and (3) Shape alignment to refine object shapes using dense image correspondences. The method generates high-quality 3D edits with large viewpoint transformations while maintaining appearance and shape consistency with the input image. It outperforms previous methods, including OBJect-3DIT and Zero123, in terms of layout plausibility, image quality, and appearance consistency, as demonstrated by visual comparisons and quantitative metrics. A user study confirms the superiority of the method, with participants significantly preferring its editing results over other approaches. The method's ability to preserve extremely fine details is limited by the capabilities of the pre-trained diffusion models. Handling large transformations where minimal object regions are visible in the target view remains challenging, requiring further research to enhance robustness. diffusion models, 3d-aware image editing, tuning-free editing, novel view synthesis, geometry correction
2403.11481 Report VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performances on several long-horizon video understanding benchmarks, an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro. Proposes VideoAgent, an LLM-powered multimodal tool-use agent for video understanding that leverages a novel unified memory mechanism. Addresses the limitations of current end-to-end video-language models in handling long-form videos with complex temporal dependencies, which suffer from high computational cost and attention limitations. Constructs a unified memory consisting of a temporal memory storing segment-level descriptions and an object memory tracking object states. It utilizes a minimal set of tools (caption retrieval, segment localization, visual question answering, object memory querying) to interact with this memory and solve tasks. Achieves state-of-the-art performance on EgoSchema, outperforming baselines by up to 26% and approaching the accuracy of Gemini 1.5 Pro. Demonstrates strong performance on Ego4D NLQ, exceeding supervised baselines like 2D-TAN and VSLNet in a zero-shot setting. Outperforms other methods on NExT-QA, particularly excelling in causal questions that demand robust temporal reasoning, and shows significant improvement over using individual tools like Video-LLaVA alone. Limited exploration of real-world applications. Potential for further investigation into incorporating more sophisticated tools and reasoning mechanisms. video understanding, llms, tool-use, multimodal agents, unified memory
2403.11453 Report Bridging 3D Gaussian and Mesh for Freeview Video Rendering Yuting Xiao, Xuan Wang, Jiafei Li, Hongrui Cai, Yanbo Fan, Nan Xue, Minghui Yang, Yujun Shen, Shenghua Gao This is only a preview version of GauMesh. Recently, primitive-based rendering has been proven to achieve convincing results in solving the problem of modeling and rendering the 3D dynamic scene from 2D images. Despite this, in the context of novel view synthesis, each type of primitive has its inherent defects in terms of representation ability. It is difficult to exploit the mesh to depict the fuzzy geometry. Meanwhile, the point-based splatting (e.g. the 3D Gaussian Splatting) method usually produces artifacts or blurry pixels in the area with smooth geometry and sharp textures. As a result, it is difficult, if not impossible, to represent the complex and dynamic scene with a single type of primitive. To this end, we propose a novel approach, GauMesh, to bridge the 3D Gaussian and Mesh for modeling and rendering the dynamic scenes. Given a sequence of tracked mesh as initialization, our goal is to simultaneously optimize the mesh geometry, color texture, opacity maps, a set of 3D Gaussians, and the deformation field. At a specific time, we perform alpha-blending on the RGB and opacity values based on the merged and re-ordered z-buffers from mesh and 3D Gaussian rasterizations. This produces the final rendering, which is supervised by the ground-truth image. Experiments demonstrate that our approach adapts the appropriate type of primitives to represent the different parts of the dynamic scene and outperforms all the baseline methods in both quantitative and qualitative comparisons without losing render speed. Presents GauMesh, a novel approach for freeview video rendering that bridges the strengths of 3D Gaussian splatting and triangle meshes in a hybrid representation. Addresses limitations of using a single primitive type for representing complex dynamic scenes, aiming to leverage the advantages of each type for improved visual quality and rendering efficiency. Employs a hybrid differentiable rendering pipeline that blends 3D Gaussians and textured meshes. Uses a grid-based deformation field for 3D Gaussians and a mesh tracking approach initialized from keyframes. Achieves state-of-the-art performance on the Multiface dataset, demonstrating superior visual quality compared to baselines. Effectively reconstructs both complex geometry (e.g., hair) and fine color details on smooth surfaces (e.g., facial features). Maintains fast rendering capabilities due to the use of rasterization-based rendering for both 3D Gaussians and meshes. Could explore more advanced mesh deformation techniques beyond simple vertex translation. Further investigate compression methods for the deformation field to improve storage efficiency. freeview video, primitive-based rendering, novel view synthesis, 3d gaussian splatting, hybrid representation
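The core of the hybrid rendering step is merging per-pixel fragments from the mesh rasterizer and the Gaussian rasterizer into one re-ordered z-buffer and alpha-blending them front to back. A minimal sketch of that compositing logic, with made-up fragment values standing in for real rasterizer outputs:

```python
import numpy as np

def composite(depths, colors, alphas):
    """Front-to-back alpha compositing of per-pixel fragments."""
    order = np.argsort(depths)                   # merged, re-ordered z-buffer
    rgb, transmittance = np.zeros(3), 1.0
    for i in order:
        rgb += transmittance * alphas[i] * colors[i]
        transmittance *= (1.0 - alphas[i])
    return rgb

# two mesh fragments + three Gaussian fragments for one pixel (toy numbers)
depths = np.array([2.0, 3.5, 1.8, 2.7, 4.0])
colors = np.random.rand(5, 3)
alphas = np.array([0.9, 0.8, 0.3, 0.4, 0.2])     # mesh opacity map values vs. Gaussian alphas
print(composite(depths, colors, alphas))
```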
2403.11451 Report CasSR: Activating Image Power for Real-World Image Super-Resolution Haolan Chen, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Wei Hu The objective of image super-resolution is to generate clean and high-resolution images from degraded versions. Recent advancements in diffusion modeling have led to the emergence of various image super-resolution techniques that leverage pretrained text-to-image (T2I) models. Nevertheless, due to the prevalent severe degradation in low-resolution images and the inherent characteristics of diffusion models, achieving high-fidelity image restoration remains challenging. Existing methods often exhibit issues including semantic loss, artifacts, and the introduction of spurious content not present in the original image. To tackle this challenge, we propose Cascaded diffusion for Super-Resolution, CasSR, a novel method designed to produce highly detailed and realistic images. In particular, we develop a cascaded controllable diffusion model that aims to optimize the extraction of information from low-resolution images. This model generates a preliminary reference image to facilitate initial information extraction and degradation mitigation. Furthermore, we propose a multi-attention mechanism to enhance the T2I model's capability in maximizing the restoration of the original image content. Through a comprehensive blend of qualitative and quantitative analyses, we substantiate the efficacy and superiority of our approach. This paper introduces CasSR, a novel cascaded diffusion model designed for real-world image super-resolution, emphasizing image guidance over semantic information for enhanced fidelity and detail. Existing diffusion-based super-resolution methods often struggle with semantic loss, artifacts, and spurious content, particularly when handling severely degraded images. CasSR addresses these limitations by maximizing the extraction and utilization of information from the low-resolution input itself. CasSR employs a two-stage approach: (1) an image activation module (e.g., SCEdit) enhances the input image, generating a reference image with reduced degradation. (2) a multiple attention module integrates information from both the original and enhanced images, guiding a pre-trained Stable Diffusion model for high-fidelity restoration. CasSR consistently outperforms or achieves competitive results against state-of-the-art methods on both real-world and synthetic benchmarks. The method excels in perceptual metrics (MUSIQ, MANIQA), indicating superior image quality and detail restoration. Ablation studies highlight the effectiveness of the image activation and multiple attention modules, demonstrating the importance of image guidance over relying solely on semantic information (text prompts). The performance of CasSR may be slightly impacted when input images are cropped, resulting in information loss. Future work could explore alternative image activation techniques for even richer reference image generation. image super-resolution, diffusion models, image restoration, text-to-image models, image activation
2403.11447 Report Motion-aware 3D Gaussian Splatting for Efficient Dynamic Scene Reconstruction Zhiyang Guo, Wengang Zhou, Li Li, Min Wang, Houqiang Li 3D Gaussian Splatting (3DGS) has become an emerging tool for dynamic scene reconstruction. However, existing methods focus mainly on extending static 3DGS into a time-variant representation, while overlooking the rich motion information carried by 2D observations, thus suffering from performance degradation and model redundancy. To address the above problem, we propose a novel motion-aware enhancement framework for dynamic scene reconstruction, which mines useful motion cues from optical flow to improve different paradigms of dynamic 3DGS. Specifically, we first establish a correspondence between 3D Gaussian movements and pixel-level flow. Then a novel flow augmentation method is introduced with additional insights into uncertainty and loss collaboration. Moreover, for the prevalent deformation-based paradigm that presents a harder optimization problem, a transient-aware deformation auxiliary module is proposed. We conduct extensive experiments on both multi-view and monocular scenes to verify the merits of our work. Compared with the baselines, our method shows significant superiority in both rendering quality and efficiency. This paper introduces a motion-aware enhancement framework for dynamic 3D Gaussian Splatting, improving reconstruction quality and efficiency by leveraging optical flow priors. Existing dynamic 3DGS methods often overlook rich motion information in 2D sequences, leading to performance degradation and model redundancy. The framework establishes a cross-dimensional correspondence between 3D Gaussian movements and pixel-level optical flow. It features uncertainty-aware flow augmentation and a transient-aware deformation auxiliary module for enhanced optimization. The method outperforms baselines in multi-view and monocular dynamic scene benchmarks, achieving higher PSNR, SSIM, and lower LPIPS. Motion-aware regularization reduces Gaussian and motion redundancy, enabling more efficient dynamic modeling, especially in monocular settings. The framework exhibits robustness under sparser viewpoints for multi-view scenarios, demonstrating potential for wider application. Motion blur remains a challenge as the model might overfit blurred regions, impacting temporal consistency. Exploring additional priors beyond optical flow could further mitigate motion uncertainty, particularly in monocular scenes. 3d gaussian splatting, dynamic scene reconstruction, optical flow, motion awareness, neural rendering
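A hedged sketch of how 3D Gaussian motion can be tied to pixel-level optical flow, the correspondence the paper builds on: project Gaussian centers at two timesteps and penalize the deviation of the resulting 2D displacement from a precomputed flow prior. The pinhole camera, loss form, and numbers below are simplified assumptions and omit the paper's uncertainty-aware weighting.

```python
import torch

def project(points, K):
    """Pinhole projection of (N, 3) camera-space points with intrinsics K."""
    uv = (K @ points.t()).t()
    return uv[:, :2] / uv[:, 2:3]

K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
centers_t0 = torch.rand(1000, 3) * 2 + torch.tensor([0.0, 0.0, 3.0])   # Gaussian centers at t0
centers_t1 = centers_t0 + 0.01 * torch.randn(1000, 3)                  # deformed centers at t1

gaussian_flow = project(centers_t1, K) - project(centers_t0, K)        # "rendered" 2D motion
flow_prior = gaussian_flow + 0.5 * torch.randn_like(gaussian_flow)     # stand-in for an optical-flow estimate
flow_loss = (gaussian_flow - flow_prior).abs().mean()
print(flow_loss)
```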
2403.11423 Report VmambaIR: Visual State Space Model for Image Restoration Yuan Shi, Bin Xia, Xiaoyu Jin, Xing Wang, Tianyu Zhao, Xin Xia, Xuefeng Xiao, Wenming Yang Image restoration is a critical task in low-level computer vision, aiming to restore high-quality images from degraded inputs. Various models, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), transformers, and diffusion models (DMs), have been employed to address this problem with significant impact. However, CNNs have limitations in capturing long-range dependencies. DMs require large prior models and computationally intensive denoising steps. Transformers have powerful modeling capabilities but face challenges due to quadratic complexity with input image size. To address these challenges, we propose VmambaIR, which introduces State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks. We utilize a Unet architecture to stack our proposed Omni Selective Scan (OSS) blocks, consisting of an OSS module and an Efficient Feed-Forward Network (EFFN). Our proposed omni selective scan mechanism overcomes the unidirectional modeling limitation of SSMs by efficiently modeling image information flows in all six directions. Furthermore, we conducted a comprehensive evaluation of our VmambaIR across multiple image restoration tasks, including image deraining, single image super-resolution, and real-world image super-resolution. Extensive experimental results demonstrate that our proposed VmambaIR achieves state-of-the-art (SOTA) performance with much fewer computational resources and parameters. Our research highlights the potential of state space models as promising alternatives to the transformer and CNN architectures in serving as foundational frameworks for next-generation low-level visual tasks. This paper introduces VmambaIR, a novel image restoration network leveraging state space models (SSMs) with linear complexity for tasks like image deraining and super-resolution. Existing methods like CNNs, GANs, and Transformers face limitations in capturing long-range dependencies, high computational costs, or quadratic complexity. SSMs offer a promising alternative with linear complexity and efficient high-frequency modeling capabilities. VmambaIR incorporates a Unet architecture with Omni Selective Scan (OSS) blocks. The OSS block consists of an OSS module for comprehensive information flow modeling from six directions and an Efficient Feed-Forward Network (EFFN) for information flow regulation across hierarchical levels. VmambaIR achieves state-of-the-art performance on single image super-resolution, outperforming existing methods in both PSNR and LPIPS metrics. In real-world image super-resolution, VmambaIR achieves superior results with only 26% of the computational cost compared to previous SOTA methods. VmambaIR demonstrates superior performance in image deraining, exceeding previous methods in PSNR and SSIM while maintaining lower complexity. The current design of selective scan operations in OSS involves significant data type and dimension conversions, leading to slower speeds compared to vanilla convolution despite similar computational complexity. Future work includes exploring the application of VmambaIR to video processing and other low-level vision tasks. state space models, image restoration, super-resolution, image deraining, omni selective scan
2403.11415 Report DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation Jeongsol Kim, Geon Yeong Park, Jong Chul Ye Reverse sampling and score-distillation have emerged as main workhorses in recent years for image manipulation using latent diffusion models (LDMs). While reverse diffusion sampling often requires adjustments of LDM architecture or feature engineering, score distillation offers a simple yet powerful model-agnostic approach, but it is often prone to mode-collapsing. To address these limitations and leverage the strengths of both approaches, here we introduce a novel framework called DreamSampler, which seamlessly integrates these two distinct approaches through the lens of regularized latent optimization. Similar to score-distillation, DreamSampler is a model-agnostic approach applicable to any LDM architecture, but it allows both distillation and reverse sampling with additional guidance for image editing and reconstruction. Through experiments involving image editing, SVG reconstruction, and more, we demonstrate the competitive performance of DreamSampler compared to existing approaches, while providing new applications. DreamSampler is a novel framework for image manipulation that unifies diffusion sampling and score distillation via regularized latent optimization. Reverse diffusion sampling often requires architectural adjustments or feature engineering. Score distillation, while model-agnostic, is prone to mode collapse. DreamSampler addresses these limitations, leveraging the strengths of both approaches. DreamSampler interprets latent optimization during reverse diffusion as a proximal update, allowing integration of regularization terms. It shows that the proximal update loss can be conceptualized as the score distillation loss, enabling their unification. DreamSampler enables novel applications like image vectorization from blurry input with semantic text guidance, outperforming multi-stage baseline approaches. For real image editing, DreamSampler effectively modifies images according to text prompts while preserving image fidelity and outperforming or being on par with existing methods. In text-guided image inpainting, DreamSampler generates semantically consistent content within masked regions while maintaining high fidelity to the original image, surpassing baseline methods in reconstruction quality. DreamSampler's performance is reliant on the quality of the pre-trained diffusion model. Further exploration of time scheduling and regularization functions could improve DreamSampler's efficacy. latent diffusion model, image manipulation, score distillation, reverse diffusion sampling, image generation
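For reference, the score-distillation side of the unification can be sketched as the usual SDS update: diffuse the current latent, query the denoiser, and step the latent along w(t)·(ε̂ − ε) without backpropagating through the model. This is a generic SDS step with a stub denoiser, not DreamSampler's regularized proximal formulation; all names and constants are illustrative.

```python
import torch

def sds_step(z0, eps_model, alpha_t, w=1.0, lr=0.1):
    eps = torch.randn_like(z0)
    z_t = alpha_t.sqrt() * z0 + (1 - alpha_t).sqrt() * eps   # forward-diffuse the current latent
    grad = w * (eps_model(z_t) - eps)                        # SDS gradient; no backprop through the model
    return z0 - lr * grad

eps_model = lambda z: 0.05 * z                               # placeholder for the text-conditioned denoiser
z0 = torch.randn(1, 4, 64, 64)
for _ in range(5):
    z0 = sds_step(z0, eps_model, torch.tensor(0.5))
print(z0.shape)
```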
2403.11401 Report Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation, that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features in the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings. Scene-LLM is a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs) with both egocentric and scene-level 3D information. Existing visual-language models often struggle to handle persistent 3D spatial information and scene changes in interactive environments, limiting their effectiveness in tasks like indoor planning. The model employs a hybrid 3D visual feature representation, integrating both egocentric and scene-level information. It uses a projection layer to align these features with pre-trained textual embeddings. A two-stage training strategy first aligns conceptual features and then fine-tunes with instructional following annotations. Scene-LLM achieves state-of-the-art results on ScanQA and SQA3D benchmarks for 3D visual question answering, demonstrating strong 3D scene understanding and reasoning. The model effectively handles scene changes and performs well on the Alfred benchmark for interactive planning, highlighting its ability in dynamic environments. Ablation studies show the effectiveness of the hybrid representation, the importance of egocentric and scene-level updates, and the benefit of using frame data for concept alignment. Current limitations include a dependence on the maximum token length of the LLM, posing challenges for processing high-resolution 3D scenes. The model currently lacks an explicit state detection mechanism for complex dynamic scenes, potentially hindering performance in such environments. 3d visual language model, interactive planning, egocentric and scene-level understanding, hybrid 3d feature representation, large language models
2403.11324 Report GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering Yanyan Li, Chenyu Lyu, Yan Di, Guangyao Zhai, Gim Hee Lee, Federico Tombari During the Gaussian Splatting optimization process, the scene's geometry can gradually deteriorate if its structure is not deliberately preserved, especially in non-textured regions such as walls, ceilings, and furniture surfaces. This degradation significantly affects the rendering quality of novel views that deviate significantly from the viewpoints in the training data. To mitigate this issue, we propose a novel approach called GeoGaussian. Based on the smoothly connected areas observed from point clouds, this method introduces a novel pipeline to initialize thin Gaussians aligned with the surfaces, where the characteristic can be transferred to new generations through a carefully designed densification strategy. Finally, the pipeline ensures that the scene's geometry and texture are maintained through constrained optimization processes with explicit geometry constraints. Benefiting from the proposed architecture, the generative ability of 3D Gaussians is enhanced, especially in structured regions. Our proposed pipeline achieves state-of-the-art performance in novel view synthesis and geometric reconstruction, as evaluated qualitatively and quantitatively on public datasets. GeoGaussian, a novel geometry-aware Gaussian Splatting method for enhancing 3D scene representation and novel view synthesis, especially in low-textured regions. Gaussian Splatting methods often prioritize image clarity over geometric fidelity, leading to degradation in rendering performance for novel views, particularly in non-textured areas. The method leverages thin ellipsoid Gaussian parameterization initialized based on surface normals, employs a constrained densification strategy to ensure new Gaussians align with smooth surfaces, and introduces a geometrically consistent loss function during optimization. GeoGaussian achieves state-of-the-art performance in novel view synthesis, outperforming 3DGS and LightGS on Replica and TUM RGB-D datasets, especially in sparse view scenarios. The method demonstrates improved geometry accuracy compared to 3DGS, as evidenced by better alignment of point clouds with ground truth mesh models. GeoGaussian shows faster convergence and enhanced robustness during training due to accurate initialization and constrained densification strategies. Reliance on accurate point cloud normals for initialization. Limited performance in non-structured scenes where accurate normal estimation is challenging. gaussian splatting, novel view synthesis, 3d reconstruction, geometry-aware densification, thin ellipsoid gaussian
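A hedged sketch of the "thin Gaussian" initialization idea described above: build an orthonormal frame whose third axis is the local surface normal and shrink the Gaussian's scale along that axis, so the ellipsoid hugs the surface. The scale ratios and parameterization are illustrative assumptions, not GeoGaussian's exact settings.

```python
import numpy as np

def thin_gaussian_from_normal(normal, scale_tangent=0.05, scale_normal=0.005):
    """Covariance of a Gaussian flattened along the surface normal."""
    n = normal / np.linalg.norm(normal)
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t1 = np.cross(n, helper); t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    R = np.stack([t1, t2, n], axis=1)                  # columns: tangent, tangent, normal
    S = np.diag([scale_tangent, scale_tangent, scale_normal])
    return R @ S @ S @ R.T                             # Sigma = R S S^T R^T

print(np.round(thin_gaussian_from_normal(np.array([0.0, 0.0, 1.0])), 6))
```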
2403.11262 Report Understanding Diffusion Models by Feynman's Path Integral Yuji Hirono, Akinori Tanaka, Kenji Fukushima Score-based diffusion models have proven effective in image generation and have gained widespread usage; however, the underlying factors contributing to the performance disparity between stochastic and deterministic (i.e., the probability flow ODEs) sampling schemes remain unclear. We introduce a novel formulation of diffusion models using Feynman's path integral, which is a formulation originally developed for quantum physics. We find this formulation providing comprehensive descriptions of score-based generative models, and demonstrate the derivation of backward stochastic differential equations and loss functions. The formulation accommodates an interpolating parameter connecting stochastic and deterministic sampling schemes, and we identify this parameter as a counterpart of Planck's constant in quantum physics. This analogy enables us to apply the Wentzel-Kramers-Brillouin (WKB) expansion, a well-established technique in quantum physics, for evaluating the negative log-likelihood to assess the performance disparity between stochastic and deterministic sampling schemes. This paper presents a novel formulation of diffusion models using Feynman's path integral, a framework originating from quantum physics. The formulation offers a unified perspective on various aspects of score-based generative models and provides a new method for scrutinizing the role of noise in the sampling process. This formulation is important because it allows for a deeper understanding of diffusion models by connecting them to well-established techniques in quantum physics. It also provides a way to calculate the negative log-likelihood for stochastic sampling processes, which was previously elusive. The authors apply path integral techniques to derive the time-reversed stochastic differential equations and loss functions for diffusion models. They introduce an interpolating parameter linking stochastic and deterministic sampling schemes and use the Wentzel–Kramers–Brillouin (WKB) expansion to evaluate the negative log-likelihood for stochastic processes. The path integral formulation provides an alternative derivation of the time-reversed SDE. The interpolating parameter plays an analogous role to Planck's constant in quantum physics, and the limit of zero noise corresponds to the classical limit. The WKB expansion enables a perturbative evaluation of the negative log-likelihood, quantifying the impact of noise on the sampling process. The current experiments do not include actual image data due to limitations in evaluating NLLs for high-dimensional data. The estimated numerical error in the computed NLLs might be underestimated. diffusion models, path integral, wkb expansion, negative log-likelihood, stochastic sampling
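For context, the interpolating parameter can be illustrated with the standard one-parameter family of reverse-time samplers that all share the marginals of the forward SDE dx = f(x,t)dt + g(t)dw; the notation below is mine and may differ from the paper's.

```latex
% lambda = 1 recovers the usual reverse-time SDE (stochastic sampling);
% lambda = 0 recovers the probability flow ODE (deterministic sampling).
\begin{equation}
  \mathrm{d}x \;=\;
  \Bigl[\, f(x,t) \;-\; \tfrac{1+\lambda^{2}}{2}\, g(t)^{2}\,
        \nabla_{x}\log p_{t}(x) \Bigr]\,\mathrm{d}t
  \;+\; \lambda\, g(t)\,\mathrm{d}\bar{w}_{t}
\end{equation}
```

In the paper's analogy, this λ-like knob plays the role of Planck's constant, which is what makes a WKB-style expansion of the negative log-likelihood around the deterministic limit applicable.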
2403.11247 Report Compact 3D Gaussian Splatting For Dense Visual SLAM Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Danwei Wang, Weidong Chen Recent work has shown that 3D Gaussian-based SLAM enables high-quality reconstruction, accurate pose estimation, and real-time rendering of scenes. However, these approaches are built on a tremendous number of redundant 3D Gaussian ellipsoids, leading to high memory and storage costs, and slow training speed. To address the limitation, we propose a compact 3D Gaussian Splatting SLAM system that reduces the number and the parameter size of Gaussian ellipsoids. A sliding window-based masking strategy is first proposed to reduce the redundant ellipsoids. Then we observe that the covariance matrix (geometry) of most 3D Gaussian ellipsoids are extremely similar, which motivates a novel geometry codebook to compress 3D Gaussian geometric attributes, i.e., the parameters. Robust and accurate pose estimation is achieved by a global bundle adjustment method with reprojection loss. Extensive experiments demonstrate that our method achieves faster training and rendering speed while maintaining the state-of-the-art (SOTA) quality of the scene representation. This paper introduces a novel 3D Gaussian Splatting-based SLAM system that compresses scene representation to enhance speed, storage efficiency, and rendering while maintaining high-quality reconstruction. Existing 3D Gaussian-based SLAM methods, while offering high-quality reconstruction, suffer from high memory and storage costs and slow training speeds due to a large number of redundant 3D Gaussian ellipsoids. The proposed system employs a three-pronged approach: 1) a sliding window-based online masking method to remove redundant 3D Gaussian ellipsoids, 2) a codebook-based method to compress the geometric attributes of the remaining ellipsoids, and 3) a global bundle adjustment method with reprojection loss for accurate and robust camera pose estimation. The system achieves a nearly 176% increase in rendering speed compared to existing GS-based SLAM methods. It achieves over 1.97x compression on memory usage compared to existing GS-based SLAM methods. The system maintains state-of-the-art quality of scene representation despite the significant reduction in the number of Gaussian ellipsoids. The system's performance relies heavily on the quality of depth information, which might be limited in real-world scenarios with noisy or incomplete depth data. Future work could explore incorporating semantic information into the scene representation to further enhance the system's capabilities and performance in complex environments. slam, 3d gaussian splatting, scene representation, compression, real-time rendering
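A hedged sketch of the geometry-codebook idea: because many Gaussians have nearly identical covariance parameters, the per-Gaussian (scale, rotation) vectors can be quantized against a small codebook, storing one index per Gaussian. K-means below is a generic stand-in for the paper's codebook learning; the sizes and dtypes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
geom = rng.standard_normal((10000, 7)).astype(np.float32)   # 3 scales + 4 quaternion components per Gaussian
codebook_size = 256

km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(geom)
indices = km.labels_.astype(np.uint8)                       # 1 byte per Gaussian
codebook = km.cluster_centers_                              # (256, 7) shared geometry entries

orig_bytes = geom.nbytes
compressed_bytes = indices.nbytes + codebook.nbytes
print(orig_bytes, compressed_bytes, orig_bytes / compressed_bytes)
```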
2403.11207 Report MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, Tanishq Mathew Abraham Reconstructions of visual perception from brain activity have improved tremendously, but the practical utility of such methods has been limited. This is because such models are trained independently per subject where each subject requires dozens of hours of expensive fMRI training data to attain high-quality results. The present work showcases high-quality reconstructions using only 1 hour of fMRI training data. We pretrain our model across 7 subjects and then fine-tune on minimal data from a new subject. Our novel functional alignment procedure linearly maps all brain data to a shared-subject latent space, followed by a shared non-linear mapping to CLIP image space. We then map from CLIP space to pixel space by fine-tuning Stable Diffusion XL to accept CLIP latents as inputs instead of text. This approach improves out-of-subject generalization with limited training data and also attains state-of-the-art image retrieval and reconstruction metrics compared to single-subject approaches. MindEye2 demonstrates how accurate reconstructions of perception are possible from a single visit to the MRI facility. All code is available on GitHub. MindEye2 reconstructs visual perception from fMRI data using only one hour of training data per subject, achieving comparable quality to previous approaches that require dozens of hours. This advancement holds the potential to revolutionize clinical assessment and brain-computer interfaces by enabling practical reconstruction of perception from minimal fMRI data. The approach pretrains a shared-subject model on data from multiple subjects, then fine-tunes it on limited data from a new subject. It maps fMRI activity to a shared latent space using ridge regression, then to CLIP image space using an MLP backbone and diffusion prior. Finally, a fine-tuned Stable Diffusion XL model generates images from the predicted CLIP embeddings. Achieves state-of-the-art performance on image retrieval and reconstruction metrics when trained on the full Natural Scenes Dataset. Maintains competitive decoding performance with only 2.5% of a subject's full dataset (one hour of scanning data). Outperforms previous methods in subjective human evaluations of reconstruction quality, even with limited training data. fMRI's sensitivity to movement and task compliance can affect decoding accuracy. The model's current focus on natural scenes may require additional data or specialized models for other image distributions. Future work could explore expanding to other image types or real-time applications. neuroai, fmri, computational neuroscience, visual perception, deep learning
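A hedged sketch of the shared-subject alignment step: a per-subject ridge regression maps flattened fMRI activity into a common latent space, on top of which the shared non-linear backbone operates. The data, dimensions, and regularization strength below are synthetic placeholders, not MindEye2's actual values.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_trials, n_voxels, shared_dim = 200, 1500, 4096    # toy sizes
X = rng.standard_normal((n_trials, n_voxels))       # one subject's fMRI responses
Y = rng.standard_normal((n_trials, shared_dim))     # targets in the shared-subject latent space

ridge = Ridge(alpha=6e4, fit_intercept=True)        # alpha is illustrative
ridge.fit(X, Y)
Z = ridge.predict(X)                                # this subject's data, linearly aligned
print(Z.shape)                                      # (200, 4096)
```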
2403.11197 Report TAG: Guidance-free Open-Vocabulary Semantic Segmentation Yasufumi Kawano, Yoshimitsu Aoki Semantic segmentation is a crucial task in computer vision, where each pixel in an image is classified into a category. However, traditional methods face significant challenges, including the need for pixel-level annotations and extensive training. Furthermore, because supervised learning uses a limited set of predefined categories, models typically struggle with rare classes and cannot recognize new ones. Unsupervised and open-vocabulary segmentation, proposed to tackle these issues, faces challenges, including the inability to assign specific class labels to clusters and the necessity of user-provided text queries for guidance. In this context, we propose a novel approach, TAG which achieves Training, Annotation, and Guidance-free open-vocabulary semantic segmentation. TAG utilizes pre-trained models such as CLIP and DINO to segment images into meaningful categories without additional training or dense annotations. It retrieves class labels from an external database, providing flexibility to adapt to new scenarios. Our TAG achieves state-of-the-art results on PascalVOC, PascalContext and ADE20K for open-vocabulary segmentation without given class names, i.e. improvement of +15.3 mIoU on PascalVOC. All code and data will be released at https://github.com/Valkyrja3607/TAG. TAG, a novel Training, Annotation, and Guidance-free method for open-vocabulary semantic segmentation, retrieves segment categories from an external database using CLIP and DINOv2. Addresses limitations of traditional semantic segmentation methods: reliance on costly pixel-level annotations, predefined categories, and the need for user-provided text queries in open-vocabulary settings. 1. Identifies segment candidates using per-pixel features from DINOv2. 2. Obtains representative segment embeddings using CLIP's per-pixel features. 3. Assigns categories by retrieving closest matching sentences from an external database. Achieves state-of-the-art results on PascalVOC, PascalContext, and ADE20K for open-vocabulary segmentation without given class names. Shows significant improvement (+15.3 mIoU) over previous zero-guidance segmentation methods on PascalVOC. Successfully segments and labels images containing general objects, specific categories like 'joker,' and proper nouns. Performance depends on the choice of database, posing challenges for unknown domains. Doesn't differentiate between class granularity levels, potentially predicting a broader category than desired. semantic segmentation, open-vocabulary segmentation, zero-guidance segmentation, clip, dinov2
2403.11194 Report MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation Yasufumi Kawano, Yoshimitsu Aoki Semantic segmentation is essential in computer vision for various applications, yet traditional approaches face significant challenges, including the high cost of annotation and extensive training for supervised learning. Additionally, due to the limited predefined categories in supervised learning, models typically struggle with infrequent classes and are unable to predict novel classes. To address these limitations, we propose MaskDiffusion, an innovative approach that leverages pretrained frozen Stable Diffusion to achieve open-vocabulary semantic segmentation without the need for additional training or annotation, leading to improved performance compared to similar methods. We also demonstrate the superior performance of MaskDiffusion in handling open vocabularies, including fine-grained and proper noun-based categories, thus expanding the scope of segmentation applications. Overall, our MaskDiffusion shows significant qualitative and quantitative improvements in contrast to other comparable unsupervised segmentation methods, i.e. on the Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff (+14.8 mIoU compared to DiffSeg). All code and data will be released at https://github.com/Valkyrja3607/MaskDiffusion. This paper introduces MaskDiffusion, a novel method leveraging pre-trained Stable Diffusion models for open-vocabulary semantic segmentation without additional training or annotation. Semantic segmentation faces challenges such as annotation costs and limitations in predicting novel classes. MaskDiffusion addresses these issues by exploiting the rich semantic information embedded in diffusion models pre-trained on massive image-text datasets. MaskDiffusion extracts internal features and cross-attention maps from a frozen Stable Diffusion model. It then calculates representative internal features for each category using a weighted average based on cross-attention map values. Finally, it assigns classes to pixels by measuring the cosine similarity between pixel-wise internal features and representative features. MaskDiffusion outperforms previous state-of-the-art methods like MaskCLIP and GEM on datasets like Potsdam, Cityscapes, PascalVOC, and COCO-Stuff. The method demonstrates robust open-vocabulary segmentation capabilities, successfully segmenting challenging concepts, rare classes, and proper nouns. An unsupervised version, Unsupervised MaskDiffusion, utilizing spectral clustering on internal features, outperforms other unsupervised methods, including DiffSeg, on Cityscapes and COCO-Stuff datasets. The cross-attention map in MaskDiffusion shows limitations in accurately assigning internal features to classes. The current method assumes prior knowledge of potential classes in the image. semantic segmentation, open-vocabulary segmentation, diffusion models, stable diffusion, unsupervised learning
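A minimal sketch of the class-assignment step described above: form one representative internal feature per candidate class as a cross-attention-weighted average of per-pixel features, then label each pixel by cosine similarity to these prototypes. Features and attention maps are random stand-ins for values extracted from the frozen Stable Diffusion U-Net.

```python
import torch
import torch.nn.functional as F

H, W, C, K = 32, 32, 320, 5                       # spatial size, feature channels, candidate classes
feats = torch.randn(H * W, C)                     # per-pixel internal features
attn = torch.rand(H * W, K)                       # cross-attention to K class tokens

weights = attn / attn.sum(dim=0, keepdim=True)    # normalize per class
prototypes = weights.t() @ feats                  # (K, C) representative features

sim = F.cosine_similarity(feats.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
segmentation = sim.argmax(dim=1).reshape(H, W)    # per-pixel class labels
print(segmentation.shape, segmentation.unique())
```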
2403.11176 Report Quality-Aware Image-Text Alignment for Real-World Image Quality Assessment Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. The reliance on annotated Mean Opinion Scores (MOS) in the majority of state-of-the-art NR-IQA approaches limits their scalability and broader applicability to real-world scenarios. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware method that does not require labeled MOS. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate representations that correlate with the inherent quality of the images. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts, while guaranteeing consistent representations for images with comparable quality. Our method achieves state-of-the-art performance on several datasets with authentic distortions. Moreover, despite not requiring MOS, QualiCLIP outperforms supervised methods when their training dataset differs from the testing one, thus proving to be more suitable for real-world scenarios. Furthermore, our approach demonstrates greater robustness and improved explainability than competing methods. The code and the model are publicly available at https://github.com/miccunifi/QualiCLIP. This paper proposes QualiCLIP, a self-supervised and opinion-unaware No-Reference Image Quality Assessment (NR-IQA) method based on CLIP that does not require labeled Mean Opinion Scores (MOS). Existing NR-IQA methods are limited by their reliance on expensive and scale-limiting MOS labels, hindering their applicability to real-world scenarios. This paper addresses this challenge by leveraging the capabilities of CLIP. The method utilizes a quality-aware image-text alignment strategy. Pairs of pristine image crops are synthetically degraded with increasing levels of intensity. The CLIP image encoder is then fine-tuned to rank these degraded images based on their similarity to quality-related antonym text prompts, like 'Good photo' and 'Bad photo'. QualiCLIP achieves state-of-the-art performance on multiple IQA datasets with authentic distortions, outperforming existing opinion-unaware methods. Despite not using MOS labels, QualiCLIP surpasses supervised methods in cross-dataset evaluations, demonstrating superior generalization ability for real-world applications. QualiCLIP exhibits improved robustness compared to other methods, as shown by gMAD competition results, and showcases enhanced explainability through gradCAM visualization. The method relies on synthetic distortions during training, which may not fully represent the complexities of real-world image degradations. Future work could explore the application of QualiCLIP’s quality-aware image representations to improve CLIP-based semantic tasks such as image retrieval. image quality assessment, clip, self-supervised learning, opinion-unaware, image-text alignment
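A hedged sketch of the quality-aware ranking objective: a more strongly degraded crop should score lower on the "Good photo" vs. "Bad photo" antonym-prompt softmax than a lightly degraded one. The encoders are stubbed with random vectors, and the margin and temperature are illustrative; the actual method fine-tunes the CLIP image encoder and adds consistency constraints.

```python
import torch
import torch.nn.functional as F

def quality_score(img_feat, txt_good, txt_bad, tau=0.01):
    sims = torch.stack([F.cosine_similarity(img_feat, txt_good, dim=-1),
                        F.cosine_similarity(img_feat, txt_bad, dim=-1)], dim=-1)
    return F.softmax(sims / tau, dim=-1)[..., 0]    # probability mass on "Good photo"

# toy features for the same crop at increasing degradation levels 0..3
feats = [torch.randn(512) for _ in range(4)]
txt_good, txt_bad = torch.randn(512), torch.randn(512)
scores = torch.stack([quality_score(f, txt_good, txt_bad) for f in feats])

# margin ranking loss: the score at level i should exceed the score at level i+1
margin = 0.05
rank_loss = sum(F.relu(scores[i + 1] - scores[i] + margin) for i in range(3))
print(scores, rank_loss)
```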
2403.11162 Report CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion Xiaoyu Wu, Yang Hua, Chumeng Liang, Jiaru Zhang, Hao Wang, Tao Song, Haibing Guan Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot generation where a pretrained model is fine-tuned on a small set of images to capture a specific style or object. Despite their success, concerns exist about potential copyright violations stemming from the use of unauthorized data in this process. In response, we present Contrasting Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring vivid visual representations for digital copyright authentication. Our approach involves removing partial information of an image and recovering missing details by exploiting conceptual differences between the pretrained and fine-tuned models. We formulate the differences as KL divergence between latent variables of the two models when given the same input image, which can be maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). The similarity between original and recovered images serves as a strong indicator of potential infringements. Extensive experiments on the WikiArt and Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital copyright authentication, surpassing alternative validation techniques. Code implementation is available at https://github.com/Nicholas0228/Revelio. This paper presents CGI-DM, a novel method for digital copyright authentication in few-shot image generation using diffusion models (DMs). CGI-DM leverages the differences between pre-trained and fine-tuned models to recover missing image details and authenticate copyright. Few-shot image generation techniques, while powerful, raise concerns about copyright infringement. Existing methods struggle to provide robust visual evidence for legal action. This work addresses this by providing a robust and visual method for authenticating copyright in DM-generated images. CGI-DM removes partial information from an image and then leverages the conceptual differences between pre-trained and fine-tuned DMs to recover the missing details. It maximizes the KL divergence between the latent variable distributions of the two models through Monte Carlo sampling and Projected Gradient Descent (PGD). CGI-DM achieves high accuracy in distinguishing between images used for training and those not used, outperforming existing image generation and inpainting pipelines. The method is robust across different DM architectures, training image numbers, and training steps. CGI-DM remains effective even under various defense mechanisms, demonstrating its resilience against attempts to mask training data. The computational cost of CGI-DM increases with the number of Monte Carlo sampling steps. Future work could explore combining CGI-DM with data watermarking techniques to create a more comprehensive copyright protection system. diffusion models, copyright authentication, few-shot image generation, gradient inversion, digital copyright
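A hedged sketch of the contrasting-gradient-inversion loop: starting from a partially corrupted image, run projected gradient ascent on the input to widen the gap between the fine-tuned and pretrained models' predictions, a crude stand-in for the Monte Carlo KL objective in the paper. Both models, the step sizes, and the projection radius are stubs and assumptions.

```python
import torch

def pgd_recover(x_init, model_ft, model_pre, steps=20, lr=0.01, eps=0.1):
    x = x_init.clone().requires_grad_(True)
    for _ in range(steps):
        gap = (model_ft(x) - model_pre(x)).pow(2).mean()   # crude proxy for the KL gap
        grad, = torch.autograd.grad(gap, x)
        with torch.no_grad():
            x += lr * grad.sign()                          # ascend: widen the model gap
            x.clamp_(min=x_init - eps, max=x_init + eps)   # project back to the allowed ball
    return x.detach()

model_pre = torch.nn.Conv2d(3, 3, 3, padding=1)            # stand-in for the pretrained DM
model_ft = torch.nn.Conv2d(3, 3, 3, padding=1)             # stand-in for the fine-tuned DM
x_masked = torch.rand(1, 3, 64, 64)                        # image with partial information removed
print(pgd_recover(x_masked, model_ft, model_pre).shape)
```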
2403.11116 Report PhD: A Prompted Visual Hallucination Evaluation Dataset Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Xirong Li The rapid growth of Large Language Models (LLMs) has driven the development of Large Vision-Language Models (LVLMs). The challenge of hallucination, prevalent in LLMs, also emerges in LVLMs. However, most existing efforts mainly focus on object hallucination in LVLM, ignoring diverse types of LVLM hallucinations. In this study, we delve into the Intrinsic Vision-Language Hallucination (IVL-Hallu) issue, thoroughly analyzing different types of IVL-Hallu on their causes and reflections. Specifically, we propose several novel IVL-Hallu tasks and categorize them into four types: (a) object hallucination, which arises from the misidentification of objects, (b) attribute hallucination, which is caused by the misidentification of attributes, (c) multi-modal conflicting hallucination, which derives from the contradictions between textual and visual information, and (d) counter-common-sense hallucination, which owes to the contradictions between the LVLM knowledge and actual images. Based on these taxonomies, we propose a more challenging benchmark named PhD to evaluate and explore IVL-Hallu. An automated pipeline is proposed for generating different types of IVL-Hallu data. Extensive experiments on five SOTA LVLMs reveal their inability to effectively tackle our proposed IVL-Hallu tasks, with detailed analyses and insights on the origins and possible solutions of these new challenging IVL-Hallu tasks, facilitating future researches on IVL-Hallu and LVLM. The benchmark can be accessed at https://github.com/jiazhen-code/IntrinsicHallu This paper introduces Intrinsic Vision-Language Hallucination (IVLH) and proposes a new benchmark called PHD to evaluate and analyze it in Large Vision-Language Models (LVLMs). Hallucination, a significant issue in LLMs, also affects LVLMs, and existing research primarily focuses on object hallucination. This work aims to comprehensively analyze diverse types of IVLH and their causes. The study categorizes IVLH into four types: object, attribute, multi-modal conflicting, and counter-common-sense hallucinations. It proposes PHD, a benchmark with over 53,000 questions across these categories, and an automated data generation pipeline. LVLMs struggle with identifying non-existent objects and mismatched attributes due to over-reliance on internal knowledge. Absurd questions and misaligned text/image information expose the susceptibility of LVLMs to multi-modal conflicts, leading to hallucinations. Counter-common-sense images reveal the fundamental challenge of LVLMs balancing internal knowledge with actual image content. The benchmark primarily focuses on intrinsic hallucinations, leaving extrinsic hallucinations for future exploration. Addressing IVLH necessitates structural enhancements to LVLMs, balancing multi-modal inputs and internal knowledge with image content. large vision-language models, hallucination, benchmarking, multi-modal learning, vision and language
2403.11111 Report 3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. Then, we train a customized ControlNet model upon this dataset to generate diverse human images and initial ground-truth labels. At the core of this step is that we can easily obtain numerous surface normal images from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. As there exists inevitable noise in the initial labels, we then apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter negative data samples. Our data generation pipeline is flexible and customizable to facilitate different real-world tasks, e.g., ego-centric scenes and perspective-distortion scenes. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on top of the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the generated data. By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection. This paper introduces HumanWild, an automatic and scalable pipeline for synthesizing realistic human images with 3D annotations using generative models, aiming to address the limitations of existing mocap and CG-based datasets in providing diverse and in-the-wild data for 3D human pose and shape estimation (HPS). Existing datasets for HPS, based on either indoor motion capture or computer graphics rendering, lack diversity in human identities and real-world scenes, hindering model generalization to in-the-wild scenarios. The pipeline leverages SMPL-X for human body parameterization, renders surface normal maps, uses ControlNet with tailored text prompts to generate images, and filters noisy labels using a pre-trained segmentation model (SAM). HumanWild effectively complements CG-rendered datasets, leading to improved performance on diverse HPS benchmarks. The pipeline generates a large-scale dataset of 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. Analysis suggests that synthetic data generated by generative models, like HumanWild, is beneficial for HPS tasks due to its diversity and realism. Limitations in current diffusion models affect the accuracy of hand and facial annotations. Future work involves exploring the pipeline's application to other 3D perception tasks, such as 3D animal pose estimation and human interaction reconstruction. synthetic data generation, 3d human pose and shape estimation, diffusion models, controllable image generation, computer vision
2403.11105 Report Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models Ruibin Li, Ruihuang Li, Song Guo, Lei Zhang Text-driven diffusion models have significantly advanced the image editing performance by using text prompts as inputs. One crucial step in text-driven image editing is to invert the original image into a latent noise code conditioned on the source prompt. While previous methods have achieved promising results by refactoring the image synthesizing process, the inverted latent noise code is tightly coupled with the source prompt, limiting the image editability by target text prompts. To address this issue, we propose a novel method called Source Prompt Disentangled Inversion (SPDInv), which aims at reducing the impact of source prompt, thereby enhancing the text-driven image editing performance by employing diffusion models. To make the inverted noise code be independent of the given source prompt as much as possible, we indicate that the iterative inversion process should satisfy a fixed-point constraint. Consequently, we transform the inversion problem into a searching problem to find the fixed-point solution, and utilize the pre-trained diffusion models to facilitate the searching process. The experimental results show that our proposed SPDInv method can effectively mitigate the conflicts between the target editing prompt and the source prompt, leading to a significant decrease in editing artifacts. In addition to text-driven image editing, with SPDInv we can easily adapt customized image generation models to localized editing tasks and produce promising performance. The source code are available at https://github.com/leeruibin/SPDInv. This paper proposes SPDInv, a novel image inversion method for text-driven image editing that disentangles the inverted latent noise code from the source prompt, thereby enhancing editing performance by reducing artifacts and inconsistencies. Existing text-driven image editing methods rely on inversion techniques that tightly couple the inverted latent code with the source prompt, hindering editing flexibility and fidelity. SPDInv leverages the fixed-point constraint inherent in the DDIM sampling process. It reformulates the constraint as a loss function and utilizes pre-trained diffusion models to search for a fixed-point solution, minimizing the influence of the source prompt on the inverted noise. SPDInv effectively reduces the noise gap compared to DDIM inversion, indicating less entanglement with the source prompt. Quantitative evaluations on PIE-Bench and TDE-Bench datasets demonstrate significant improvements in editing quality over state-of-the-art methods. SPDInv successfully extends the capabilities of customized image generation methods, allowing for localized editing while preserving object identity and background consistency. SPDInv relies on existing editing engines like P2P, PNP, and MasaCtrl, inheriting their limitations in handling complex editing operations such as adding or dropping content. While promising for various objects, SPDInv faces challenges in portrait editing, requiring further investigation. image editing, image inversion, diffusion models, text-driven editing, latent space manipulation
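As a concrete illustration of the fixed-point constraint described in the SPDInv entry above: a standard DDIM inversion step has the form z_t = a_t * z_{t-1} + b_t * eps_theta(z_{t-1}, t-1, c), and SPDInv instead searches for a z_t that satisfies the same update evaluated at z_t itself. The sketch below is only a minimal reading of that idea; the helper names (eps_model, ddim_coeffs) and the inner-loop hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch

def spd_fixed_point_step(z_prev, t, cond, eps_model, ddim_coeffs, n_inner=5, lr=1e-2):
    """One inversion step posed as a fixed-point search (illustrative sketch)."""
    a_t, b_t = ddim_coeffs(t)                           # assumed helper returning the two DDIM scalars
    z_t = (a_t * z_prev).detach().requires_grad_(True)  # crude initial guess for the noisier latent
    opt = torch.optim.Adam([z_t], lr=lr)
    for _ in range(n_inner):
        # Fixed-point constraint: z_t should reproduce itself under the DDIM inversion
        # update when the noise is predicted at z_t and timestep t (not at z_{t-1}).
        target = a_t * z_prev + b_t * eps_model(z_t, t, cond)
        loss = (z_t - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z_t.detach()
```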
2403.11056 Report Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration Zhihao Liang, Qi Zhang, Wenbo Hu, Ying Feng, Lei Zhu, Kui Jia The 3D Gaussian Splatting (3DGS) gained its popularity recently by combining the advantages of both primitive-based and volumetric 3D representations, resulting in improved quality and efficiency for 3D scene rendering. However, 3DGS is not alias-free, and its rendering at varying resolutions could produce severe blurring or jaggies. This is because 3DGS treats each pixel as an isolated, single point rather than as an area, causing insensitivity to changes in the footprints of pixels. Consequently, this discrete sampling scheme inevitably results in aliasing, owing to the restricted sampling bandwidth. In this paper, we derive an analytical solution to address this issue. More specifically, we use a conditioned logistic function as the analytic approximation of the cumulative distribution function (CDF) in a one-dimensional Gaussian signal and calculate the Gaussian integral by subtracting the CDFs. We then introduce this approximation in the two-dimensional pixel shading, and present Analytic-Splatting, which analytically approximates the Gaussian integral within the 2D-pixel window area to better capture the intensity response of each pixel. Moreover, we use the approximated response of the pixel window integral area to participate in the transmittance calculation of volume rendering, making Analytic-Splatting sensitive to the changes in pixel footprint at different resolutions. Experiments on various datasets validate that our approach has better anti-aliasing capability that gives more details and better fidelity. This paper introduces Analytic-Splatting, a novel approach for anti-aliasing in 3D Gaussian Splatting (3DGS) using an analytical approximation of the Gaussian integral within the pixel window area. 3DGS suffers from aliasing artifacts due to its discrete sampling scheme, especially when pixel footprints change drastically at different resolutions. This leads to blurry or jagged renderings. Analytic-Splatting aims to overcome these limitations by considering the entire pixel area for intensity response. The method utilizes a conditioned logistic function to approximate the cumulative distribution function (CDF) of a one-dimensional Gaussian signal. This approximation is then extended to two dimensions for pixel shading by diagonalizing the covariance matrix and rotating the integration domain to decouple correlations. Analytic-Splatting demonstrates superior anti-aliasing capabilities compared to 3DGS and other methods, producing renderings with better detail fidelity. The proposed analytic approximation significantly reduces errors compared to discrete sampling and prefiltering techniques. Experiments on multi-scale Blender Synthetic and Mip-NeRF 360 datasets validate the effectiveness of Analytic-Splatting in achieving state-of-the-art novel view synthesis results under multi-scale and super-resolution settings. The increased number of root and exponential operations in the shading module slightly reduces rendering speed compared to 3DGS and Mip-Splatting. Future work could explore more efficient implementations and applications of the analytic approximation in other areas of neural rendering. 3d gaussian splatting, anti-aliasing, view synthesis, cumulative distribution function (cdf), analytic approximation
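The 1D building block behind Analytic-Splatting can be illustrated in a few lines of NumPy: the Gaussian mass inside a pixel window is obtained by differencing an approximate CDF rather than sampling the pixel center. The scaling constant 1.702 below is the classic logistic-to-probit approximation; the paper's "conditioned" logistic uses its own derived parameterization, so treat this as a sketch of the idea rather than the exact formula.

```python
import numpy as np
from math import erf, sqrt

def logistic_cdf(x, k=1.702):
    # Logistic approximation to the standard normal CDF (k = 1.702 is the classic choice).
    return 1.0 / (1.0 + np.exp(-k * x))

def pixel_mass_approx(a, b, mu=0.0, sigma=1.0):
    # Integral of N(mu, sigma^2) over the pixel window [a, b] via CDF subtraction.
    return logistic_cdf((b - mu) / sigma) - logistic_cdf((a - mu) / sigma)

def pixel_mass_exact(a, b, mu=0.0, sigma=1.0):
    return 0.5 * (erf((b - mu) / (sigma * sqrt(2))) - erf((a - mu) / (sigma * sqrt(2))))

# A one-pixel-wide window partially covering a narrow Gaussian: the windowed response
# shrinks as sigma shrinks, which point sampling at the pixel center fails to capture.
print(pixel_mass_approx(-0.5, 0.5, mu=0.3, sigma=0.4),
      pixel_mass_exact(-0.5, 0.5, mu=0.3, sigma=0.4))
```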
2403.11053 Report OSTAF: A One-Shot Tuning Method for Improved Attribute-Focused T2I Personalization Ye Wang, Zili Yi, Rui Ma Personalized text-to-image (T2I) models not only produce lifelike and varied visuals but also allow users to tailor the images to fit their personal taste. These personalization techniques can grasp the essence of a concept through a collection of images, or adjust a pre-trained text-to-image model with a specific image input for subject-driven or attribute-aware guidance. Yet, accurately capturing the distinct visual attributes of an individual image poses a challenge for these methods. To address this issue, we introduce OSTAF, a novel parameter-efficient one-shot fine-tuning method which only utilizes one reference image for T2I personalization. A novel hypernetwork-powered attribute-focused fine-tuning mechanism is employed to achieve the precise learning of various attribute features (e.g., appearance, shape or drawing style) from the reference image. Comparing to existing image customization methods, our method shows significant superiority in attribute identification and application, as well as achieves a good balance between efficiency and output quality. This paper introduces OSTAF, a one-shot fine-tuning method for attribute-focused text-to-image personalization using only one reference image. Current personalized T2I models struggle to accurately separate and replicate distinct visual attributes from a single image, limiting attribute-focused customization. The method analyzes how different parts of the diffusion U-net learn attributes and uses a lightweight hypernetwork to guide the fine-tuning of specific U-net components based on the desired attribute (appearance, shape, or style). OSTAF outperforms existing methods in quantitative metrics like CLIP-T, IoU, and Gram matrix distance, demonstrating superior attribute customization. Qualitative results showcase OSTAF's ability to accurately identify and apply attributes across domains while maintaining text controllability. User studies confirm that OSTAF generates customized images that better align with user preferences compared to other methods. While efficient in terms of data, fine-tuning time is comparable to other computationally intensive methods and could be improved. The method is currently limited to image inputs and could be expanded to video for more dynamic attribute customization. text-to-image synthesis, image personalization, attribute customization, one-shot learning, hypernetworks
2403.11027 Report Reward Guided Latent Consistency Distillation Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM's efficient inference is obtained at the cost of the sample quality. In this paper, we propose compensating the quality loss by aligning LCM's output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with LCM's single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25 times inference acceleration without quality loss. As directly optimizing towards differentiable RMs can suffer from over-optimization, we overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on HPSv2's test set, surpassing those achieved by the baseline LCM. This paper proposes Reward Guided Latent Consistency Distillation (RG-LCD), which integrates feedback from a reward model into latent consistency distillation so that the distilled model's few-step generations align with human preference. Latent consistency models enable 2-4 step text-to-image inference but sacrifice sample quality relative to the teacher LDM; aligning the LCM's output with human preference during training compensates for this quality loss. The original LCD loss is augmented with the objective of maximizing the reward of the LCM's single-step generation; to avoid over-optimization towards differentiable RMs, a latent proxy RM (LRM) is introduced as an intermediary between the LCM and the RM. When trained with the feedback of a good RM, 2-step generations from RG-LCM are favored by humans over 50-step DDIM samples from the teacher LDM, a 25x inference acceleration without quality loss. Incorporating the LRM avoids high-frequency noise in the generated images. It also improves FID on MS-COCO and the HPSv2.1 score on HPSv2's test set over the baseline LCM. The gains depend on the quality of the reward model providing feedback. Directly optimizing towards differentiable RMs risks reward over-optimization, which motivates the latent proxy RM. latent consistency distillation, reward model, text-to-image synthesis, diffusion models, human preference alignment
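A hedged sketch of the training objective described in the RG-LCD abstract: the standard LCD loss is augmented with a reward term on the LCM's single-step generation, with the reward evaluated by a proxy model operating in latent space so gradients need not pass through the VAE decoder. All module names (lcm, teacher_lcd_loss, latent_reward_model) and the weight lambda_r are placeholders, not the released implementation.

```python
import torch

def rg_lcd_loss(lcm, z_t, t, cond, teacher_lcd_loss, latent_reward_model, lambda_r=1.0):
    # Standard latent consistency distillation term (placeholder callable).
    loss_lcd = teacher_lcd_loss(lcm, z_t, t, cond)

    # Single-step generation from the student LCM, scored by a reward model that
    # operates directly on latents (the "latent proxy RM" idea); maximizing the
    # reward corresponds to subtracting it from the loss.
    z0_pred = lcm(z_t, t, cond)
    reward = latent_reward_model(z0_pred, cond).mean()

    return loss_lcd - lambda_r * reward
```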
2403.10983 Report OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, Wenhan Luo Personalization is an important topic in text-to-image generation, especially the challenging multi-concept personalization. Current multi-concept methods are struggling with identity preservation, occlusion, and the harmony between foreground and background. In this work, we propose OMG, an occlusion-friendly personalized generation framework designed to seamlessly integrate multiple concepts within a single image. We propose a novel two-stage sampling solution. The first stage takes charge of layout generation and visual comprehension information collection for handling occlusions. The second one utilizes the acquired visual comprehension information and the designed noise blending to integrate multiple concepts while considering occlusions. We also observe that the initiation denoising timestep for noise blending is the key to identity preservation and layout. Moreover, our method can be combined with various single-concept models, such as LoRA and InstantID without additional tuning. Especially, LoRA models on civitai.com can be exploited directly. Extensive experiments demonstrate that OMG exhibits superior performance in multi-concept personalization. This paper proposes OMG, an occlusion-friendly framework for personalized multi-concept text-to-image generation that seamlessly integrates multiple concepts within a single image. Existing multi-concept personalization methods struggle with identity preservation, occlusion, and the harmony between foreground and background. A two-stage sampling solution first generates the layout and collects visual comprehension information for handling occlusions, then integrates the concepts via noise blending; the initial denoising timestep for blending is identified as the key to identity preservation and layout. OMG achieves superior multi-concept personalization and combines with single-concept models such as LoRA and InstantID without additional tuning, directly exploiting LoRA models from civitai.com. Combining the method with ControlNet under various conditions (human pose, canny edge, depth maps) and with different style LoRAs demonstrates its versatility in layout control and style manipulation. Layout preservation is crucial for maintaining image structure and quality during multi-concept customization. Generating high-quality small-face regions can be challenging due to information loss in VAE. Computational intensity, particularly with noise fusion from multiple single-concept models, can lead to slower generation speed. image customization, multi-concept generation, layout preservation, identity preservation, controlnet
2403.10953 Report Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground truth poses and appearances, even on the training set. This consequently limits the performance of downstream tasks, such as image-to-multiview generation and 3D reconstruction. We realize that such inconsistency is largely due to the fact that it is difficult to enforce accurate pose and appearance alignment directly in the diffusion training, as mostly done by existing methods such as Zero123. To remedy this problem, we propose Ctrl123, a closed-loop transcription-based NVS diffusion method that enforces alignment between the generated view and ground truth in a pose-sensitive feature space. Our extensive experiments demonstrate the effectiveness of Ctrl123 on the tasks of NVS and 3D reconstruction, achieving significant improvements in both multiview-consistency and pose-consistency over existing methods. Introduces Ctrl123, a closed-loop transcription-based novel view synthesis diffusion model, to improve the pose and appearance consistency of generated views. Existing diffusion-based NVS methods struggle to generate views consistent with ground truth poses and appearances, limiting performance in tasks like 3D reconstruction. Extends open-loop NVS models to a closed-loop framework, measuring and minimizing the difference between generated and ground truth views in a pose-sensitive latent space using patch features. Significantly improves NVS pose and appearance consistency even with fewer training steps. Achieves a 7 point increase in PSNR and substantial improvements in AA and IoU metrics compared to Zero123. Demonstrates superior 3D reconstruction quality with smooth surfaces and detailed geometry. Exploring alternative latent space representations for enhanced consistency. Investigating the generalization of the closed-loop framework for ensuring consistency in other attributes like object relations and shapes. novel view synthesis, diffusion models, closed-loop transcription, pose consistency, 3d reconstruction
2403.10935 Report Understanding Robustness of Visual State Space Models for Image Classification Chengbin Du, Yanxi Li, Chang Xu Visual State Space Model (VMamba) has recently emerged as a promising architecture, exhibiting remarkable performance in various computer vision tasks. However, its robustness has not yet been thoroughly studied. In this paper, we delve into the robustness of this architecture through comprehensive investigations from multiple perspectives. Firstly, we investigate its robustness to adversarial attacks, employing both whole-image and patch-specific adversarial attacks. Results demonstrate superior adversarial robustness compared to Transformer architectures while revealing scalability weaknesses. Secondly, the general robustness of VMamba is assessed against diverse scenarios, including natural adversarial examples, out-of-distribution data, and common corruptions. VMamba exhibits exceptional generalizability with out-of-distribution data but shows scalability weaknesses against natural adversarial examples and common corruptions. Additionally, we explore VMamba's gradients and back-propagation during white-box attacks, uncovering unique vulnerabilities and defensive capabilities of its novel components. Lastly, the sensitivity of VMamba to image structure variations is examined, highlighting vulnerabilities associated with the distribution of disturbance areas and spatial information, with increased susceptibility closer to the image center. Through these comprehensive studies, we contribute to a deeper understanding of VMamba's robustness, providing valuable insights for refining and advancing the capabilities of deep neural networks in computer vision applications. This paper presents a comprehensive analysis of the robustness of the Visual State Space Model (VMamba), a promising architecture for visual representation learning. Despite its successes in various computer vision tasks, the robustness of VMamba, a novel architecture, has not been thoroughly studied. The authors investigate VMamba's robustness to adversarial attacks (both whole-image and patch-specific), its performance on various ImageNet datasets (A, R, and C), the behavior of its novel components (parameters A, B, C, and Δ) under white-box attacks, and its sensitivity to image structure variations. VMamba exhibits superior adversarial robustness compared to Transformer architectures but shows scalability weaknesses. VMamba demonstrates exceptional generalizability with out-of-distribution data but shows scalability weaknesses against natural adversarial examples and common corruptions. VMamba is highly sensitive to the spatial information and continuity of images, with increased susceptibility closer to the image center. The study primarily focuses on a limited set of VMamba and Transformer models, which may not fully represent the entire spectrum of model variations. Future work can explore the development of specialized defense mechanisms tailored to the unique characteristics of VMamba's architecture, such as adaptive scanning strategies and robust feature extraction techniques. visual state space model, vmamba, robustness, adversarial attacks, image classification
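For the whole-image white-box evaluations mentioned in the VMamba robustness entry above, a standard choice is L-infinity PGD; the sketch below shows the generic attack loop. The budget, step size, and step count are common defaults, not necessarily the paper's settings.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Standard L-infinity PGD for whole-image white-box robustness evaluation.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the eps-ball around the clean image and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```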
2403.10906 Report HourglassNeRF: Casting an Hourglass as a Bundle of Rays for Few-shot Neural Rendering Seunghyeon Seo, Yeonjin Chang, Jayeon Yoo, Seungwoo Lee, Hojun Lee, Nojun Kwak Recent advancements in the Neural Radiance Field (NeRF) have bolstered its capabilities for novel view synthesis, yet its reliance on dense multi-view training images poses a practical challenge. Addressing this, we propose HourglassNeRF, an effective regularization-based approach with a novel hourglass casting strategy. Our proposed hourglass is conceptualized as a bundle of additional rays within the area between the original input ray and its corresponding reflection ray, by featurizing the conical frustum via Integrated Positional Encoding (IPE). This design expands the coverage of unseen views and enables an adaptive high-frequency regularization based on target pixel photo-consistency. Furthermore, we propose luminance consistency regularization based on the Lambertian assumption, which is known to be effective for training a set of augmented rays under the few-shot setting. Leveraging the inherent property of a Lambertian surface, which retains consistent luminance irrespective of the viewing angle, we assume our proposed hourglass as a collection of flipped diffuse reflection rays and enhance the luminance consistency between the original input ray and its corresponding hourglass, resulting in more physically grounded training framework and performance improvement. Our HourglassNeRF outperforms its baseline and achieves competitive results on multiple benchmarks with sharply rendered fine details. The code will be available. HourglassNeRF, a novel regularization-based method for few-shot neural rendering that employs an hourglass casting strategy. Addresses the challenge of NeRF's reliance on dense multi-view training images by introducing a novel ray augmentation and regularization technique. 1. Casts an hourglass as a bundle of additional rays within the conical frustum, featurized using Integrated Positional Encoding (IPE). 2. Applies adaptive high-frequency regularization based on target pixel photo-consistency. 3. Introduces luminance consistency regularization based on the Lambertian assumption. Outperforms baseline methods and achieves state-of-the-art results on Realistic Synthetic 360° dataset. Renders sharper fine details from earlier training stages compared to methods relying on fixed high-frequency masking. Demonstrates competitive performance on DTU dataset without relying on dataset-specific priors. Limited consideration of surface properties, assuming all reflections as diffuse even on shiny surfaces. Future work could explore adaptive use of specular and diffuse reflections based on estimated surface texture. neural radiance field, few-shot neural rendering, ray augmentation, hourglass casting, luminance consistency
2403.10854 Report A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang While Multimodal Large Language Models (MLLMs) have experienced significant advancement on visual understanding and reasoning, their potentials to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. Specifically, we first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting systems. We assess three open-source and one close-source MLLMs on several visual attributes of image quality (e.g., structural and textural distortions, color differences, and geometric transformations) in both full-reference and no-reference scenarios. Experimental results show that only the close-source GPT-4V provides a reasonable account for human perception of image quality, but is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing visual quality of multiple images, tasks humans can perform effortlessly. This paper presents a comprehensive study of prompting Multimodal Large Language Models (MLLMs) for Image Quality Assessment (IQA), exploring different prompting strategies and their effectiveness in evaluating image quality. This study is important because it investigates the potential of MLLMs to serve as powerful, flexible, interpretable, and text-driven models for IQA, a task that traditional IQA methods struggle with. The authors systematically combine psychophysical testing procedures (single-stimulus, double-stimulus, and multiple-stimulus methods) with NLP prompting strategies (standard, in-context, and chain-of-thought prompting) to create nine prompting systems. They also propose a difficult sample selection procedure to challenge the MLLMs. The optimal prompting system varies between open-source and close-source MLLMs. Only the close-source GPT-4V provides reasonable IQA performance, but still struggles with fine-grained quality variations and multiple-image comparison. Chain-of-thought prompting consistently improves GPT-4V's performance across different testing protocols and visual attributes. The textual responses from MLLMs were not quantitatively assessed. The study focuses on prompting and doesn't explore instruction tuning of MLLMs for enhanced IQA performance. image quality assessment, multimodal large language models, prompt engineering, psychophysics, benchmarking
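The nine prompting systems in the IQA study above are simply the Cartesian product of the three psychophysical testing procedures and the three NLP prompting strategies; a trivial enumeration makes the experimental grid explicit (names taken from the abstract, enumeration code is illustrative only).

```python
from itertools import product

procedures = ["single-stimulus", "double-stimulus", "multiple-stimulus"]
strategies = ["standard", "in-context", "chain-of-thought"]

# The nine prompting systems are all procedure x strategy combinations.
prompting_systems = [f"{proc} + {strat}" for proc, strat in product(procedures, strategies)]
for name in prompting_systems:
    print(name)
```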
2403.10801 Report Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples Ziqi Zhou, Minghui Li, Wei Liu, Shengshan Hu, Yechao Zhang, Wei Wan, Lulu Xue, Leo Yu Zhang, Dezhong Yao, Hai Jin With the evolution of self-supervised learning, the pre-training paradigm has emerged as a predominant solution within the deep learning landscape. Model providers furnish pre-trained encoders designed to function as versatile feature extractors, enabling downstream users to harness the benefits of expansive models with minimal effort through fine-tuning. Nevertheless, recent works have exposed a vulnerability in pre-trained encoders, highlighting their susceptibility to downstream-agnostic adversarial examples (DAEs) meticulously crafted by attackers. The lingering question pertains to the feasibility of fortifying the robustness of downstream models against DAEs, particularly in scenarios where the pre-trained encoders are publicly accessible to the attackers. In this paper, we initially delve into existing defensive mechanisms against adversarial examples within the pre-training paradigm. Our findings reveal that the failure of current defenses stems from the domain shift between pre-training data and downstream tasks, as well as the sensitivity of encoder parameters. In response to these challenges, we propose Genetic Evolution-Nurtured Adversarial Fine-tuning (Gen-AF), a two-stage adversarial fine-tuning approach aimed at enhancing the robustness of downstream models. Our extensive experiments, conducted across ten self-supervised training methods and six datasets, demonstrate that Gen-AF attains high testing accuracy and robust testing accuracy against state-of-the-art DAEs. Gen-AF, a novel genetic evolution-nurtured adversarial fine-tuning approach, enhances downstream model robustness against DAEs while preserving generalization ability. Pre-trained encoders are vulnerable to DAEs, jeopardizing downstream tasks. Existing defenses are ineffective due to domain shift and encoder sensitivity. Two-stage approach: 1) Genetic-driven dual-track adversarial fine-tuning with bilevel optimization and genetic regularization. 2) Evolutionary adaptability fine-tuning, targeting robust-redundant layers. Gen-AF achieves high robust testing accuracy against five SOTA DAEs across ten SSL methods, two pre-training datasets, and six downstream datasets. Maintains or improves generalization compared to standard training. Effectively defends against backdoor attacks targeting pre-trained encoders. Exploration of other types of adversarial examples. Investigation of more efficient fine-tuning strategies to further reduce computational overhead. adversarial machine learning, deep learning, self-supervised learning, transfer learning, adversarial examples
2403.10783 Report StableGarment: Garment-Centric Generation via Stable Diffusion Rui Wang, Hailong Guo, Jiaming Liu, Huaxia Li, Haibo Zhao, Xu Tang, Yao Hu, Hao Tang, Peipei Li In this paper, we introduce StableGarment, a unified framework to tackle garment-centric(GC) generation tasks, including GC text-to-image, controllable GC text-to-image, stylized GC text-to-image, and robust virtual try-on. The main challenge lies in retaining the intricate textures of the garment while maintaining the flexibility of pre-trained Stable Diffusion. Our solution involves the development of a garment encoder, a trainable copy of the denoising UNet equipped with additive self-attention (ASA) layers. These ASA layers are specifically devised to transfer detailed garment textures, also facilitating the integration of stylized base models for the creation of stylized images. Furthermore, the incorporation of a dedicated try-on ControlNet enables StableGarment to execute virtual try-on tasks with precision. We also build a novel data engine that produces high-quality synthesized data to preserve the model's ability to follow prompts. Extensive experiments demonstrate that our approach delivers state-of-the-art (SOTA) results among existing virtual try-on methods and exhibits high flexibility with broad potential applications in various garment-centric image generation. Proposed StableGarment, a unified framework tackling various garment-centric generation tasks, including text-to-image, controllable generation, stylized generation, and robust virtual try-on. Addresses limitations of existing virtual try-on methods and enables the creation of diverse product visuals (e.g., posters, display images) with accurate garment details and flexible image modifications. Leverages a garment encoder with additive self-attention for detailed texture transfer, a try-on ControlNet for precise virtual try-on, and a data engine producing synthesized data to enhance prompt following. Achieves state-of-the-art performance among virtual try-on methods. Demonstrates high flexibility in garment-centric image generation with various text prompts, control signals, and stylized base models. Outperforms existing methods in preserving intricate garment details, such as patterns and text. Limitations in VAE reconstruction affecting garment detail preservation. Occasional generation of incorrect accessories due to inaccurate parsing conditions (garment masks, DensePose). virtual try-on, text-to-image synthesis, diffusion models, garment-centric generation, stable diffusion
2403.10731 Report Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation Anton Pelykh, Ozge Mercanoglu Sincan, Richard Bowden Recent years have seen significant progress in human image generation, particularly with the advancements in diffusion models. However, existing diffusion methods encounter challenges when producing consistent hand anatomy and the generated images often lack precise control over the hand pose. To address this limitation, we introduce a novel approach to pose-conditioned human image generation, dividing the process into two stages: hand generation and subsequent body outpainting around the hands. We propose training the hand generator in a multi-task setting to produce both hand images and their corresponding segmentation masks, and employ the trained model in the first stage of generation. An adapted ControlNet model is then used in the second stage to outpaint the body around the generated hands, producing the final result. A novel blending technique is introduced to preserve the hand details during the second stage that combines the results of both stages in a coherent way. This involves sequential expansion of the outpainted region while fusing the latent representations, to ensure a seamless and cohesive synthesis of the final image. Experimental evaluations demonstrate the superiority of our proposed method over state-of-the-art techniques, in both pose accuracy and image quality, as validated on the HaGRID dataset. Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation. The source code of the proposed approach is available at https://github.com/apelykh/hand-to-diffusion. This paper proposes a novel two-stage diffusion-based approach for human image generation that addresses the challenge of generating high-quality hands with precise pose control. Existing diffusion models often struggle to generate realistic and anatomically correct hands, particularly when precise pose control is desired. This limitation hinders their applicability in areas like advertising and game character creation. The proposed method first generates hands and their segmentation masks using a multi-task diffusion model. Then, it employs an adapted ControlNet model to outpaint the body around the generated hands, guided by the skeleton pose. A novel blending technique with sequential mask expansion ensures seamless integration of hands and body. The method achieves state-of-the-art results in pose accuracy, outperforming baselines by a significant margin in terms of DAP and MPJPE for both full body and hand keypoints. Qualitative and quantitative evaluations on the HaGRID dataset demonstrate superior image quality with realistic and anatomically correct hands. The sequential mask expansion blending strategy effectively preserves hand details while ensuring seamless transitions between the generated regions, as shown in the ablation study. The approach assumes connectivity between arms and wrists in the input pose, potentially leading to discontinuities if arm keypoints are missing. Generating high-quality small hands is challenging due to the limited resolution of the latent space. human image generation, diffusion models, pose control, hand generation, multi-task learning
2403.10701 Report IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga Generative object compositing emerges as a promising new avenue for compositional image editing. However, the requirement of object identity preservation poses a significant challenge, limiting practical usage of most existing methods. In response, this paper introduces IMPRINT, a novel diffusion-based generative model trained with a two-stage learning framework that decouples learning of identity preservation from that of compositing. The first stage is targeted for context-agnostic, identity-preserving pretraining of the object encoder, enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited to the background. In addition, IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines on identity preservation and composition quality. Introduces IMPRINT, a two-stage diffusion-based generative model for object compositing that decouples identity preservation from compositing, enhancing object detail fidelity and background harmonization. Addresses the limitations of existing generative object compositing methods that struggle to balance identity preservation with seamless integration into backgrounds. Employs a two-stage learning framework: 1) context-agnostic, identity-preserving pretraining of an object encoder on multi-view data and 2) fine-tuning the model for compositing, leveraging the learned representations for harmonization. Significantly outperforms existing methods in identity preservation and composition quality on benchmark datasets. Demonstrates superior appearance preservation through a novel context-agnostic training approach. Incorporates a shape-guidance mechanism for user-directed control over the compositing process. Identity preservation may degrade with large viewpoint changes, requiring further exploration of 3D representations. Consistency of small details like text and logos can be improved, potentially through higher resolution encoders and improved latent space representation. image compositing, generative models, diffusion models, identity preservation, shape guidance
2403.10615 Report LightIt: Illumination Modeling and Control for Diffusion Models Peter Kocsis, Julien Philip, Kalyan Sunkavalli, Matthias Nießner, Yannick Hold-Geoffroy We introduce LightIt, a method for explicit illumination control for image generation. Recent generative methods lack lighting control, which is crucial to numerous artistic aspects of image generation such as setting the overall mood or cinematic appearance. To overcome these limitations, we propose to condition the generation on shading and normal maps. We model the lighting with single bounce shading, which includes cast shadows. We first train a shading estimation module to generate a dataset of real-world images and shading pairs. Then, we train a control network using the estimated shading and normals as input. Our method demonstrates high-quality image generation and lighting control in numerous scenes. Additionally, we use our generated dataset to train an identity-preserving relighting model, conditioned on an image and a target shading. Our method is the first that enables the generation of images with controllable, consistent lighting and performs on par with specialized relighting state-of-the-art methods. Introduces LightIt, a method for explicit illumination control in image generation using single-bounce shading and normal maps as conditioning signals for diffusion models. Recent generative methods lack explicit lighting control, which is crucial for artistic aspects of image generation like mood and realism. Trains a shading estimation module to generate a paired image-shading dataset from panoramas, then trains a control network using estimated shading and normals to guide a pre-trained diffusion model (Stable Diffusion). Generates images with controllable and consistent lighting across diverse text prompts and styles. Enables novel lighting scenarios for both real and generated images. Outperforms specialized relighting methods in terms of generalization and realism. Assumes directional lighting, limiting applicability to outdoor scenes. Relies on estimated lighting directions, hindering training on larger, unconstrained datasets. image generation, illumination control, diffusion models, shading estimation, relighting
2403.10520 Report Strong and Controllable Blind Image Decomposition Zeyu Zhang, Junlin Han, Chenhui Gou, Hongdong Li, Liang Zheng Blind image decomposition aims to decompose all components present in an image, typically used to restore a multi-degraded input image. While fully recovering the clean image is appealing, in some scenarios, users might want to retain certain degradations, such as watermarks, for copyright protection. To address this need, we add controllability to the blind image decomposition process, allowing users to enter which types of degradation to remove or retain. We design an architecture named controllable blind image decomposition network. Inserted in the middle of U-Net structure, our method first decomposes the input feature maps and then recombines them according to user instructions. Advantageously, this functionality is implemented at minimal computational cost: decomposition and recombination are all parameter-free. Experimentally, our system excels in blind image decomposition tasks and can outputs partially or fully restored images that well reflect user intentions. Furthermore, we evaluate and configure different options for the network structure and loss functions. This, combined with the proposed decomposition-and-recombination method, yields an efficient and competitive system for blind image decomposition, compared with current state-of-the-art methods. This paper introduces controllability to blind image decomposition (BID), enabling users to selectively remove or retain image components based on their needs. It addresses the limitations of existing BID methods that lack controllability and flexibility in handling user-specific preferences for image restoration. This makes image processing more aligned with real-world scenarios where users may want to keep certain degradations, like watermarks for copyright. The paper presents CBDNet, a U-Net-based architecture with a decomposition block, a controllability block, and a recombination block. The decomposition block splits the feature map into components. The controllability block predicts the components present and allows for user prompt input. The recombination block blends selected components based on the prompt. CBDNet achieves state-of-the-art performance on standard BID tasks, outperforming existing methods in both efficiency and accuracy. CBDNet effectively performs controllable BID, removing or retaining components based on user prompts. The authors create a new multi-domain degradation removal dataset to support research on controllable BID with nine degradation types across weather, lighting, and obstruction domains. CBDNet's in-painting capability has limitations, especially in heavily obscured areas. Further research is needed to improve the robustness of the source classifier in corner cases. image decomposition, low-level vision, controllable image processing, image restoration, rain removal
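The abstract above states that CBDNet's decomposition and recombination are parameter-free and driven by a user prompt indicating which degradation components to remove or retain. The sketch below is only one plausible reading of that interface (channel-wise splitting, prompt-gated summation); the paper's exact decomposition scheme may differ, and all names here are assumptions.

```python
import torch

def decompose(feat, num_components):
    # One plausible parameter-free decomposition: split the mid-U-Net feature map
    # into equal channel groups, one per degradation component.
    return torch.chunk(feat, num_components, dim=1)

def recombine(components, keep_mask):
    # Recombine only the components the user asked to retain; zero out the rest so
    # the tensor shape seen by the remaining U-Net layers is unchanged.
    kept = [c if keep else torch.zeros_like(c) for c, keep in zip(components, keep_mask)]
    return torch.cat(kept, dim=1)

feat = torch.randn(1, 64, 32, 32)                             # toy feature map
parts = decompose(feat, num_components=4)
out = recombine(parts, keep_mask=[True, False, True, True])   # e.g. drop component 1
```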
2403.10427 Report SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians Hiba Dahmani, Moussab Bennehar, Nathan Piasco, Luis Roldao, Dzmitry Tsishkou Implicit neural representation methods have shown impressive advancements in learning 3D scenes from unstructured in-the-wild photo collections but are still limited by the large computational cost of volumetric rendering. More recently, 3D Gaussian Splatting emerged as a much faster alternative with superior rendering quality and training efficiency, especially for small-scale and object-centric scenarios. Nevertheless, this technique suffers from poor performance on unstructured in-the-wild data. To tackle this, we extend over 3D Gaussian Splatting to handle unstructured image collections. We achieve this by modeling appearance to seize photometric variations in the rendered images. Additionally, we introduce a new mechanism to train transient Gaussians to handle the presence of scene occluders in an unsupervised manner. Experiments on diverse photo collection scenes and multi-pass acquisition of outdoor landmarks show the effectiveness of our method over prior works achieving state-of-the-art results with improved efficiency. SWAG, a novel 3D Gaussian Splatting (3DGS)-based method for 3D scene reconstruction from in-the-wild photo collections, effectively handling appearance variations and occluders. Existing implicit neural representation methods for 3D scene reconstruction struggle with the high computational cost of volumetric rendering, particularly in challenging in-the-wild scenarios. SWAG introduces image-dependent embeddings to modulate Gaussian colors, capturing appearance variations. It also learns image-dependent opacity variations for each Gaussian, allowing for unsupervised handling of transient objects. SWAG achieves state-of-the-art results on the Phototourism dataset and NeRF-OSR benchmark. It significantly outperforms 3DGS in in-the-wild settings, with an average PSNR improvement of 5 dB. SWAG maintains real-time rendering capabilities while significantly reducing training time compared to implicit methods. Transient object removal, while generally effective, can lead to minor artifacts in areas with frequent occlusions. Future work could explore per-scene hyperparameter tuning and extension to dynamic scenes. 3d gaussian splatting, unconstrained photo collection, novel view synthesis, appearance modeling, real-time rendering
2403.10395 Report Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding Pengkun Liu, Yikai Wang, Fuchun Sun, Jiafang Li, Hang Xiao, Hongxiang Xue, Xinzhou Wang Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation by leveraging Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from 2D diffusion models which usually take the reference image as a condition while applying hard L2 image supervision at the reference view. Yet heavily adhering to the image is prone to corrupting the inductive knowledge of the 2D diffusion model leading to flat or distorted 3D generation frequently. In this work, we reexamine image-to-3D in a novel perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth angle by solely resting on the SDS loss. The core of our framework lies in a two-stage diffusion model fine-tuning. Firstly, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, by which the model preliminarily acquires image-to-image capabilities. Secondly, we perform fine-tuning using our Explicit Multi-view Attention (EMA) which combines noisy multi-view images with the noise-free reference image as an explicit condition. CLIP embedding is sent to the diffusion model throughout the whole process while reference images are discarded once after fine-tuning. As a result, with a single image CLIP embedding, Isotropic3D is capable of generating multi-view mutually consistent images and also a 3D model with more symmetrical and neat content, well-proportioned geometry, rich colored texture, and less distortion compared with existing image-to-3D methods while still preserving the similarity to the reference image to a large extent. The project page is available at https://isotropic3d.github.io/. The code and models are available at https://github.com/pkunliu/Isotropic3D. Isotropic3D is a novel image-to-3D generation pipeline that takes only an image CLIP embedding as input, allowing for isotropic optimization with respect to the azimuth angle using only the SDS loss, resulting in more symmetrical and neat 3D content. Existing image-to-3D methods heavily rely on reference images, leading to issues like 3D distortion, multi-face problems, and multi-view inconsistency. This work aims to leverage the power of 2D diffusion models without compromising the generation process by hard image supervision. Isotropic3D utilizes a two-stage fine-tuning of a text-to-3D diffusion model. Firstly, it substitutes the text encoder with an image encoder to enable image-to-image capabilities. Secondly, it introduces Explicit Multi-view Attention (EMA) to fine-tune the model using noisy multi-view images and a noise-free reference image, allowing the reference image to be discarded during the 3D generation stage. Isotropic3D generates high-quality 3D models with rich color and well-proportioned geometry from a single image CLIP embedding. The method is robust to the object pose of the reference image. Generated 3D content exhibits a high degree of consistency with the reference image. The resolution of the rendered 3D content is limited by the training data resolution. The model's performance on faces requires further improvement. image-to-3d, clip embedding, multi-view attention, score distillation sampling, neural radiance fields
2403.10336 Report How Powerful Potential of Attention on Image Restoration? Cong Wang, Jinshan Pan, Yeying Jin, Liyan Wang, Wei Wang, Gang Fu, Wenqi Ren, Xiaochun Cao Transformers have demonstrated their effectiveness in image restoration tasks. Existing Transformer architectures typically comprise two essential components: multi-head self-attention and feed-forward network (FFN). The former captures long-range pixel dependencies, while the latter enables the model to learn complex patterns and relationships in the data. Previous studies have demonstrated that FFNs are key-value memories (Geva et al., 2020), which are vital in modern Transformer architectures. In this paper, we conduct an empirical study to explore the potential of attention mechanisms without using FFN and provide novel structures to demonstrate that removing FFN is flexible for image restoration. Specifically, we propose Continuous Scaling Attention (CSAttn), a method that computes attention continuously in three stages without using FFN. To achieve competitive performance, we propose a series of key components within the attention. Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect the model performance. We apply our CSAttn to several image restoration tasks and show that our model can outperform CNN-based and Transformer-based image restoration approaches. This paper proposes Continuous Scaling Attention (CSAttn), a novel attention mechanism for image restoration that achieves competitive performance without relying on feed-forward networks (FFN) typically found in Transformer architectures. Existing Transformer architectures heavily depend on FFN after the attention computation. This work challenges this norm by exploring the potential of solely using attention mechanisms for image restoration, aiming for a more efficient and potentially more effective solution. The CSAttn block employs three consecutive attention computations, enhanced by several key designs: Continuous Attention Learning, Spatial Scaling Learning, Value Nonlinear Transformation Adjustment, Nonlinear Activation Function, Intra Attention Aggregation, Intra Progressive More Heads, and Intra Residual Connections. Each of these components contributes to scaling up the attention capacity for achieving superior performance. CSAttn outperforms state-of-the-art approaches on image deraining, achieving an average PSNR improvement of 0.41 dB over the best competitor. CSAttn demonstrates superior performance on image desnowing, surpassing previous state-of-the-art methods on both CSD and Snow100K benchmarks. CSAttn achieves significant improvements on low-light image enhancement (LOL dataset) and real image dehazing (Dense-Haze and NH-Haze datasets), outperforming recent state-of-the-art methods. The study primarily focuses on exploring the potential of attention without FFN within a specific network architecture (similar to SFNet). Investigating its effectiveness when integrated with other architectures would be beneficial. Further research on exploring the combination of continuous attention learning with other efficient designs could potentially lead to even better performance. image restoration, continuous scaling attention, transformer, attention mechanism, feed-forward network
2403.10335 Report NECA: Neural Customizable Human Avatar Junjin Xiao, Qing Zhang, Zhan Xu, Wei-Shi Zheng Human avatar has become a novel type of 3D asset with various applications. Ideally, a human avatar should be fully customizable to accommodate different settings and environments. In this work, we introduce NECA, an approach capable of learning versatile human representation from monocular or sparse-view videos, enabling granular customization across aspects such as pose, shadow, shape, lighting and texture. The core of our approach is to represent humans in complementary dual spaces and predict disentangled neural fields of geometry, albedo, shadow, as well as an external lighting, from which we are able to derive realistic rendering with high-frequency details via volumetric rendering. Extensive experiments demonstrate the advantage of our method over the state-of-the-art methods in photorealistic rendering, as well as various editing tasks such as novel pose synthesis and relighting. The code is available at https://github.com/iSEE-Laboratory/NECA. NECA, a novel framework for learning fully customizable neural human avatars from monocular or sparse-view videos. Human avatars need to be fully editable for diverse applications in the metaverse, telepresence, and 3D games. Previous methods only offer limited editing capabilities. Represents humans in dual spaces (canonical and surface) to capture high-frequency details and geometry-aware characteristics. Predicts disentangled neural fields for geometry, albedo, shadow, and lighting for flexible control. Trained in a self-supervised manner with photometric losses and normal regularization. Outperforms state-of-the-art methods in novel pose synthesis and relighting on ZJU-MoCap, NeuMan, DeepCap, DynaCap, and a synthetic dataset. Enables shape, texture, and shadow editing, including reshaping, retexturing, shadow removal, and local shadow transfer. Achieves high-fidelity rendering and diverse customization by disentangling neural fields and optimizing lighting representation. Performance can be sensitive to the accuracy of estimated SMPL parameters. Shadows under complex novel poses may be erroneous due to the lack of explicit visibility modeling. Future work includes exploring more robust shape and texture editing, as well as generalizing the method to handle multiple humans. human avatar, neural rendering, disentangled representation, customization, relighting
2403.10242 Report FDGaussian: Fast Gaussian Splatting from Single Image via Geometric-aware Diffusion Model Qijun Feng, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang Reconstructing detailed 3D objects from single-view images remains a challenging task due to the limited information available. In this paper, we introduce FDGaussian, a novel two-stage framework for single-image 3D reconstruction. Recent methods typically utilize pre-trained 2D diffusion models to generate plausible novel views from the input image, yet they encounter issues with either multi-view inconsistency or lack of geometric fidelity. To overcome these challenges, we propose an orthogonal plane decomposition mechanism to extract 3D geometric features from the 2D input, enabling the generation of consistent multi-view images. Moreover, we further accelerate the state-of-the-art Gaussian Splatting incorporating epipolar attention to fuse images from different viewpoints. We demonstrate that FDGaussian generates images with high consistency across different views and reconstructs high-quality 3D objects, both qualitatively and quantitatively. More examples can be found at our website https://qjfeng.net/FDGaussian/. Presents FDGaussian, a novel two-stage framework for single-image 3D reconstruction using a geometric-aware diffusion model and accelerated Gaussian Splatting. Addresses the limitations of current single-view 3D reconstruction methods that struggle with multi-view inconsistency or lack of geometric fidelity. 1. Employs an orthogonal plane decomposition mechanism to extract 3D geometric features from the input image for consistent multi-view image generation using a diffusion model. 2. Introduces epipolar attention to fuse the generated multi-view images during Gaussian Splatting, improving geometric reconstruction. 3. Proposes Gaussian Divergent Significance (GDS) to accelerate optimization by avoiding unnecessary split and clone operations. FDGaussian outperforms baseline methods in novel view synthesis and single-image 3D reconstruction on Objaverse and Google Scanned Objects datasets. The orthogonal plane decomposition mechanism significantly improves multi-view consistency and geometric accuracy. GDS accelerates the optimization process by up to 15 times without compromising reconstruction quality. The number of generated views is fixed, limiting potential efficiency gains for objects with varying topological symmetries. The current framework is limited to single-object reconstruction and cannot handle complex scenes or multiple objects. 3d reconstruction, gaussian splatting, diffusion model, multi-view consistency, single image
2403.10211 Report BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution Feng Li, Yixuan Wu, Zichao Liang, Runmin Cong, Huihui Bai, Yao Zhao, Meng Wang Diffusion models (DM) have achieved remarkable promise in image super-resolution (SR). However, most of them are tailored to solving non-blind inverse problems with fixed known degradation settings, limiting their adaptability to real-world applications that involve complex unknown degradations. In this work, we propose BlindDiff, a DM-based blind SR method to tackle the blind degradation settings in SISR. BlindDiff seamlessly integrates the MAP-based optimization into DMs, which constructs a joint distribution of the low-resolution (LR) observation, high-resolution (HR) data, and degradation kernels for the data and kernel priors, and solves the blind SR problem by unfolding the MAP approach along the reverse process. Unlike most DMs, BlindDiff firstly presents a modulated conditional transformer (MCFormer) that is pre-trained with noise and kernel constraints, further serving as a posterior sampler to provide both priors simultaneously. Then, we plug a simple yet effective kernel-aware gradient term between adjacent sampling iterations that guides the diffusion model to learn degradation consistency knowledge. This also enables joint refinement of the degradation model and the HR images by observing the previous denoised sample. With the MAP-based reverse diffusion process, we show that BlindDiff advocates alternate optimization for blur kernel estimation and HR image restoration in a mutually reinforcing manner. Experiments on both synthetic and real-world datasets show that BlindDiff achieves the state-of-the-art performance with significant model complexity reduction compared to recent DM-based methods. Code will be available at https://github.com/lifengcs/BlindDiff This paper proposes BlindDiff, a novel diffusion model-based blind image super-resolution method that integrates MAP-based optimization with diffusion models for robust and efficient super-resolution under unknown degradation settings. Most existing diffusion model-based super-resolution methods assume known degradation settings, limiting their applicability to real-world scenarios with complex and unknown degradations. BlindDiff addresses this limitation by jointly estimating the blur kernel and the high-resolution image in a mutually reinforcing manner. BlindDiff formulates the blind super-resolution problem under a maximum a posteriori (MAP) framework and unfolds it along the reverse diffusion process. It introduces a modulated conditional transformer (MCFormer) as the denoising network, trained with noise and kernel constraints to provide data and kernel priors. A kernel-aware gradient term guides the model to learn degradation consistency knowledge, enabling alternate optimization of blur kernels and HR images during the reverse process. BlindDiff achieves state-of-the-art performance on benchmark datasets, significantly outperforming existing DM-based methods in terms of FID and LPIPS. BlindDiff maintains high performance on both isotropic and anisotropic Gaussian blur kernels, demonstrating its robustness to different degradation types. BlindDiff demonstrates promising results on real-world images with unknown degradations, indicating its practical applicability. The computational cost of BlindDiff, although lower than other DM-based methods, is still higher than some CNN-based methods. Future work could focus on extending BlindDiff to handle more complex real-world degradation scenarios, such as spatially variant blur. blind super-resolution, diffusion models, map optimization, modulated conditional transformer, kernel estimation
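The kernel-aware gradient term described above is essentially a data-consistency correction applied between sampling iterations. Below is a rough, hedged PyTorch sketch of that idea (not BlindDiff's actual implementation): it nudges a denoised HR estimate toward agreement with the LR observation under the current blur-kernel estimate. The names `x0_hat`, `y_lr`, `kernel`, and `scale` are hypothetical inputs, and naive strided sub-sampling stands in for the true downscaler.

```python
import torch
import torch.nn.functional as F

def degradation_consistency_step(x0_hat, y_lr, kernel, scale, step_size=1.0):
    """One kernel-aware correction: descend on ||y_lr - downscale(kernel * x0_hat)||^2."""
    # x0_hat: (1, C, H, W) denoised HR estimate; y_lr: (1, C, H/scale, W/scale);
    # kernel: (1, 1, k, k) current blur-kernel estimate.
    x0_hat = x0_hat.detach().requires_grad_(True)
    c = x0_hat.shape[1]
    k = kernel / kernel.sum()                       # keep the kernel normalized
    weight = k.repeat(c, 1, 1, 1)                   # depth-wise blur, one copy per channel
    blurred = F.conv2d(x0_hat, weight, padding=k.shape[-1] // 2, groups=c)
    simulated_lr = blurred[..., ::scale, ::scale]   # naive sub-sampling as the downscaler
    loss = F.mse_loss(simulated_lr, y_lr)
    grad, = torch.autograd.grad(loss, x0_hat)
    return (x0_hat - step_size * grad).detach(), float(loss)
```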
2403.10191 Report Generative Region-Language Pretraining for Open-Ended Object Detection Chuang Lin, Yi Jiang, Lizhen Qu, Zehuan Yuan, Jianfei Cai In recent research, significant attention has been devoted to the open-vocabulary object detection task, aiming to generalize beyond the limited number of classes labeled during training and detect objects described by arbitrary category names at inference. Compared with conventional object detection, open vocabulary object detection largely extends the object detection categories. However, it relies on calculating the similarity between image regions and a set of arbitrary category names with a pretrained vision-and-language model. This implies that, despite its open-set nature, the task still needs the predefined object categories during the inference stage. This raises the question: What if we do not have exact knowledge of object categories during inference? In this paper, we refer to this new setting as generative open-ended object detection, which is a more general and practical problem. To address it, we formulate object detection as a generative problem and propose a simple framework named GenerateU, which can detect dense objects and generate their names in a free-form way. Particularly, we employ Deformable DETR as a region proposal generator with a language model translating visual regions to object names. To assess the free-form object detection task, we introduce an evaluation method designed to quantitatively measure the performance of generative outcomes. Extensive experiments demonstrate strong zero-shot detection performance of our GenerateU. For example, on the LVIS dataset, our GenerateU achieves comparable results to the open-vocabulary object detection method GLIP, even though the category names are not seen by GenerateU during inference. Code is available at: https://github.com/FoundationVision/GenerateU. Introduces "generative open-ended object detection," a new object detection paradigm that eliminates the need for predefined object categories during inference by formulating it as a generative problem. Addresses limitations of existing open-vocabulary object detection methods that still require predefined categories during inference, aiming for a more general and practical approach. Proposes GenerateU, a novel end-to-end framework comprising an open-world object detector and a language model. Leverages a small set of human-annotated object-language paired data and scales up vocabulary size with massive image-text pairs, using a pseudo-labeling method to enrich label diversity. GenerateU achieves comparable results to open-vocabulary object detection methods on zero-shot LVIS, despite not seeing object categories during inference. End-to-end training of both image encoder and language model is crucial for optimal performance in generative open-ended object detection. Beam search significantly improves recognition of rare object categories, effectively addressing the long-tail problem. Future work includes investigating the impact of training data scale on performance. Exploring more sophisticated pseudo-labeling methods beyond the naive approach used in the paper is another promising direction. open-ended object detection, generative object detection, zero-shot learning, multimodal learning, vision and language
2403.10179 Report Animate Your Motion: Turning Still Images into Dynamic Videos Mingxiao Li, Bo Wan, Marie-Francine Moens, Tinne Tuytelaars In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs to more accurately reflect user intentions. Traditional efforts predominantly focus on employing either semantic cues, like images or depth maps, or motion-based conditions, like moving sketches or object bounding boxes. Semantic inputs offer a rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as demonstrated in Fig 1. To this end, we introduce the Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions, promoting synergy between different modalities. For model training, we separate the conditions for the two modalities, introducing a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence. The paper introduces Scene and Motion Conditional Diffusion (SMCD), a novel diffusion-based video generation model that leverages both scene and motion cues (images and bounding box sequences) alongside text prompts. Existing text-to-video generation methods often struggle to accurately reflect user intentions, relying solely on either semantic (images, depth maps) or motion-based (sketches, bounding boxes) conditions. SMCD addresses this by integrating both, allowing for more customized and controlled video generation. SMCD, built upon a pretrained text-to-video diffusion model, incorporates a motion integration module (MIM) for encoding box locations and a dual image integration module (DIIM) for embedding image conditions. It employs a two-stage training pipeline, focusing first on motion integration and then on image and temporal coherence. SMCD significantly outperforms existing methods in terms of video quality (FVD), demonstrating the effectiveness of incorporating both scene and motion conditions. The model accurately grounds objects to their specified trajectories while preserving the semantic details of the input image. Ablation studies highlight the importance of both MIM and DIIM, demonstrating that their synergistic integration within SMCD yields optimal results. Relying solely on bounding boxes for motion control can be insufficient as similar changes can result from camera movement, necessitating the incorporation of camera constraints in future work. SMCD currently faces challenges in generating high-quality videos featuring humans, inheriting this limitation from the pretrained ModelScope backbone. video generation, controllable generation, diffusion models, multimodal learning, scene and motion conditioning
2403.10166 Report SemanticHuman-HD: High-Resolution Semantic Disentangled 3D Human Generation Peng Zheng, Tao Liu, Zili Yi, Rui Ma With the development of neural radiance fields and generative models, numerous methods have been proposed for learning 3D human generation from 2D images. These methods allow control over the pose of the generated 3D human and enable rendering from different viewpoints. However, none of these methods explore semantic disentanglement in human image synthesis, i.e., they can not disentangle the generation of different semantic parts, such as the body, tops, and bottoms. Furthermore, existing methods are limited to synthesize images at $512^2$ resolution due to the high computational cost of neural radiance fields. To address these limitations, we introduce SemanticHuman-HD, the first method to achieve semantic disentangled human image synthesis. Notably, SemanticHuman-HD is also the first method to achieve 3D-aware image synthesis at $1024^2$ resolution, benefiting from our proposed 3D-aware super-resolution module. By leveraging the depth maps and semantic masks as guidance for the 3D-aware super-resolution, we significantly reduce the number of sampling points during volume rendering, thereby reducing the computational cost. Our comparative experiments demonstrate the superiority of our method. The effectiveness of each proposed component is also verified through ablation studies. Moreover, our method opens up exciting possibilities for various applications, including 3D garment generation, semantic-aware image synthesis, controllable image synthesis, and out-of-domain image synthesis. This paper proposes SemanticHuman-HD, a novel method for high-resolution (1024 x 1024) 3D-aware human image synthesis with semantic disentanglement, allowing independent generation and manipulation of different semantic parts (e.g., body, tops, bottoms). Existing methods for 3D human image synthesis lack semantic disentanglement or are limited to lower resolutions, hindering applications like virtual try-on and garment generation. The method employs a two-stage training process: (1) synthesizing images, depth maps, semantic masks, and normal maps at 256 x 256 resolution using a semantic disentangled NeRF with local generators; (2) upsampling to 1024 x 1024 resolution using a novel 3D-aware super-resolution module guided by the depth and semantic information. SemanticHuman-HD achieves superior image quality (measured by FID and KID) compared to state-of-the-art methods at both 512 x 512 and 1024 x 1024 resolutions. The method enables various applications, including semantic-aware virtual try-on, 3D garment generation, controllable image synthesis, and out-of-domain image synthesis. The proposed 3D-aware super-resolution module significantly reduces computational cost by reducing the number of sampling points during volume rendering. The quality of synthesized results is limited by the diversity of poses and viewpoints in the training dataset. Achieving realistic hand deformations remains a challenge. generative models, 3d human image synthesis, semantic disentanglement, neural radiance fields (nerf), super-resolution
2403.10147 Report GGRt: Towards Pose-free Generalizable 3D Gaussian Splatting in Real-time Hao Li, Yuanyuan Gao, Chenming Wu, Dingwen Zhang, Yalun Dai, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, complexity in processing high-resolution images, and lengthy optimization processes, thus facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in real-world scenarios. Specifically, we design a novel joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model. With the joint learning mechanism, the proposed framework can inherently estimate robust relative pose information from the image observations and thus primarily alleviate the requirement of real camera poses. Moreover, we implement a deferred back-propagation mechanism that enables high-resolution training and inference, overcoming the resolution constraints of previous methods. To enhance the speed and efficiency, we further introduce a progressive Gaussian cache module that dynamically adjusts during training and inference. As the first pose-free generalizable 3D-GS framework, GGRt achieves inference at $\ge$ 5 FPS and real-time rendering at $\ge$ 100 FPS. Through extensive experimentation, we demonstrate that our method outperforms existing NeRF-based pose-free techniques in terms of inference speed and effectiveness. It can also approach the real pose-based 3D-GS methods. Our contributions provide a significant leap forward for the integration of computer vision and computer graphics into practical applications, offering state-of-the-art results on LLFF, KITTI, and Waymo Open datasets and enabling real-time rendering for immersive experiences. GGRt is the first pose-free generalizable 3D Gaussian splatting framework for novel view synthesis, achieving real-time rendering at over 100 FPS and inference speeds exceeding 5 FPS. Existing generalizable novel view synthesis methods suffer from limitations such as requiring real camera poses, struggling with high-resolution images, and lacking real-time rendering capabilities. This limits their applicability in real-world scenarios. GGRt consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model trained jointly. It utilizes a deferred back-propagation mechanism for high-resolution processing and a Gaussians cache module for efficiency. Outperforms existing NeRF-based pose-free techniques in terms of inference speed and effectiveness. Achieves competitive performance compared to pose-based 3D-GS methods, even without camera pose prior. Enables real-time rendering at over 100 FPS and inference at over 5 FPS, outperforming previous state-of-the-art. Relies on the assumption of static scenes, limiting its application in dynamic environments. Future work includes exploring the integration of temporal information to handle dynamic objects. novel view synthesis, 3d gaussian splatting, pose-free, generalizable, real-time rendering
2403.10133 Report E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance Tianrui Huang, Pu Cao, Lu Yang, Chun Liu, Mengjie Hu, Zhiwei Liu, Qing Song Diffusion-based image editing is a composite process of preserving the source image content and generating new content or applying modifications. While current editing approaches have made improvements under text guidance, most of them have only focused on preserving the information of the input image, disregarding the importance of editability and alignment to the target prompt. In this paper, we prioritize the editability by proposing a zero-shot image editing method, named Enhance Editability for text-based image Editing via Efficient CLIP guidance (E4C), which only requires inference-stage optimization to explicitly enhance the editability and text alignment. Specifically, we develop a unified dual-branch feature-sharing pipeline that enables the preservation of the structure or texture of the source image while allowing the other to be adapted based on the editing task. We further integrate CLIP guidance into our pipeline by utilizing our novel random-gateway optimization mechanism to efficiently enhance the semantic alignment with the target prompt. Comprehensive quantitative and qualitative experiments demonstrate that our method effectively resolves the text alignment issues prevalent in existing methods while maintaining the fidelity to the source image, and performs well across a wide range of editing tasks. Introduces E4C, a zero-shot text-guided image editing method enhancing editability and text alignment via efficient CLIP guidance, addressing limitations in handling diverse editing tasks and text alignment in existing methods. Existing methods struggle to handle both structure-consistent and non-rigid editing tasks and often prioritize preserving source information over aligning new content with the target prompt. Employs a dual-branch feature-sharing pipeline for adaptive preservation of source image information, combined with a random-gateway optimization mechanism for efficient CLIP guidance to enhance text alignment. Achieves superior visual quality across various editing tasks compared to existing methods. Demonstrates higher CLIP score, indicating better text alignment, while maintaining comparable image fidelity. Exhibits effectiveness in handling hard samples, like multi-object scenarios and complex shape/pose changes. Exhibits limitations in the human face domain, especially with high-resolution images. Ambiguous language descriptions can lead to unreasonable visual representations. diffusion model, text-based image editing, clip guidance, image manipulation, zero-shot learning
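Since the core ingredient here is CLIP guidance toward the target prompt, a minimal, hedged sketch of a CLIP alignment objective is shown below, using the Hugging Face CLIP API as a stand-in. E4C backpropagates a similar objective through randomly selected "gateway" steps; this snippet only scores how well an edited image matches the prompt and is not the paper's exact mechanism.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_alignment_loss(image, target_prompt):
    # Lower loss = the (edited) image agrees better with the target prompt.
    inputs = processor(text=[target_prompt], images=image,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 1.0 - (img * txt).sum(dim=-1).mean()
```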
2403.10098 Report DiffMAC: Diffusion Manifold Hallucination Correction for High Generalization Blind Face Restoration Nan Gao, Jia Li, Huaibo Huang, Zhi Zeng, Ke Shang, Shuwu Zhang, Ran He Blind face restoration (BFR) is a highly challenging problem due to the uncertainty of degradation patterns. Current methods have low generalization across photorealistic and heterogeneous domains. In this paper, we propose a Diffusion-Information-Diffusion (DID) framework to tackle diffusion manifold hallucination correction (DiffMAC), which achieves high-generalization face restoration in diverse degraded scenes and heterogeneous domains. Specifically, the first diffusion stage aligns the restored face with spatial feature embedding of the low-quality face based on AdaIN, which synthesizes degradation-removal results but with uncontrollable artifacts for some hard cases. Based on Stage I, Stage II considers information compression using manifold information bottleneck (MIB) and finetunes the first diffusion model to improve facial fidelity. DiffMAC effectively fights against blind degradation patterns and synthesizes high-quality faces with attribute and identity consistencies. Experimental results demonstrate the superiority of DiffMAC over state-of-the-art methods, with a high degree of generalization in real-world and heterogeneous settings. The source code and models will be public. Proposes DiffMAC, a Diffusion-Information-Diffusion (DID) framework for high-generalization blind face restoration (BFR) across photorealistic and heterogeneous domains. Current BFR methods struggle with generalization across diverse degraded scenes and heterogeneous domains, especially for severely degraded images. DID uses two stages: 1) Aligns restored face with LQ face features using AdaIN-based diffusion. 2) Applies Manifold Information Bottleneck (MIB) for information compression and finetunes the diffusion model for fidelity improvement with identity preservation. Achieves high-fidelity BFR in photorealistic and heterogeneous domains, outperforming state-of-the-art methods. Effectively tackles diffusion manifold hallucination correction by disentangling restoration-relevant and irrelevant information. Demonstrates the effectiveness of MIB with identity information injection for controllable and high-quality BFR. Challenges remain in handling BFR for unseen scenarios with severely degraded facial contours. Inference time is longer than some methods due to the two-stage design with MIB; exploring efficient distillation of DDIM sampling is planned. blind face restoration, diffusion models, information bottleneck, generative adversarial networks, image restoration
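Stage I's alignment is built on AdaIN. For reference, here is the standard AdaIN operation (a generic sketch, not DiffMAC's full conditioning module): it re-normalizes content features so their channel-wise statistics match those of the guidance features.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    # Align channel-wise mean/std of content features to those of the guidance features.
    # Both tensors: (B, C, H, W).
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean
```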
2403.10071 Report Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling Baoquan Zhang, Huaibin Wang, Luo Chuyao, Xutao Li, Liang Guotao, Yunming Ye, Xiaochen Qi, Yao He Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis, which aims to represent an image with a discrete token sequence. Existing studies effectively address this problem by learning a discrete codebook from scratch and in a code-independent manner to quantize continuous representations into discrete tokens. However, learning a codebook from scratch and in a code-independent manner is highly challenging, which may be a key reason causing codebook collapse, i.e., some code vectors can rarely be optimized without regard to the relationship between codes and good codebook priors, and thus eventually die off. In this paper, inspired by pretrained language models, we find that these language models have actually pretrained a superior codebook via a large text corpus, but such information is rarely exploited in VQIM. To this end, we propose a novel codebook transfer framework with part-of-speech, called VQCT, which aims to transfer a well-trained codebook from pretrained language models to VQIM for robust codebook learning. Specifically, we first introduce a pretrained codebook from language models and part-of-speech knowledge as priors. Then, we construct a vision-related codebook with these priors for achieving codebook transfer. Finally, a novel codebook transfer network is designed to exploit abundant semantic relationships between codes contained in pretrained codebooks for robust VQIM codebook learning. Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods. Proposes VQCT, a novel codebook transfer framework using part-of-speech, to improve Vector-Quantized Image Modeling (VQIM) by transferring pretrained codebooks from language models (e.g., CLIP) to enhance VQIM codebook learning and alleviate codebook collapse. VQIM suffers from codebook collapse where many code vectors remain unoptimized. Existing methods learn codebooks from scratch, neglecting potentially beneficial relationships between codes. This paper argues that leveraging pretrained language model codebooks can provide rich semantic information and relationships for more robust VQIM. 1. Construct vision-related codebooks (adjective and noun) from pretrained language models using part-of-speech filtering. 2. Design a graph convolution-based codebook transfer network to transfer knowledge from these codebooks to VQIM. 3. Use the transferred codebooks for quantizing continuous image representations. VQCT outperforms state-of-the-art VQIM methods in image reconstruction tasks on four datasets. VQCT demonstrates higher codebook utilization compared to baselines, indicating alleviation of codebook collapse. VQCT shows promising results on downstream semantic image synthesis tasks. VQCT's performance improvement depends on the quality and relevance of the chosen pretrained language model. Further exploration of better strategies for transferring codebook knowledge from language to vision domain is needed. vqim, codebook transfer, pretrained language models, image synthesis, codebook collapse
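For context, the quantization step that the transferred codebook plugs into is the usual nearest-code lookup with a straight-through estimator. A minimal generic sketch follows; in VQCT the `codebook` would be produced by the transfer network from pretrained language-model embeddings rather than learned from scratch.

```python
import torch

def vector_quantize(z, codebook):
    # z: (N, D) encoder features; codebook: (K, D) code vectors.
    dists = torch.cdist(z, codebook)    # (N, K) pairwise L2 distances
    idx = dists.argmin(dim=1)           # nearest code per feature
    z_q = codebook[idx]
    # Straight-through estimator: copy gradients from z_q back to the encoder.
    z_q = z + (z_q - z).detach()
    return z_q, idx
```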
2403.10050 Report Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing Tian-Xing Xu, Wenbo Hu, Yu-Kun Lai, Ying Shan, Song-Hai Zhang 3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering. However, it couples the appearance and geometry of the scene within the Gaussian attributes, which hinders the flexibility of editing operations, such as texture swapping. To address this issue, we propose a novel approach, namely Texture-GS, to disentangle the appearance from the geometry by representing it as a 2D texture mapped onto the 3D surface, thereby facilitating appearance editing. Technically, the disentanglement is achieved by our proposed texture mapping module, which consists of a UV mapping MLP to learn the UV coordinates for the 3D Gaussian centers, a local Taylor expansion of the MLP to efficiently approximate the UV coordinates for the ray-Gaussian intersections, and a learnable texture to capture the fine-grained appearance. Extensive experiments on the DTU dataset demonstrate that our method not only facilitates high-fidelity appearance editing but also achieves real-time rendering on consumer-level devices, e.g. a single RTX 2080 Ti GPU. Texture-GS disentangles geometry and texture for 3D Gaussian Splatting, enabling real-time appearance editing like texture swapping. 3D Gaussian Splatting, despite its efficiency, entangles appearance and geometry, hindering flexible editing. Texture-GS overcomes this limitation. It uses a UV mapping MLP with Taylor expansion for efficient ray-Gaussian intersection to UV mapping, representing appearance in a 2D texture. Reconstructs smooth, high-quality 2D texture maps from multi-view images. Enables global texture swapping and fine-grained texture editing. Achieves real-time rendering speed (58 FPS on RTX 2080 Ti) for interactive editing. Blurring at edges due to inaccurate Gaussian orientations impacting UV mapping. Single UV space limits representation for scenes with multiple objects or complex geometries. 3d gaussian splatting, texture mapping, neural rendering, appearance editing, real-time rendering
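The "local Taylor expansion of the MLP" mentioned in the abstract can be pictured as a first-order approximation of the UV mapping around each Gaussian center, so that ray-Gaussian intersections do not each require an MLP query. A hedged PyTorch sketch of that idea, with hypothetical `uv_mlp`, `center`, and `points`:

```python
import torch

def approx_uv(uv_mlp, center, points):
    # uv_mlp: MLP mapping a 3D point (3,) to UV coordinates (2,)
    # center: (3,) Gaussian center; points: (N, 3) ray-Gaussian intersections.
    center = center.detach().requires_grad_(True)
    uv_c = uv_mlp(center)                                     # (2,)
    jac = torch.autograd.functional.jacobian(uv_mlp, center)  # (2, 3)
    # First-order Taylor expansion: uv(x) ~= uv(c) + J(c) (x - c),
    # so only one MLP query (plus one Jacobian) is needed per Gaussian.
    return uv_c + (points - center) @ jac.T                   # (N, 2)
```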
2403.10004 Report ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images Xiangtian Xue, Jiasong Wu, Youyong Kong, Lotfi Senhadji, Huazhong Shu We present a novel image editing scenario termed Text-grounded Object Generation (TOG), defined as generating a new object in the real image spatially conditioned by textual descriptions. Existing diffusion models exhibit limitations of spatial perception in complex real-world scenes, relying on additional modalities to enforce constraints, and TOG imposes heightened challenges on scene comprehension under the weak supervision of linguistic information. We propose a universal framework ST-LDM based on Swin-Transformer, which can be integrated into any latent diffusion model with training-free backward guidance. ST-LDM encompasses a global-perceptual autoencoder with adaptable compression scales and hierarchical visual features, parallel with deformable multimodal transformer to generate region-wise guidance for the subsequent denoising process. We transcend the limitation of traditional attention mechanisms that only focus on existing visual features by introducing deformable feature alignment to hierarchically refine spatial positioning fused with multi-scale visual and linguistic information. Extensive Experiments demonstrate that our model enhances the localization of attention mechanisms while preserving the generative capabilities inherent to diffusion models. This paper introduces Text-grounded Object Generation (TOG), a novel image editing task focused on generating new objects in real images based on textual descriptions of visual and spatial attributes, and proposes ST-LDM, a universal framework to address this task. Existing diffusion models struggle with spatial understanding in complex scenes and rely on additional modalities for spatial control. TOG addresses this by leveraging the flexibility and naturalness of language for object placement in images. ST-LDM uses a Swin-Transformer-based autoencoder for adaptable latent representation and a parallel multimodal transformer to generate spatial guidance. It introduces deformable feature alignment to refine object placement using multi-scale visual and linguistic features and integrates with LDMs via training-free backward guidance. ST-LDM demonstrates superior performance compared to existing text-guided editing models, particularly in complex scenes. Deformable feature alignment is shown to significantly improve object localization accuracy while preserving the generative capabilities of diffusion models. Quantitative and qualitative evaluations on a newly constructed benchmark dataset showcase the effectiveness and robustness of the proposed approach. Current implementation requires separate input of appearance and spatial descriptions, which limits its practical application in real-world scenarios where integrated statements are common. The editing process can sometimes lead to slight changes in irrelevant regions near the generated object, highlighting the need for further exploration of methods to maintain pixel-level fidelity. image editing, text-guided generation, deformable feature alignment, latent diffusion models, swin-transformer
2403.09981 Report Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu While text-to-3D and image-to-3D generation tasks have received considerable attention, one important but under-explored field between them is controllable text-to-3D generation, which we mainly focus on in this work. To address this task, 1) we introduce Multi-view ControlNet (MVControl), a novel neural network architecture designed to enhance existing pre-trained multi-view diffusion models by integrating additional input conditions, such as edge, depth, normal, and scribble maps. Our innovation lies in the introduction of a conditioning module that controls the base diffusion model using both local and global embeddings, which are computed from the input condition images and camera poses. Once trained, MVControl is able to offer 3D diffusion guidance for optimization-based 3D generation. And, 2) we propose an efficient multi-stage 3D generation pipeline that leverages the benefits of recent large reconstruction models and score distillation algorithm. Building upon our MVControl architecture, we employ a unique hybrid diffusion guidance method to direct the optimization process. In pursuit of efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations. We also pioneer the use of SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces. This approach alleviates the issue of poor geometry in 3D Gaussians and enables the direct sculpting of fine-grained geometry on the mesh. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content. This paper introduces MVControl, a novel neural network architecture for controllable text-to-3D generation, and proposes an efficient multi-stage pipeline for generating high-quality textured 3D meshes from 3D Gaussians. Controllable text-to-3D generation is an important but under-explored area, and existing methods are either time-consuming or struggle to produce high-quality results. This work addresses these limitations. MVControl, a multi-view variant of ControlNet, is trained on a large 3D dataset to enable controllable text-to-multi-view image generation. These images are then used to initialize a set of coarse 3D Gaussians, which are further optimized using a hybrid diffusion guidance approach and SuGaR regularization. Finally, a textured mesh is extracted and refined. MVControl effectively controls multi-view image generation, enabling fine-grained control over content and achieving view consistency. The proposed 3D generation pipeline outperforms existing Gaussian-based mesh generation approaches, producing high-fidelity and detailed textured meshes. The hybrid diffusion guidance approach combining MVControl and a 2D diffusion model effectively optimizes the geometry and texture of the generated 3D assets. The current implementation requires a separate 2D diffusion model for texture refinement. Further exploration of different 3D Gaussian initialization strategies could improve efficiency. controllable 3d generation, gaussian splatting, sugar, multi-view diffusion models, score distillation sampling
2403.09977 Report EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba Xiaohuan Pei, Tao Huang, Chang Xu Prior efforts in light-weight model development mainly centered on CNN and Transformer-based designs yet faced persistent challenges. CNNs adept at local feature extraction compromise resolution while Transformers offer global reach but escalate computational demands to $\mathcal{O}(N^2)$. This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs), such as Mamba, have shown outstanding performance and competitiveness in various tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to $\mathcal{O}(N)$. Inspired by this, this work proposes to explore the potential of visual state space models in light-weight model design and introduce a novel efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba integrates an atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevates model performance. Experimental results show that EfficientVMamba scales down the computational complexity while yielding competitive results across a variety of vision tasks. For example, our EfficientVMamba-S with $1.3$G FLOPs improves Vim-Ti with $1.5$G FLOPs by a large margin of $5.6\%$ accuracy on ImageNet. Code is available at: https://github.com/TerryPei/EfficientVMamba. This paper introduces EfficientVMamba, a lightweight state-space model for vision tasks that efficiently balances global and local feature extraction by combining an atrous-based selective scan mechanism with convolutional branches. Existing lightweight models, based on CNNs or Transformers, struggle to achieve both global representation and computational efficiency. EfficientVMamba addresses this by utilizing the linear complexity of state space models for global context while integrating convolutions for local features. The authors propose Efficient 2D Scanning (ES2D) using skip sampling for efficient global representation. They introduce an Efficient Visual State Space (EVSS) block merging ES2D with a convolutional branch enhanced by Squeeze-and-Excitation. An 'inverted' insertion strategy prioritizes EVSS in early stages and convolutions in later stages. EfficientVMamba achieves state-of-the-art accuracy with reduced FLOPs on ImageNet classification compared to CNN-based and Transformer-based counterparts. It shows superior performance on COCO object detection using RetinaNet, exceeding models with larger parameter counts. EfficientVMamba demonstrates competitive results on ADE20K semantic segmentation with UperNet, highlighting its efficient and accurate segmentation capability. The computational design of SSMs is more complex than convolutions or self-attention, posing challenges for parallel processing. Future work can explore further optimization of computational efficiency and scalability for visual state space models. light-weight architecture, efficient network, state space model, atrous selective scan, vision transformer
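The atrous/skip-sampling idea behind ES2D can be illustrated by partitioning a feature map into interleaved sub-maps and scanning each shorter sequence separately, then stitching the results back. This is a hedged sketch of that decomposition only, not the exact ES2D implementation:

```python
import torch

def skip_sample_groups(x, stride=2):
    # x: (B, C, H, W). Partition into stride**2 interleaved sub-maps; each one is
    # scanned as a shorter 1D sequence (global reach with fewer tokens per scan).
    groups = [x[:, :, i::stride, j::stride]
              for i in range(stride) for j in range(stride)]
    seqs = [g.flatten(2).transpose(1, 2) for g in groups]   # (B, H*W/stride**2, C) each
    return groups, seqs

def merge_groups(groups, stride=2):
    # Inverse of skip_sample_groups: interleave the sub-maps back into a full map.
    B, C, h, w = groups[0].shape
    out = groups[0].new_zeros(B, C, h * stride, w * stride)
    for idx, g in enumerate(groups):
        i, j = divmod(idx, stride)
        out[:, :, i::stride, j::stride] = g
    return out
```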
2403.09939 Report Quantization Effects on Neural Networks Perception: How would quantization change the perceptual field of vision models? Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Alessandro Bruno Neural network quantization is an essential technique for deploying models on resource-constrained devices. However, its impact on model perceptual fields, particularly regarding class activation maps (CAMs), remains a significant area of investigation. In this study, we explore how quantization alters the spatial recognition ability of the perceptual field of vision models, shedding light on the alignment between CAMs and visual saliency maps across various architectures. Leveraging a dataset of 10,000 images from ImageNet, we rigorously evaluate six diverse foundational CNNs: VGG16, ResNet50, EfficientNet, MobileNet, SqueezeNet, and DenseNet. We uncover nuanced changes in CAMs and their alignment with human visual saliency maps through systematic quantization techniques applied to these models. Our findings reveal the varying sensitivities of different architectures to quantization and underscore its implications for real-world applications in terms of model performance and interpretability. The primary contribution of this work revolves around deepening our understanding of neural network quantization, providing insights crucial for deploying efficient and interpretable models in practical settings. This paper investigates the impact of quantization on the perceptual fields of neural network vision models by analyzing how quantization affects Class Activation Maps (CAMs) and their alignment with human visual saliency maps. Quantization is essential for deploying models on resource-constrained devices, but its impact on model interpretability, particularly regarding CAMs, needs to be understood. The study uses a dataset of 10,000 ImageNet images and six foundational CNN architectures (VGG16, ResNet50, EfficientNet, MobileNet, SqueezeNet, DenseNet). The authors apply quantization techniques, generate CAMs and visual saliency maps, and compare them using metrics like Similarity, Kullback-Leibler Divergence, and Pearson Correlation. Quantization with int16 precision often yields a better balance between model efficiency and alignment with human perception compared to f32 and int8. MobileNet and SqueezeNet demonstrate high robustness to quantization, maintaining consistent CAM alignment with visual saliency. EfficientNet shows higher sensitivity to quantization, exhibiting more significant changes in CAMs and reduced alignment with human perception. The study primarily focuses on image classification tasks and a limited set of architectures. Future work can explore the impact of quantization on other vision tasks and more complex architectures. neural network quantization, class activation maps, model interpretability, visual saliency, computer vision
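The CAM-to-saliency comparison relies on standard saliency evaluation metrics. Below is a compact reference implementation of the three metrics named above, using the generic formulas from the saliency literature rather than the authors' exact code:

```python
import numpy as np
from scipy.stats import pearsonr

def normalize_map(m, eps=1e-12):
    m = m.astype(np.float64) - m.min()
    return m / (m.sum() + eps)

def similarity(cam, sal):
    # Histogram intersection of the two normalized maps (higher = more aligned).
    return float(np.minimum(normalize_map(cam), normalize_map(sal)).sum())

def kl_divergence(cam, sal, eps=1e-12):
    # KL(saliency || CAM): lower = the CAM better covers the human fixation map.
    p, q = normalize_map(sal), normalize_map(cam)
    return float((p * np.log(p / (q + eps) + eps)).sum())

def pearson_cc(cam, sal):
    return float(pearsonr(cam.ravel(), sal.ravel())[0])
```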
2403.09746 Report PICNIQ: Pairwise Comparisons for Natural Image Quality Assessment Nicolas Chahine, Sira Ferradans, Jean Ponce Blind image quality assessment (BIQA) approaches, while promising for automating image quality evaluation, often fall short in real-world scenarios due to their reliance on a generic quality standard applied uniformly across diverse images. This one-size-fits-all approach overlooks the crucial perceptual relationship between image content and quality, leading to a 'domain shift' challenge where a single quality metric inadequately represents various content types. Furthermore, BIQA techniques typically overlook the inherent differences in the human visual system among different observers. In response to these challenges, this paper introduces PICNIQ, an innovative pairwise comparison framework designed to bypass the limitations of conventional BIQA by emphasizing relative, rather than absolute, quality assessment. PICNIQ is specifically designed to assess the quality differences between image pairs. The proposed framework implements a carefully crafted deep learning architecture, a specialized loss function, and a training strategy optimized for sparse comparison settings. By employing psychometric scaling algorithms like TrueSkill, PICNIQ transforms pairwise comparisons into just-objectionable-difference (JOD) quality scores, offering a granular and interpretable measure of image quality. We conduct our research using comparison matrices from the PIQ23 dataset, which are published in this paper. Our extensive experimental analysis showcases PICNIQ's broad applicability and superior performance over existing models, highlighting its potential to set new standards in the field of BIQA. This appendix presents supplementary information to the main paper "PICNIQ: Pairwise Comparisons for Natural Image Quality Assessment," showing examples of PICNIQ's preference predictions for image quality, with a focus on comparisons to the PIQ23 dataset. The appendix provides visual evidence and analysis to support the claims made in the main paper about PICNIQ's performance in predicting human image quality preferences. The appendix uses visual examples of image pairs, comparison matrices for different scenes and attributes, and probability distribution plots for PIQ23 dataset to illustrate PICNIQ's prediction capabilities. PICNIQ demonstrates more logical and precise image quality comparisons than previous methods, even for challenging cases. The comparison matrices highlight PICNIQ's ability to differentiate between different scenes and attributes. The PIQ23 dataset shows imbalances in its distribution, with a bias towards forced-choice pairs (0s and 1s). The appendix relies heavily on visual examples, which may be subjective and open to interpretation. Further investigation is needed to address the distribution imbalances in the PIQ23 dataset. image quality assessment, pairwise comparisons, picniq, piq23, preference prediction
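As a rough illustration of psychometric scaling from a comparison matrix to scalar quality scores, here is a hedged sketch using the `trueskill` package. The paper's JOD scaling pipeline is more involved, and the matrix layout assumed here (entry [i, j] counts how often image i beat image j) is an assumption for the example.

```python
import numpy as np
import trueskill

def scale_comparisons(comparison_matrix):
    # comparison_matrix[i, j]: number of times image i was preferred over image j.
    env = trueskill.TrueSkill(draw_probability=0.0)
    ratings = [env.create_rating() for _ in range(comparison_matrix.shape[0])]
    for i in range(len(ratings)):
        for j in range(len(ratings)):
            for _ in range(int(comparison_matrix[i, j])):
                # i is the winner of this pairwise comparison.
                ratings[i], ratings[j] = env.rate_1vs1(ratings[i], ratings[j])
    return np.array([r.mu for r in ratings])   # scalar quality scores per image
```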
2403.09669 Report STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models Pum Jun Kim, Seojun Kim, Jaejun Yoo Image generative models have made significant progress in generating realistic and diverse images, supported by comprehensive guidance from various evaluation metrics. However, current video generative models struggle to generate even short video clips, with limited tools that provide insights for improvements. Current video evaluation metrics are simple adaptations of image metrics by switching the embeddings with video embedding networks, which may underestimate the unique characteristics of video. Our analysis reveals that the widely used Frechet Video Distance (FVD) has a stronger emphasis on the spatial aspect than the temporal naturalness of video and is inherently constrained by the input size of the embedding networks used, limiting it to 16 frames. Additionally, it demonstrates considerable instability and diverges from human evaluations. To address the limitations, we propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects. This feature allows comprehensive analysis and evaluation of video generative models from various perspectives, unconstrained by video length. We provide analytical and experimental evidence demonstrating that STREAM provides an effective evaluation tool for both visual and temporal quality of videos, offering insights into area of improvement for video generative models. To the best of our knowledge, STREAM is the first evaluation metric that can separately assess the temporal and spatial aspects of videos. Our code is available at https://github.com/pro2nit/STREAM. This paper proposes STREAM, a novel video evaluation metric designed to independently assess the spatial and temporal aspects of videos generated by generative models. Existing video evaluation metrics, often adapted from image metrics, fail to adequately capture the unique characteristics of video data, particularly temporal consistency. This limits the development and analysis of increasingly sophisticated video generative models. STREAM leverages an image embedding network to encode individual video frames, enabling separate analysis of spatial and temporal aspects. STREAM-T evaluates temporal flow by analyzing the skewness of the power law distribution of frequency amplitudes over time. STREAM-S evaluates spatial quality through STREAM-F (fidelity) and STREAM-D (diversity) by adapting precision and recall calculations to video data. STREAM effectively captures visual and temporal degradation in both synthetic and real-world video data, showing consistent and interpretable results across various experiments. Analysis of popular video generative models using STREAM reveals challenges in generating realistic and diverse videos, especially as video length increases. Unlike FVD, which is limited by its embedding network, STREAM can evaluate videos of arbitrary length, supporting the development of long-form video generation models. While STREAM-T effectively evaluates temporal flow, it remains agnostic to the direction of time, potentially leading to limitations in specific scenarios. Future work could explore incorporating a human judgment study to further validate and calibrate STREAM, particularly in terms of its ability to quantify video diversity. video generation, evaluation metric, generative models, computer vision, deep learning
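STREAM-T's temporal statistic is described as the skewness of the frequency-amplitude profile over time. The snippet below is only a rough, hedged sketch of that kind of measurement on raw pixels; the official metric operates differently (on embedded frames) and is available in the linked repository.

```python
import numpy as np
from scipy.stats import skew

def temporal_amplitude_skewness(video):
    # video: (T, H, W, C) array in [0, 1]. FFT along time, then examine how the
    # mean amplitude decays with frequency (a power-law-like profile).
    spec = np.abs(np.fft.rfft(video, axis=0))          # (F, H, W, C)
    amp_per_freq = spec.reshape(spec.shape[0], -1).mean(axis=1)
    return float(skew(np.log(amp_per_freq[1:] + 1e-12)))   # skip the DC component
```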
2403.09638 Report SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior Huan-ang Gao, Mingju Gao, Jiaju Li, Wenyi Li, Rong Zhi, Hao Tang, Hao Zhao Semantic image synthesis (SIS) shows good promise for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has yielded exceptional results, achieving an FID of 10.53 on Cityscapes and 12.66 on ADE20K. The code and models can be accessed via the project page. This paper introduces SCP-Diff, a novel approach for photo-realistic semantic image synthesis using spatial-categorical joint priors with diffusion models. Current GAN-based semantic image synthesis methods struggle to achieve photorealism, and diffusion-based methods like ControlNet face challenges with sub-par image quality and misalignment with semantic masks. The authors propose pre-computed noise priors (spatial, categorical, and joint) derived from real image latents to guide the inference process of a finetuned ControlNet model, tackling the distribution mismatch between training and inference. SCP-Diff achieves state-of-the-art FID scores on Cityscapes (10.53) and ADE20K (12.66) datasets, significantly improving upon previous methods. The joint prior effectively combines the strengths of spatial and categorical priors, resulting in better scene layout and adherence to semantic masks. Quantitative analysis demonstrates that while improving quality, the introduction of priors has a minimal impact on the diversity of generated images. The performance on the COCO-Stuff dataset is on par with leading methods but not significantly better, potentially due to the dataset's diverse spatial resolutions. Future research could explore incorporating correlations between spatial tokens and classes in the joint prior for potential further improvements. semantic image synthesis, diffusion models, noise priors, controlnet, photo-realistic image generation
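A categorical noise prior of this kind can be pictured as class-conditioned statistics gathered from noised training latents and then sampled at inference in place of a standard normal. The sketch below is a hedged illustration under that assumption; the exact statistics SCP-Diff uses (and its spatial and joint counterparts) are not reproduced here.

```python
import torch

def build_categorical_prior(latents, masks, num_classes):
    # latents: (N, C, h, w) noised image latents; masks: (N, h, w) integer class maps.
    mean = torch.zeros(num_classes, latents.shape[1])
    var = torch.ones(num_classes, latents.shape[1])
    for c in range(num_classes):
        sel = latents.permute(0, 2, 3, 1)[masks == c]   # (M, C) latents of class c
        if len(sel) > 1:
            mean[c], var[c] = sel.mean(0), sel.var(0)
    return mean, var

def sample_categorical_prior(mask, mean, var):
    # Start inference from class-conditioned noise instead of a standard normal prior.
    m, v = mean[mask], var[mask]                         # (h, w, C) each
    return (m + v.sqrt() * torch.randn_like(m)).permute(2, 0, 1)
```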
2403.09632 Report Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image Yiqun Mei, Yu Zeng, He Zhang, Zhixin Shu, Xuaner Zhang, Sai Bi, Jianming Zhang, HyunJoon Jung, Vishal M. Patel At the core of portrait photography is the search for ideal lighting and viewpoint. The process often requires advanced knowledge in photography and an elaborate studio setup. In this work, we propose Holo-Relighting, a volumetric relighting method that is capable of synthesizing novel viewpoints, and novel lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN (EG3D) to reconstruct geometry and appearance from an input portrait as a set of 3D-aware features. We design a relighting module conditioned on a given lighting to process these features, and predict a relit 3D representation in the form of a tri-plane, which can render to an arbitrary viewpoint through volume rendering. Besides viewpoint and lighting control, Holo-Relighting also takes the head pose as a condition to enable head-pose-dependent lighting effects. With these novel designs, Holo-Relighting can generate complex non-Lambertian lighting effects (e.g., specular highlights and cast shadows) without using any explicit physical lighting priors. We train Holo-Relighting with data captured with a light stage, and propose two data-rendering techniques to improve the data quality for training the volumetric relighting system. Through quantitative and qualitative experiments, we demonstrate Holo-Relighting can achieve state-of-the-arts relighting quality with better photorealism, 3D consistency and controllability. This paper presents Holo-Relighting, a novel volumetric relighting method for headshot portraits that allows for controlling lighting, viewpoint, and head pose from a single image. Existing portrait relighting methods often lack view consistency or rely on simplified lighting models, limiting their expressiveness and realism. Holo-Relighting addresses these limitations, enabling more realistic and controllable portrait editing. The method leverages a pre-trained 3D GAN (EG3D) to extract 3D information from the input. It then employs a relighting network conditioned on the target lighting, head pose, and camera pose to generate a 3D representation (tri-plane features) with embedded illumination. Finally, it synthesizes novel view images through volume rendering. Holo-Relighting achieves state-of-the-art results on both free-view and 2D portrait relighting tasks, outperforming existing methods in perceptual quality, fidelity, and identity preservation. The method demonstrates strong controllability, allowing for realistic manipulation of lighting direction and intensity, head pose, and viewpoint, as well as achieving effects like shadow diffusion. The authors introduce novel data rendering techniques, including multi-view GAN inversion and portrait shading transfer, which improve the accuracy of 3D geometry encoding and contribute to the high quality of the relighting results. The current method is trained on headshot portraits and may not generalize well to full-body images. Future work can explore incorporating dynamic details like hair movement and facial expressions to enhance realism further. volumetric relighting, portrait editing, 3d gan, gan inversion, view synthesis
2403.09626 Report Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: https://github.com/OpenGVLab/video-mamba-suite. This paper presents a comprehensive study exploring the potential of State Space Models (SSMs), particularly Mamba, as a viable alternative to Transformers for video understanding tasks. SSMs, especially Mamba, offer linear scaling with sequence length, making them potentially more efficient for video modeling compared to Transformers. However, their effectiveness in video understanding remains largely unexplored. The authors introduce "Video Mamba Suite," comprising 14 SSM models/modules, and evaluate their performance on 12 video understanding tasks across 13 datasets. They explore four distinct roles of Mamba in video modeling: temporal models, temporal modules, multi-modal interaction models, and space-time sequence models. Mamba-based models demonstrate competitive or superior performance compared to Transformer counterparts across various video understanding tasks, including temporal action localization, temporal action segmentation, dense video captioning, action anticipation, and video temporal grounding. Mamba exhibits strong capabilities in modeling long video sequences, evidenced by its superior performance in long-form video question answering. Mamba models offer computational efficiency advantages over Transformers, particularly when processing videos with a large number of frames. The study primarily focuses on replacing Transformer blocks with Mamba blocks, leaving the exploration of SSM-based module designs for video understanding as future work. Further investigation is needed to optimize the integration of SSMs, especially in multi-modal settings, where hyperparameter tuning can impact performance. video understanding, state space model, mamba, video modeling, temporal action localization
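When Mamba serves as a drop-in temporal model, the basic pattern is to replace a temporal attention block with (often bidirectional) selective-scan blocks over the frame axis. A hedged sketch using the `mamba_ssm` package follows; the constructor arguments track its public README and may differ across versions, and this is not the suite's exact block design.

```python
import torch
from torch import nn
from mamba_ssm import Mamba  # constructor arguments per the mamba_ssm README

class TemporalMambaBlock(nn.Module):
    """Replaces a temporal self-attention block with a bidirectional Mamba scan."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)
        self.bwd = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)

    def forward(self, x):                 # x: (batch, frames, dim)
        h = self.norm(x)
        # Scan forward in time, plus a reversed scan flipped back into order.
        return x + self.fwd(h) + self.bwd(h.flip(1)).flip(1)
```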
2403.09625 Report Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation Fangfu Liu, Hanyang Wang, Weiliang Chen, Haowen Sun, Yueqi Duan Recent years have witnessed the strong power of 3D generation models, which offer a new level of creative flexibility by allowing users to guide the 3D content generation process through a single image or natural language. However, it remains challenging for existing 3D generation methods to create subject-driven 3D content across diverse prompts. In this paper, we introduce a novel 3D customization method, dubbed Make-Your-3D that can personalize high-fidelity and consistent 3D content from only a single image of a subject with text description within 5 minutes. Our key insight is to harmonize the distributions of a multi-view diffusion model and an identity-specific 2D generative model, aligning them with the distribution of the desired 3D subject. Specifically, we design a co-evolution framework to reduce the variance of distributions, where each model undergoes a process of learning from the other through identity-aware optimization and subject-prior optimization, respectively. Extensive experiments demonstrate that our method can produce high-quality, consistent, and subject-specific 3D content with text-driven modifications that are unseen in subject image. Presents Make-Your-3D, a novel co-evolution framework for fast and consistent subject-driven 3D content generation from a single image. Addresses the limitations of existing 3D generation methods in creating subject-specific content with text-driven modifications, enabling diverse and personalized 3D asset creation. Harmonizes the distributions of a 2D personalized model and a multi-view diffusion model with the target subject's distribution through identity-aware and subject-prior optimization. Generates high-fidelity 3D content with strong subject identity preservation and text-driven modifications. Achieves significantly faster generation speed (5 minutes) compared to previous methods (3 hours). Demonstrates robustness in open-vocabulary settings and surpasses baselines in qualitative and quantitative evaluations, including user studies. Current quality is limited by the backbone model (Stable Diffusion v1.5), which can be improved by using larger diffusion models like SDXL. Future work will explore 3D scene-level personalization. 3d generation, personalization, co-evolution, diffusion models, one-shot learning
2403.09623 Report Score-Guided Diffusion for 3D Human Recovery Anastasis Stathopoulos, Ligong Han, Dimitris Metaxas We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction. These inverse problems involve fitting a human body model to image observations, traditionally solved through optimization techniques. ScoreHMR mimics model fitting approaches, but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. The diffusion model is trained to capture the conditional distribution of the human model parameters given an input image. By guiding its denoising process with a task-specific score, ScoreHMR effectively solves inverse problems for various applications without the need for retraining the task-agnostic diffusion model. We evaluate our approach on three settings/applications. These are: (i) single-frame model fitting; (ii) reconstruction from multiple uncalibrated views; (iii) reconstructing humans in video sequences. ScoreHMR consistently outperforms all optimization baselines on popular benchmarks across all settings. We make our code and models available at the https://statho.github.io/ScoreHMR. This paper presents ScoreHMR, a method that uses diffusion models and score guidance to refine 3D human pose estimations from images and videos. Current methods for 3D human pose estimation, based on either regression or optimization, struggle to achieve both accuracy and image-model alignment. This work leverages the power of diffusion models to learn priors over human poses and use score guidance for more accurate and robust refinement. ScoreHMR utilizes a diffusion model trained on a dataset of human poses conditioned on images. Given an initial pose estimate from a regression network, it iteratively refines the pose in the latent space of the diffusion model using score guidance derived from image observations like 2D keypoints, multi-view consistency, or temporal smoothness. ScoreHMR outperforms existing optimization-based methods for fitting a 3D human body model to 2D keypoint detections on 3DPW and EMDB datasets. It effectively refines multi-view predictions by enforcing cross-view consistency, achieving superior results compared to single-view reconstruction and optimization-based methods on Human3.6M and Mannequin Challenge datasets. ScoreHMR significantly improves the temporal consistency of human motion in video sequences, leading to lower acceleration errors and smoother reconstructions on 3DPW and EMDB datasets. The reliance on pseudo-ground-truth pose annotations for training the diffusion model might limit the performance, especially for unusual poses not well represented in the training data. The current implementation primarily focuses on refining the pose parameters of the SMPL model, and future work could explore extending ScoreHMR to jointly model and refine both pose and shape parameters. 3d human pose estimation, diffusion models, score guidance, human mesh recovery, multi-view refinement
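The guidance mechanism is classifier-guidance-style: the gradient of a task objective (e.g., 2D keypoint reprojection error) with respect to the noisy latent steers each denoising step. The pseudocode-level sketch below is heavily hedged; `diffusion.predict_x0`, `diffusion.ddim_step`, and `guidance_loss` are hypothetical interfaces, not ScoreHMR's API.

```python
import torch

def guided_denoise_step(x_t, t, diffusion, guidance_loss, scale=1.0):
    # One reverse step where the model update is combined with the gradient of a
    # task-specific objective evaluated on the predicted clean sample.
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = diffusion.predict_x0(x_t, t)           # hypothetical model interface
    loss = guidance_loss(x0_pred)                    # e.g., keypoint reprojection error
    grad, = torch.autograd.grad(loss, x_t)
    x_prev = diffusion.ddim_step(x_t, x0_pred, t)    # hypothetical deterministic step
    return (x_prev - scale * grad).detach()
```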
2403.09622 Report Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder using a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than 20% to nearly 90% on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, through fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we showcase a substantial improvement in scene text rendering capabilities in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks. This paper introduces Glyph-ByT5, a customized text encoder designed for generating accurate visual text in diffusion models, leading to the development of Glyph-SDXL for text-rich design images and scene text rendering. Accurate text rendering is crucial for various image generation applications, ranging from design materials to real-world scenes, and existing models often struggle with this task. The authors create a scalable glyph-text dataset using graphic rendering, pre-train ByT5 on this dataset for glyph-text alignment, and integrate it into SDXL with a region-wise cross-attention mechanism. Glyph-SDXL significantly outperforms commercial products and state-of-the-art models in design-text rendering accuracy. The model achieves high spelling accuracy for paragraphs with automated multi-line layout. Fine-tuning Glyph-SDXL on a hybrid design-to-scene dataset improves scene-text generation. The layout planning with GPT-4, while promising, still faces challenges in certain scenarios. Future work includes expanding the dataset and exploring more advanced vision encoders. text rendering, diffusion models, text encoder, glyph-byt5, sdxl
2403.09620 Report PosSAM: Panoptic Open-vocabulary Segment Anything Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli In this paper, we introduce an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) with the vision-language CLIP model in an end-to-end framework. While SAM excels in generating spatially-aware masks, its decoder falls short in recognizing object class information and tends to oversegment without additional guidance. Existing approaches address this limitation by using multi-stage techniques and employing separate models to generate class-aware prompts, such as bounding boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model which leverages SAM's spatially rich features to produce instance-aware masks and harnesses CLIP's semantically discriminative features for effective instance classification. Specifically, we address the limitations of SAM and propose a novel Local Discriminative Pooling (LDP) module leveraging class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary classification. Furthermore, we introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image. We conducted extensive experiments to demonstrate our method's strong generalization properties across multiple datasets, achieving state-of-the-art performance with substantial improvements over SOTA open-vocabulary panoptic segmentation methods. In both COCO to ADE20K and ADE20K to COCO settings, PosSAM outperforms the previous state-of-the-art methods by a large margin, 2.4 PQ and 4.6 PQ, respectively. Project Website: https://vibashan.github.io/possam-web/. Introduces PosSAM, an open-vocabulary panoptic segmentation model unifying Segment Anything Model (SAM) with CLIP for end-to-end instance-aware mask generation and classification. Addresses limitations of SAM, which excels in class-agnostic masks but lacks instance and class awareness, hindering its use in open-vocabulary segmentation tasks. Leverages SAM's spatial features for mask generation, CLIP for semantic features, introduces a Local Discriminative Pooling (LDP) module for unbiased classification, and employs Mask-Aware Selective Ensembling (MASE) for robust inference. Achieves state-of-the-art performance on COCO to ADE20K and ADE20K to COCO zero-shot open-vocabulary panoptic segmentation, outperforming previous methods by a large margin. Demonstrates strong generalization to unseen object categories, effectively segmenting novel objects with high accuracy. Outperforms existing methods in open-vocabulary semantic segmentation tasks, highlighting its adaptability to diverse challenges. Reliance on CLIP backbone for semantic features limits potential for single, unified architecture. Future work could explore integrating spatial and semantic awareness within a single backbone for improved efficiency and performance. open-vocabulary segmentation, panoptic segmentation, segment anything model (sam), clip, local discriminative pooling
2403.09616 Report Explore In-Context Segmentation via Latent Diffusion Models Chaoyang Wang, Xiangtai Li, Henghui Ding, Lu Qi, Jiangning Zhang, Yunhai Tong, Chen Change Loy, Shuicheng Yan In-context segmentation has drawn more attention with the introduction of vision foundation models. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. In this work, we explore this problem from a new perspective, using one representative generation model, the latent diffusion model (LDM). We observe a task gap between generation and segmentation in diffusion models, but LDM is still an effective minimalist for in-context segmentation. In particular, we propose two meta-architectures and correspondingly design several output alignment and optimization strategies. We have conducted comprehensive ablation studies and empirically found that the segmentation quality counts on output alignment and in-context instructions. Moreover, we build a new and fair in-context segmentation benchmark that includes both image and video datasets. Experiments validate the efficiency of our approach, demonstrating comparable or even stronger results than previous specialist models or visual foundation models. Our study shows that LDMs can also achieve good enough results for challenging in-context segmentation tasks. This paper explores the potential of Latent Diffusion Models (LDMs) for in-context segmentation by proposing a minimalist LDM-based framework (Ref LDM-Seg) that uses visual prompts for guidance without relying on additional neural networks. This research is significant because it offers a novel perspective on in-context segmentation by leveraging the generative capabilities of LDMs, unlike traditional discriminative models or masked image modeling techniques. The authors propose two meta-architectures for Ref LDM-Seg, incorporating instruction extraction from visual prompts, output alignment strategies to bridge the gap between image and mask channels, and optimization methods in both pixel and latent spaces. LDMs, despite being designed for generation, can effectively perform in-context segmentation with promising results. Visual prompts and output alignment are crucial for LDM-based segmentation, determining the success and quality of segmentation, respectively. Ref LDM-Seg achieves comparable or even better performance than existing specialist models and generalist vision foundation models on a proposed in-context segmentation benchmark. The current work is limited by the scale of training data used, which could be addressed by scaling up training data and model parameters in the future. Future research could explore advanced prompt encoder architectures and prompt engineering methods to further improve performance. in-context segmentation, latent diffusion model, visual prompt, few-shot learning, computer vision
2403.09593 Report Renovating Names in Open-Vocabulary Segmentation Benchmarks Haiwen Huang, Songyou Peng, Dan Zhang, Andreas Geiger Names are essential to both human cognition and vision-language models. Open-vocabulary models utilize class names as text prompts to generalize to categories unseen during training. However, name qualities are often overlooked and lack sufficient precision in existing datasets. In this paper, we address this underexplored problem by presenting a framework for "renovating" names in open-vocabulary segmentation benchmarks (RENOVATE). Through human study, we demonstrate that the names generated by our model are more precise descriptions of the visual segments and hence enhance the quality of existing datasets by means of simple renaming. We further demonstrate that using our renovated names enables training of stronger open-vocabulary segmentation models. Using open-vocabulary segmentation for name quality evaluation, we show that our renovated names lead to up to 16% relative improvement from the original names on various benchmarks across various state-of-the-art models. We provide our code and relabelings for several popular segmentation datasets (ADE20K, Cityscapes, PASCAL Context) to the research community. This paper presents RENOVATE, a framework for improving the quality of class names in open-vocabulary segmentation benchmarks by leveraging foundation models to generate more precise and contextually relevant names. Existing open-vocabulary segmentation models struggle with imprecise names in benchmarks, hindering their ability to generalize to novel categories and leading to inaccurate model evaluation. RENOVATE addresses this issue by providing a scalable, principled approach to renaming. RENOVATE first uses an image captioning model and GPT-4 to generate a pool of candidate names enriched with contextual information. It then trains a renaming model to select the best-matching name for each segment based on visual-language alignment. Human preference study confirms that RENOVATE names are preferred over original names in 82% of cases. Using RENOVATE names upgrades existing benchmarks by providing more fine-grained annotations, making them more challenging and realistic. Training open-vocabulary models with RENOVATE names improves their performance on both source and target datasets, highlighting the importance of precise names for generalization. RENOVATE's reliance on foundation models could propagate existing biases into the new names, requiring careful verification in critical applications. The exploration of design choices is not yet exhaustive, with potential for investigating alternative language models and VLM backbones for further improvement. vision-language models, open-vocabulary segmentation, dataset renaming, benchmark upgrading, name quality evaluation
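As a rough illustration of the renaming step in RENOVATE, the sketch below scores candidate names against a segment's visual embedding by cosine similarity and keeps the best match. `segment_feat` and `text_encoder` are hypothetical stand-ins for a vision-language model's outputs; the paper's actual renaming model is trained with visual-language alignment rather than used zero-shot as here.

```python
import torch
import torch.nn.functional as F

def pick_best_name(segment_feat, candidate_names, text_encoder):
    """Minimal sketch: rank GPT-generated candidate names for one segment by
    cosine similarity between the segment's visual embedding (D,) and each
    name's text embedding (N, D), then keep the argmax."""
    text_feats = F.normalize(text_encoder(candidate_names), dim=-1)  # (N, D)
    seg = F.normalize(segment_feat, dim=-1)                          # (D,)
    scores = text_feats @ seg                                        # cosine similarities
    best = int(torch.argmax(scores))
    return candidate_names[best], scores[best].item()
```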
2403.09439 Report 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation Frank Zhang, Yibo Zhang, Quan Zheng, Rui Ma, Wei Hua, Hujun Bao, Weiwei Xu, Changqing Zou Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevent the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency. This paper introduces 3D-SceneDreamer, a novel framework for generating 3D scenes from text prompts while ensuring consistency across multiple views. Existing text-to-3D methods struggle to maintain consistency, especially in complex outdoor scenes, due to reliance on error-prone depth estimation and lack of global 3D understanding. The method uses a tri-planar feature-based NeRF for global 3D representation, progressively optimized through an incremental training strategy. A 3D-aware generative model refines novel views, leveraging pre-trained diffusion models. Outperforms state-of-the-art text-to-scene methods in visual quality and 3D consistency. Successfully generates diverse indoor, outdoor, and unreal scenes with arbitrary camera trajectories. Reconstructs high-quality 3D meshes and point clouds, demonstrating superior 3D consistency. Computationally intensive due to continuous optimization of the 3D representation and new content generation. Future work could explore incorporating 3D Gaussian Splatting for improved efficiency. text-to-3d, scene generation, neural radiance fields, diffusion models, 3d consistency
2403.09413 Report Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting Jaewoo Jung, Jisang Han, Honggyu An, Jiwon Kang, Seonghoon Park, Seungryong Kim 3D Gaussian splatting (3DGS) has recently demonstrated impressive capabilities in real-time novel view synthesis and 3D reconstruction. However, 3DGS heavily depends on the accurate initialization derived from Structure-from-Motion (SfM) methods. When trained with randomly initialized point clouds, 3DGS fails to maintain its ability to produce high-quality images, undergoing large performance drops of 4-5 dB in PSNR. Through extensive analysis of SfM initialization in the frequency domain and analysis of a 1D regression task with multiple 1D Gaussians, we propose a novel optimization strategy dubbed RAIN-GS (Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting), that successfully trains 3D Gaussians from random point clouds. We show the effectiveness of our strategy through quantitative and qualitative comparisons on multiple datasets, largely improving the performance in all settings. Our project page and code can be found at https://ku-cvlab.github.io/RAIN-GS. This paper introduces RAIN-GS, a novel optimization strategy for 3D Gaussian Splatting (3DGS) that eliminates the need for accurate point cloud initialization from SfM, enabling high-quality image rendering from randomly initialized point clouds. 3DGS heavily relies on accurate point cloud initialization derived from SfM, limiting its applicability in scenarios where SfM struggles, such as scenes with symmetry, specular properties, or limited views. RAIN-GS addresses this limitation, broadening 3DGS's applicability. RAIN-GS combines two key components: 1) sparse-large-variance (SLV) initialization, starting with fewer Gaussians with larger initial covariances, and 2) progressive Gaussian low-pass filtering during rendering, guiding the model to learn low-frequency components first and progressively refine with high-frequency details. RAIN-GS achieves state-of-the-art results on the Mip-NeRF360, Tanks & Temples, and Deep Blending datasets, outperforming existing methods even without SfM initialization. The strategy effectively reduces high-frequency artifacts and improves visual quality, as demonstrated in qualitative comparisons. Ablation studies validate the effectiveness of both SLV initialization and progressive Gaussian low-pass filtering. RAIN-GS might not fully capture high-frequency details in areas where the rendering loss cannot distinguish between coarse approximations and high-frequency distributions. The reliance on L1 rendering loss as the primary supervision signal might limit the method's ability to detect the need for further densification. 3d gaussian splatting, novel view synthesis, structure-from-motion, point cloud initialization, progressive gaussian low-pass filtering
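The two ingredients of RAIN-GS can be sketched compactly. The snippet below is an assumption-laden toy version: `slv_init` draws a few random points with deliberately large initial scales (sparse-large-variance initialization), and `progressive_lowpass` clamps projected 2D scales to a floor that shrinks linearly over training, so low frequencies are learned first. The actual schedule and filter details follow the paper's code, not this sketch.

```python
import torch

def slv_init(num_gaussians, scene_extent, device="cpu"):
    # Sketch of SLV initialization: few random points, each with a large
    # isotropic scale (covariance) proportional to the scene extent.
    means = (torch.rand(num_gaussians, 3, device=device) - 0.5) * scene_extent
    scales = torch.full((num_gaussians, 3), scene_extent * 0.5, device=device)
    return means, scales

def progressive_lowpass(scales_2d, step, total_steps, max_filter=10.0, min_filter=0.3):
    # Progressive Gaussian low-pass filter (assumed linear schedule): early in
    # training every projected Gaussian is forced to a large screen-space
    # footprint, so only low-frequency content can be expressed; the floor
    # shrinks as training proceeds, letting high frequencies in gradually.
    t = min(step / total_steps, 1.0)
    floor = max_filter + (min_filter - max_filter) * t
    return torch.clamp(scales_2d, min=floor)
```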
2403.09338 Report LocalMamba: Visual State Space Model with Windowed Selective Scan Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba. This paper introduces LocalMamba, a novel approach for vision state space models that leverages windowed selective scanning and scan direction search to enhance the capture of local dependencies within images while maintaining global contextual understanding. Existing vision state space models struggle to effectively capture local 2D dependencies in images due to the inherent non-causal nature of 2D spatial data and the causal processing framework of SSMs. This work addresses this limitation to improve the performance of vision SSMs. The paper introduces a local scanning strategy that divides images into distinct windows to better capture local dependencies. It also proposes a dynamic method to search for the optimal scan direction for each layer, further boosting performance. LocalMamba models significantly outperform previous state-of-the-art methods like Vim and VMamba on ImageNet classification, object detection, and semantic segmentation tasks. The proposed local scan mechanism effectively captures local dependencies, leading to improved performance even without scan direction search. The scan direction search method identifies optimal scanning configurations for each layer, further enhancing the model's ability to capture both local and global visual cues. The computational framework of SSMs is currently more complex than convolution or self-attention, potentially hindering efficient parallel computation. Current deep learning frameworks lack the same level of optimization for SSM computations as for more established architectures, limiting their speed. state space models, vision mamba, local scan, scan direction search, image recognition
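The local scanning strategy is essentially a re-ordering of tokens before the selective scan. The PyTorch sketch below shows one plausible way to produce the window-local 1D order; it is illustrative only and ignores the paper's additional scan directions and per-layer direction search.

```python
import torch

def local_window_scan(x, window=2):
    """Sketch of the local scan: flatten a (B, H, W, C) feature map so that
    tokens inside each non-overlapping window x window patch are adjacent in
    the 1D sequence, instead of the usual row-major flattening."""
    B, H, W, C = x.shape
    assert H % window == 0 and W % window == 0
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5)        # (B, nH, nW, window, window, C)
    return x.reshape(B, -1, C)             # window-local 1D token order

tokens = local_window_scan(torch.randn(1, 8, 8, 32), window=2)  # (1, 64, 32)
```

Compared with row-major flattening, spatially adjacent tokens stay close in the sequence, which is the property the paper argues the state space model needs to capture local 2D dependencies.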
2403.09334 Report Video Editing via Factorized Diffusion Distillation Uriel Singer, Amit Zohar, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, Yaniv Taigman We introduce Emu Video Edit (EVE), a model that establishes a new state-of-the-art in video editing without relying on any supervised video editing data. To develop EVE we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters. Introduces Emu Video Edit (EVE), a state-of-the-art video editing model trained without supervised video editing data by aligning a pretrained image editing adapter and a video generation adapter. Addresses the challenge of scarce supervised video editing data, which hinders the development of robust and versatile video editing models. Trains image editing and video generation adapters separately, then aligns them using a novel unsupervised distillation procedure called Factorized Diffusion Distillation (FDD), combining score distillation and adversarial losses. EVE achieves state-of-the-art results on the Text Guided Video Editing (TGVE) benchmark. The proposed method enables zero-shot video editing for tasks learned by the image editing adapter but not explicitly seen during alignment. Demonstrates generalization by aligning other adapter combinations, showing potential for personalized and stylized image editing. Model performance limited by the capabilities of individual teacher models. FDD is currently reliant on pre-trained adapters and cannot train them from scratch. video editing, diffusion models, adapter alignment, unsupervised learning, distillation
2403.09326 Report HeadEvolver: Text to Head Avatars via Locally Learnable Mesh Deformation Duotun Wang, Hengyu Meng, Zeyu Cai, Zhijing Shao, Qianxi Liu, Lin Wang, Mingming Fan, Ying Shan, Xiaohang Zhan, Zeyu Wang We present HeadEvolver, a novel framework to generate stylized head avatars from text guidance. HeadEvolver uses locally learnable mesh deformation from a template head mesh, producing high-quality digital assets for detail-preserving editing and animation. To tackle the challenges of lacking fine-grained and semantic-aware local shape control in global deformation through Jacobians, we introduce a trainable parameter as a weighting factor for the Jacobian at each triangle to adaptively change local shapes while maintaining global correspondences and facial features. Moreover, to ensure the coherence of the resulting shape and appearance from different viewpoints, we use pretrained image diffusion models for differentiable rendering with regularization terms to refine the deformation under text guidance. Extensive experiments demonstrate that our method can generate diverse head avatars with an articulated mesh that can be edited seamlessly in 3D graphics software, facilitating downstream applications such as more efficient animation with inherited blend shapes and semantic consistency. HeadEvolver, a novel framework for generating stylized 3D head avatars from text prompts using learnable local mesh deformations. Addresses limitations in existing text-to-3D avatar methods, particularly in achieving fine-grained semantic control over local shapes and ensuring compatibility with existing 3D graphics workflows. Deforms a template mesh by optimizing per-triangle weighted Jacobians guided by text prompts, leveraging stable diffusion models for differentiable rendering and regularization terms for shape fidelity. Generates high-quality head avatars with detailed facial features matching text descriptions. Preserves semantic correspondences and attributes of the template mesh, enabling smooth integration with animation and editing tools. Outperforms baseline methods in qualitative and quantitative comparisons, demonstrating superior mesh quality and text-alignment. Currently requires manifold mesh input and faces challenges in handling non-manifold structures like eyeballs. Future work includes exploring cage-based representations for broader mesh compatibility and developing methods for automatically adding accessories like hair and glasses. text-to-3d, avatar generation, mesh deformation, differentiable rendering, stable diffusion
2403.09281 Report CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification Yiming Ma, Victor Sanchez, Tanaya Guha The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting--a regression task--into a recognition task. In this paper, we investigate CLIP's potential in counting, focusing specifically on estimating crowd sizes. Existing classification-based crowd-counting methods have encountered issues, including inappropriate discretization strategies, which impede the application of CLIP and result in suboptimal performance. To address these challenges, we propose the Enhanced Blockwise Classification (EBC) framework. In contrast to previous methods, EBC relies on integer-valued bins that facilitate the learning of robust decision boundaries. Within our model-agnostic EBC framework, we introduce CLIP-EBC, the first fully CLIP-based crowd-counting model capable of generating density maps. Comprehensive evaluations across diverse crowd-counting datasets demonstrate the state-of-the-art performance of our methods. Particularly, EBC can improve existing models by up to 76.9%. Moreover, our CLIP-EBC model surpasses current crowd-counting methods, achieving mean absolute errors of 55.0 and 6.3 on ShanghaiTech part A and part B datasets, respectively. The code will be made publicly available. This paper introduces CLIP-EBC, the first fully CLIP-based crowd-counting model capable of generating density maps by reformulating counting as a blockwise classification problem. Existing crowd counting methods either struggle with the long-tail distribution of count values or fail to fully utilize the power of CLIP for density map estimation. The paper proposes an Enhanced Blockwise Classification (EBC) framework that leverages integer-valued bins for discretization, corrects noisy annotations in dense areas, and employs a Distance-Aware-Cross-Entropy (DACE) loss. Building on EBC, CLIP-EBC utilizes the CLIP architecture to extract image and text features, computing their similarity to generate probability maps and subsequently density maps. CLIP-EBC with ResNet backbone achieves state-of-the-art performance, surpassing existing methods on benchmarks like ShanghaiTech. EBC framework significantly improves the performance of existing regression-based methods like CSRNet and DMCount, showing up to 76.9% reduction in RMSE. Experiments confirm the benefits of dynamic bin granularity in EBC, balancing representative count value accuracy with increased sample size per bin. The paper primarily focuses on human counting, leaving the exploration of CLIP-EBC's capacity for counting other objects for future work. Potential ethical concerns regarding privacy and bias in crowd counting applications require further investigation. crowd counting, clip, density map estimation, blockwise classification, deep learning
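The blockwise-classification formulation can be summarized in a few lines: each image block predicts a categorical distribution over integer count bins, and the block's count is the expectation under that distribution. The sketch below assumes logits of shape (B, K, h, w) and a vector of K representative bin values; it omits the DACE loss and the CLIP image/text matching that would produce the logits in CLIP-EBC.

```python
import torch
import torch.nn.functional as F

def blockwise_expected_counts(logits, bin_values):
    """Sketch of enhanced blockwise classification: per-block softmax over
    integer-valued count bins, expected count per block, and the image-level
    crowd count as the sum over blocks.
    logits: (B, K, h, w); bin_values: (K,) e.g. tensor([0., 1., 2., ...])."""
    probs = F.softmax(logits, dim=1)                          # (B, K, h, w)
    counts = (probs * bin_values.view(1, -1, 1, 1)).sum(dim=1)  # per-block expected count
    total = counts.flatten(1).sum(dim=1)                      # (B,) image-level count
    return counts, total
```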
2403.09195 Report SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration Yanfei Song, Bangzheng Pu, Peng Wang, Hongxu Jiang, Dong Dong, Yongxiang Cao, Yiqing Shen Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to its zero-shot generalization ability. However, a broader application of SAMs to real-world practice has been restricted by their low inference speed and high computational memory demands, which mainly stem from the attention mechanism. Existing work concentrated on optimizing the encoder, yet has not adequately addressed the inefficiency of the attention mechanism itself, even when distilled to a smaller model, which thus leaves space for further improvement. In response, we introduce SAM-Lightening, a variant of SAM, that features a re-engineered attention mechanism, termed Dilated Flash Attention. It not only facilitates higher parallelism, enhancing processing efficiency, but also retains compatibility with the existing FlashAttention. Correspondingly, we propose a progressive distillation to enable an efficient knowledge transfer from the vanilla SAM without costly training from scratch. Experiments on COCO and LVIS reveal that SAM-Lightening significantly outperforms the state-of-the-art methods in both run-time efficiency and segmentation accuracy. Specifically, it can achieve an inference speed of 7 milliseconds (ms) per image, for images of size 1024*1024 pixels, which is 30.1 times faster than the vanilla SAM and 2.1 times faster than the state-of-the-art. Moreover, it takes only 244MB of memory, which is 3.5% of the vanilla SAM's. The code and weights are available at https://anonymous.4open.science/r/SAM-LIGHTENING-BC25/. This paper introduces SAM-Lightening, a lightweight version of the Segment Anything Model (SAM) that achieves a 30x speedup in inference while maintaining segmentation accuracy. The original SAM, while powerful, suffers from slow inference speeds and high computational demands, limiting its practical application in areas like AR and mobile deployment. The authors achieve this by replacing the attention mechanism in SAM's image encoder with a novel Dilated Flash Attention mechanism and employing a dynamic layer-wise distillation technique for efficient knowledge transfer from the original SAM. SAM-Lightening achieves an inference speed of 7 milliseconds per image for 1024x1024 resolution, outperforming prior state-of-the-art methods. It significantly reduces memory consumption, requiring only 3.5% of the memory used by the original SAM. The model maintains comparable segmentation accuracy to the original SAM, even on complex datasets like LVIS. The impact of FlashAttention on inference speed is dependent on hardware and input size, sometimes resulting in slightly slower inference. Future work could explore integrating pruning and quantization techniques for further optimization. segment anything model, knowledge distillation, efficient attention mechanisms, image segmentation, real-time processing
2403.09176 Report Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts Byeongjun Park, Hyojun Go, Jin-Young Kim, Sangmin Woo, Seokil Ham, Changick Kim Diffusion models have achieved remarkable success across a range of generative tasks. Recent efforts to enhance diffusion model architectures have reimagined them as a form of multi-task learning, where each task corresponds to a denoising task at a specific noise level. While these efforts have focused on parameter isolation and task routing, they fall short of capturing detailed inter-task relationships and risk losing semantic information, respectively. In response, we introduce Switch Diffusion Transformer (Switch-DiT), which establishes inter-task relationships between conflicting tasks without compromising semantic information. To achieve this, we employ a sparse mixture-of-experts within each transformer block to utilize semantic information and facilitate handling conflicts in tasks through parameter isolation. Additionally, we propose a diffusion prior loss, encouraging similar tasks to share their denoising paths while isolating conflicting ones. Through these, each transformer block contains a shared expert across all tasks, where the common and task-specific denoising paths enable the diffusion model to construct its beneficial way of synergizing denoising tasks. Extensive experiments validate the effectiveness of our approach in improving both image quality and convergence rate, and further analysis demonstrates that Switch-DiT constructs tailored denoising paths across various generation scenarios. This paper introduces Switch-DiT, a novel diffusion model architecture that improves image generation quality and training convergence by synergizing denoising tasks through a Sparse Mixture-of-Experts (SMoE) approach. Existing diffusion models struggle to efficiently handle conflicting optimization directions among denoising tasks across different noise levels, leading to slow convergence and potentially lower image quality. Switch-DiT integrates SMoE layers into each transformer block, using a timestep-based gating network to isolate parameters between conflicting tasks while sharing information through common denoising paths. It also introduces a diffusion prior loss to stabilize training and enforce inter-task relationships. Switch-DiT consistently outperforms baseline DiT and DTR models in terms of FID, IS, Precision, and Recall across different model sizes on FFHQ and ImageNet datasets. It achieves faster convergence rates compared to baselines, indicating more efficient diffusion training. Analysis reveals that Switch-DiT constructs tailored denoising paths based on model size and dataset, demonstrating its adaptability to different generation scenarios. The current implementation employs a fixed routing policy inherited from DTR, potentially limiting its ability to fully capture nuanced inter-task relationships. Future work includes exploring scalable SMoE configurations and adaptive routing policies tailored to specific generation scenarios to further enhance performance. diffusion models, mixture-of-experts, multi-task learning, image generation, transformer
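A minimal sparse mixture-of-experts layer with timestep-conditioned gating, roughly in the spirit of Switch-DiT, might look like the sketch below. It is a simplified stand-in (linear experts, a naive per-sample routing loop, no diffusion prior loss or load balancing), not the paper's transformer block; `x` is a (B, dim) feature and `t_emb` a (B, dim) timestep embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimestepSMoE(nn.Module):
    """Toy SMoE block: one shared expert used by every denoising task plus a
    few routed experts picked by a gate that looks only at the timestep
    embedding, so different noise levels can take different parameter paths."""
    def __init__(self, dim, num_experts=4, top_k=1):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x, t_emb):
        weights = F.softmax(self.gate(t_emb), dim=-1)      # (B, E) routing weights
        topw, topi = weights.topk(self.top_k, dim=-1)      # keep only top-k experts
        routed = []
        for b in range(x.size(0)):                         # clarity over speed
            y = sum(topw[b, k] * self.experts[int(topi[b, k])](x[b])
                    for k in range(self.top_k))
            routed.append(y)
        return self.shared(x) + torch.stack(routed)        # common + task-specific path
```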
2403.09140 Report Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior Cheng Chen, Xiaofeng Yang, Fan Yang, Chengzeng Feng, Zhoujie Fu, Chuan-Sheng Foo, Guosheng Lin, Fayao Liu Recent works on text-to-3d generation show that using only 2D diffusion supervision for 3D generation tends to produce results with inconsistent appearances (e.g., faces on the back view) and inaccurate shapes (e.g., animals with extra legs). Existing methods mainly address this issue by retraining diffusion models with images rendered from 3D data to ensure multi-view consistency while struggling to balance 2D generation quality with 3D consistency. In this paper, we present a new framework Sculpt3D that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without re-training the 2D diffusion model. Specifically, we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoints supervision through a sparse ray sampling approach. Moreover, to ensure accurate appearances of different views, we further modulate the output of the 2D diffusion model to the correct patterns of the template views without altering the generated object's style. These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model. Extensive experiments show our method can largely improve the multi-view consistency while retaining fidelity and diversity. Our project page is available at: https://stellarcheng.github.io/Sculpt3D/. Sculpt3D, a novel text-to-3D generation framework, explicitly integrates 3D shape and appearance priors from retrieved reference objects to enhance multi-view consistency without retraining the 2D diffusion model. Existing text-to-3D methods often produce inconsistent appearances and inaccurate shapes due to relying solely on 2D diffusion supervision. Sculpt3D addresses this by effectively leveraging 3D priors while preserving the high quality of 2D diffusion models. Sculpt3D retrieves semantically matching 3D templates and utilizes them in two ways: 1) Sparse keypoint supervision from the template guides 3D shape generation, allowing creative point growth and pruning during optimization. 2) An image adapter aligns the template's appearance with the generated object's style, then modulates the 2D diffusion output to correct appearance inconsistencies across views. Sculpt3D generates high-fidelity 3D objects with superior multi-view consistency compared to previous state-of-the-art methods. The sparse keypoint supervision enables Sculpt3D to produce diverse shapes that adapt to the template while retaining the 2D diffusion model's creative freedom. The appearance modulation effectively corrects view-specific inconsistencies without altering the overall style or geometry of the generated object. Sculpt3D's reliance on 3D priors can be limiting if the initial retrieved shape falls outside the scope of the dataset. Generating accurate initial shapes for retrieval remains challenging and presents an area for future improvement. text-to-3d generation, multi-view consistency, 3d prior, retrieval augmentation, diffusion models
2403.09093 Report Desigen: A Pipeline for Controllable Design Template Generation Haohan Weng, Danqing Huang, Yu Qiao, Zheng Hu, Chin-Yew Lin, Tong Zhang, C. L. Philip Chen Templates serve as a good starting point to implement a design (e.g., banner, slide) but it takes great effort from designers to manually create. In this paper, we present Desigen, an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background. Different from natural images, a background image should preserve enough non-salient space for the overlaying layout elements. To equip existing advanced diffusion-based models with stronger spatial control, we propose two simple but effective techniques to constrain the saliency distribution and reduce the attention weight in desired regions during the background generation process. Then conditioned on the background, we synthesize the layout with a Transformer-based autoregressive generator. To achieve a more harmonious composition, we propose an iterative inference strategy to adjust the synthesized background and layout in multiple rounds. We constructed a design dataset with more than 40k advertisement banners to verify our approach. Extensive experiments demonstrate that the proposed pipeline generates high-quality templates comparable to human designers. More than a single-page design, we further show an application of presentation generation that outputs a set of theme-consistent slides. The data and code are available at https://whaohan.github.io/desigen. Presents "Desigen", an automatic design template creation pipeline that generates both background images and harmonious layout elements using text descriptions and layout specifications. Automates the laborious process of manual design template creation, enabling efficient generation of visually appealing and accessible designs. Utilizes a diffusion-based background generator with spatial control mechanisms (salient attention constraint and attention reduction), followed by a Transformer-based layout generator. An iterative inference strategy refines both background and layout for harmonious composition. Generates backgrounds with significantly lower salient ratios compared to baseline T2I models, indicating more space for layout elements. Synthesizes layouts that achieve superior visual accessibility (lower occlusion with backgrounds) while maintaining good alignment and minimal overlap. Demonstrates the capability to generate theme-consistent slide decks by varying layout masks while maintaining relevant background content. Current implementation primarily focuses on simple layouts with a limited number of elements. Further exploration of incorporating graphic design principles for enhanced aesthetics and usability. design template generation, text-to-image synthesis, layout generation, spatial control, diffusion models
2403.09065 Report When Semantic Segmentation Meets Frequency Aliasing Linwei Chen, Lin Gu, Ying Fu Despite recent advancements in semantic segmentation, where and what pixels are hard to segment remains largely unexplored. Existing research only separates an image into easy and hard regions and empirically observes the latter are associated with object boundaries. In this paper, we conduct a comprehensive analysis of hard pixel errors, categorizing them into three types: false responses, merging mistakes, and displacements. Our findings reveal a quantitative association between hard pixels and aliasing, which is distortion caused by the overlapping of frequency components in the Fourier domain during downsampling. To identify the frequencies responsible for aliasing, we propose using the equivalent sampling rate to calculate the Nyquist frequency, which marks the threshold for aliasing. Then, we introduce the aliasing score as a metric to quantify the extent of aliasing. While positively correlated with the proposed aliasing score, three types of hard pixels exhibit different patterns. Here, we propose two novel de-aliasing filter (DAF) and frequency mixing (FreqMix) modules to alleviate aliasing degradation by accurately removing or adjusting frequencies higher than the Nyquist frequency. The DAF precisely removes the frequencies responsible for aliasing before downsampling, while the FreqMix dynamically selects high-frequency components within the encoder block. Experimental results demonstrate consistent improvements in semantic segmentation and low-light instance segmentation tasks. The code is available at: https://github.com/Linwei-Chen/Seg-Aliasing. This paper analyzes the phenomenon of aliasing in semantic segmentation and proposes two novel modules to address it: the de-aliasing filter (DAF) and the frequency mixing module (FreqMix). Aliasing, a signal distortion arising from undersampling, poses significant challenges in semantic segmentation by hindering accurate boundary prediction. This paper aims to understand and mitigate this issue. The paper introduces the concept of equivalent sampling rate (ESR) to accurately calculate the Nyquist frequency and quantifies aliasing levels using an 'aliasing score'. It proposes DAF to remove aliasing frequencies and FreqMix to dynamically balance low and high-frequency components during feature extraction. The study reveals a strong positive correlation between hard-to-segment pixels and the proposed aliasing score. DAF, by accurately removing aliasing frequencies, consistently improves segmentation accuracy compared to traditional blur filters. FreqMix further enhances performance by dynamically balancing frequency components within the encoder block. The equivalent sampling rate is a heuristic estimation and lacks rigorous theoretical guarantees. The research focuses on semantic segmentation, leaving its application to instance and panoptic segmentation unexplored. semantic segmentation, aliasing, de-aliasing filter, frequency mixing, hard pixels
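The aliasing analysis boils down to asking how much spectral energy sits above the Nyquist frequency implied by downsampling. The snippet below is a rough, assumed version of such a score using a plain 2D FFT; the paper's aliasing score is defined with its equivalent sampling rate, which this sketch does not model.

```python
import torch

def aliasing_score(feat, downsample=2):
    """Rough sketch of an aliasing measure (not the paper's exact metric):
    fraction of spectral energy above the Nyquist frequency implied by a
    `downsample`-fold reduction. `feat` is a (H, W) feature or response map."""
    H, W = feat.shape
    power = torch.fft.fftshift(torch.fft.fft2(feat)).abs() ** 2
    fy = torch.fft.fftshift(torch.fft.fftfreq(H)).view(-1, 1)   # cycles / pixel
    fx = torch.fft.fftshift(torch.fft.fftfreq(W)).view(1, -1)
    nyquist = 0.5 / downsample        # highest frequency still representable after downsampling
    high = (fy.abs() > nyquist) | (fx.abs() > nyquist)
    return (power[high].sum() / power.sum()).item()
```

Conceptually, the proposed DAF removes exactly the frequencies that this score flags before downsampling, while FreqMix re-weights low- and high-frequency components inside the encoder.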
2403.09055 Report StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control Jaerin Lee, Daniel Sungho Jung, Kanggeon Lee, Kyoung Mu Lee The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing. Previous works have focused on improving the usability of diffusion models by reducing the inference time or increasing user interactivity by allowing new, fine-grained controls such as region-based text prompts. However, we empirically find that integrating both branches of works is nontrivial, limiting the potential of diffusion models. To solve this incompatibility, we present StreamMultiDiffusion, the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, we achieve $\times 10$ faster panorama generation than existing solutions, and the generation speed of 1.57 FPS in region-based text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a new paradigm for interactive image generation named semantic palette, where high-quality images are generated in real-time from given multiple hand-drawn regions, encoding prescribed semantic meanings (e.g., eagle, girl). Our code and demo application are available at https://github.com/ironjr/StreamMultiDiffusion. This paper introduces StreamMultiDiffusion, the first real-time region-based text-to-image generation framework achieving a generation speed of 1.57 FPS on a single RTX 2080 Ti GPU. Existing diffusion models struggle to simultaneously achieve fast inference and fine-grained control, limiting their real-world applicability. This framework aims to overcome this limitation by enabling real-time interactive image generation with region-based text prompts. The paper stabilizes fast inference techniques like Latent Consistency Models (LCM) and restructures MultiDiffusion into a novel multi-prompt stream batch architecture. This pipeline processes multiple image regions with different text prompts concurrently, hiding latency and maximizing throughput. StreamMultiDiffusion achieves x10 faster panorama generation compared to existing solutions. The framework stabilizes fast sampling in region-based generation, improving compatibility between LCM and MultiDiffusion. It introduces "semantic palette," a novel interactive image generation paradigm where users "paint" images in real-time using text prompts as brushes. The current implementation still relies on a small number (4-6) of denoising steps. Achieving perfect mask-tight image synthesis remains a challenge despite improved fidelity with one-step white background bootstrapping. diffusion models, real-time image generation, region-based image synthesis, interactive image editing, semantic palette
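The region-based control inherits MultiDiffusion's per-region blending, which is easy to sketch: each prompt denoises its own copy of the latent, and the copies are averaged per pixel with the region masks as weights. The snippet below shows only that blending step under assumed tensor shapes; the multi-prompt stream batch scheduling and the LCM stabilization tricks are not reproduced here.

```python
import torch

def blend_region_latents(latents_per_prompt, masks, eps=1e-8):
    """Sketch of MultiDiffusion-style region blending.
    latents_per_prompt: (P, C, H, W) denoised latent per prompt.
    masks: (P, 1, H, W) soft region masks, one per prompt."""
    weights = masks / (masks.sum(dim=0, keepdim=True) + eps)  # normalize over prompts
    return (weights * latents_per_prompt).sum(dim=0)          # (C, H, W) blended latent
```

In the interactive "semantic palette" setting, the masks are the user's hand-drawn regions and this blend is applied at every denoising step.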
2403.08933 Report Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images Giuseppe Cartella, Vittorio Cuculo, Marcella Cornia, Rita Cucchiara Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, in our work, we leverage human semantic knowledge to investigate the possibility of being included in frameworks of fake image detection. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed observational pattern observed when viewing genuine images. Our dataset is publicly available at: https://github.com/aimagelab/unveiling-the-truth. This paper explores the differences in human gaze patterns when viewing real and partially manipulated images (created using diffusion models) to investigate whether human visual perception can contribute to fake image detection. With the rise of advanced image generation techniques, detecting fake images, especially those subtly manipulated, is crucial to combat misinformation. This study explores the potential of leveraging human semantic knowledge for this task. The authors collected a dataset of real images and generated three types of fake counterparts using different diffusion-based editing techniques. They conducted an eye-tracking experiment to record participants' gaze patterns while viewing real and fake images and statistically analyzed the collected data, focusing on saliency map entropy. Human observers tend to focus on more confined regions when viewing fake images compared to more dispersed patterns observed with real images. Statistical analysis, including Kolmogorov-Smirnov, Cramér-von Mises, and Mann-Whitney U tests, reveals significant differences in saliency map entropy distributions between real and fake images, supporting the observed gaze pattern differences. The study suggests that human gaze information can potentially be integrated into automatic fake image detection systems to improve their accuracy. The study primarily focuses on partially manipulated images, and future work should investigate if similar gaze patterns exist for entirely generated images. Further research is needed to develop concrete methods for incorporating human gaze information into existing fake detection models. deepfakes, gaze tracking, visual perception, human in the loop, fake image detection
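The analysis pipeline is straightforward to reproduce in outline: compute the entropy of each normalized saliency/fixation map and compare the entropy distributions of real versus manipulated stimuli with two-sample tests. The sketch below uses SciPy's standard tests (cramervonmises_2samp requires SciPy >= 1.7) and assumes `ent_real`/`ent_fake` are arrays of per-image entropies; fixation filtering and saliency-map construction details are not reproduced here.

```python
import numpy as np
from scipy import stats

def saliency_entropy(saliency, eps=1e-12):
    """Shannon entropy of a saliency map normalized to a probability
    distribution; lower entropy means gaze concentrated on fewer regions."""
    p = saliency.astype(np.float64).ravel()
    p = p / (p.sum() + eps)
    return float(-(p * np.log2(p + eps)).sum())

def compare_entropies(ent_real, ent_fake):
    # Two-sample tests in the spirit of the paper's statistical analysis.
    return {
        "ks": stats.ks_2samp(ent_real, ent_fake),
        "cvm": stats.cramervonmises_2samp(ent_real, ent_fake),
        "mwu": stats.mannwhitneyu(ent_real, ent_fake),
    }
```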
2403.08902 Report Envision3D: One Image to 3D with Anchor Views Interpolation Yatian Pang, Tanghui Jia, Yujun Shi, Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Xing Zhou, Francis E. H. Tay, Li Yuan We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image. Recent methods that extract 3D content from multi-view images generated by diffusion models show great potential. However, it is still challenging for diffusion models to generate dense multi-view consistent images, which is crucial for the quality of 3D content extraction. To address this issue, we propose a novel cascade diffusion framework, which decomposes the challenging dense views generation task into two tractable stages, namely anchor views generation and anchor views interpolation. In the first stage, we train the image diffusion model to generate global consistent anchor views conditioning on image-normal pairs. Subsequently, leveraging our video diffusion model fine-tuned on consecutive multi-view images, we conduct interpolation on the previous anchor views to generate extra dense views. This framework yields dense, multi-view consistent images, providing comprehensive 3D information. To further enhance the overall generation quality, we introduce a coarse-to-fine sampling strategy for the reconstruction algorithm to robustly extract textured meshes from the generated dense images. Extensive experiments demonstrate that our method is capable of generating high-quality 3D content in terms of texture and geometry, surpassing previous image-to-3D baseline methods. Envision3D is a novel method for generating high-quality 3D content from a single image by generating and leveraging dense, multi-view consistent images. Generating 3D content from a single image is essential for various applications, and existing methods struggle to generate sufficiently dense and consistent multi-view images for high-quality 3D extraction. The paper proposes a cascade diffusion framework. Stage 1 generates consistent anchor views and their normal maps using a multi-view attention mechanism, cross-domain attention, and an Instruction Representation Injection (IRI) module. Stage 2 interpolates between anchor views using a fine-tuned video diffusion model. Finally, a coarse-to-fine sampling strategy refines 3D content extraction using an SDF-based reconstruction method. Envision3D generates denser and higher-quality multi-view consistent images compared to baselines. The method produces superior 3D content with higher fidelity texture and geometry compared to existing image-to-3D methods. Ablation studies confirm the effectiveness of increasing view count and using the proposed coarse-to-fine sampling strategy. The reliance on a pre-trained normal prediction model in Stage 1 could limit generalization ability. Future work can explore alternative reconstruction methods or combine Envision3D with other 3D generation techniques to further enhance results. 3d generation, diffusion models, multi-view consistency, textured mesh, single image to 3d
2403.08857 Report DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, Wei Liu Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inability to perform multi-turn image generation, hindering a dynamic and iterative creation process. Recent attempts have tried to equip Multi-modal Large Language Models (MLLMs) with T2I models to bring the user's natural language instructions into reality. Hence, the output modality of MLLMs is extended, and the multi-turn generation quality of T2I models is enhanced thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works face challenges in identifying correct output modalities and generating coherent images accordingly as the number of output modalities increases and the conversations go deeper. Therefore, we propose DialogGen, an effective pipeline to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics to measure the model's ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models. This paper introduces DialogGen, a pipeline to align Multi-modal Large Language Models (MLLMs) and Text-to-Image (T2I) models for multi-turn text-to-image generation in Multi-modal Interactive Dialogue Systems (MIDS), and DialogBen, a bilingual benchmark to evaluate such systems. Effective interaction with T2I models is challenging for average users due to the need for specialized prompt engineering knowledge. Existing MLLMs face difficulties in identifying correct output modalities and generating coherent images in multi-turn settings. DialogGen leverages drawing prompt alignment, curated bilingual training data, and error correction. DialogBen includes 9957 multi-modal conversations and evaluates modality switching accuracy and generation coherence using Visual Question Answering (VQA). DialogGen achieves high modality switching accuracy, outperforming baselines like NExT-GPT and SEED-LLaMA. DialogGen with error correction significantly improves performance, especially with limited training data diversity. Bilingual training further enhances DialogGen's modality switching accuracy. Resource requirement for re-captioning T2I training data can be high. Future work can explore aligning training data with human preferences. text-to-image generation, multi-modal interactive dialogue systems, multi-modal large language models, benchmarking, error correction
2403.08551 Report GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting Xinjie Zhang, Xingtong Ge, Tongda Xu, Dailan He, Yan Wang, Hongwei Qin, Guo Lu, Jing Geng, Jun Zhang Implicit neural representations (INRs) recently achieved great success in image representation and compression, offering high visual quality and fast rendering speeds with 10-1000 FPS, assuming sufficient GPU resources are available. However, this requirement often hinders their use on low-end devices with limited memory. In response, we propose a groundbreaking paradigm of image representation and compression by 2D Gaussian Splatting, named GaussianImage. We first introduce 2D Gaussians to represent the image, where each Gaussian has 8 parameters including position, covariance and color. Subsequently, we unveil a novel rendering algorithm based on accumulated summation. Remarkably, our method, with a minimum of 3x lower GPU memory usage and 5x faster fitting time, not only rivals INRs (e.g., WIRE, I-NGP) in representation performance, but also delivers a faster rendering speed of 1500-2000 FPS regardless of parameter size. Furthermore, we integrate an existing vector quantization technique to build an image codec. Experimental results demonstrate that our codec attains rate-distortion performance comparable to compression-based INRs such as COIN and COIN++, while facilitating decoding speeds of approximately 1000 FPS. Additionally, preliminary proof of concept shows that our codec surpasses COIN and COIN++ in performance when using partial bits-back coding. Code will be available at https://github.com/Xinjie-Q/GaussianImage. Presents GaussianImage, a novel image representation and compression paradigm using 2D Gaussian Splatting, achieving faster rendering and less memory usage than INR methods. Addresses limitations of Implicit Neural Representations (INRs) such as high GPU memory consumption, slow decoding speed, and long training times, hindering deployment on low-end devices. Represents images using 2D Gaussians, each with 8 learnable parameters. Introduces an accumulated summation-based rasterization, replacing depth-based sorting and alpha-blending. Develops an image codec by integrating vector quantization for Gaussian attribute compression. Achieves 1500-2000 FPS rendering speed regardless of parameter size, outperforming INR methods like WIRE and I-NGP. Requires 3x lower GPU memory than competitive INR methods while maintaining comparable image representation quality. Attains rate-distortion performance comparable to INR-based codecs like COIN and COIN++ with a significantly faster decoding speed around 1000 FPS. Encoding speed is slower than VAE-based codecs, leaving room for improvement in image fitting and Gaussian compression. Current compression performance lags behind traditional/VAE-based codecs, necessitating development of specialized Gaussian-based compression algorithms. 2d gaussian splatting, image representation, image compression, neural image codec, fast rendering
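The accumulated-summation renderer is simple enough to sketch densely (the actual method uses a tiled rasterizer for speed). In the toy version below, each pixel's color is just the sum of every Gaussian's color weighted by its 2D Gaussian response, with no sorting and no alpha compositing; the shapes and parameterization (means in pixel coordinates, precomputed inverse covariances) are assumptions for illustration.

```python
import torch

def render_gaussian_image(means, inv_cov, colors, H, W):
    """Dense sketch of accumulated-summation rendering.
    means: (N, 2) Gaussian centers in pixel coords, inv_cov: (N, 2, 2) inverse
    covariances, colors: (N, 3). Returns an (H, W, 3) image."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).view(-1, 2)            # (H*W, 2)
    d = pix[None, :, :] - means[:, None, :]                    # (N, H*W, 2)
    # quadratic form d^T Sigma^{-1} d for every (gaussian, pixel) pair
    q = torch.einsum("npi,nij,npj->np", d, inv_cov, d)
    w = torch.exp(-0.5 * q)                                    # Gaussian responses
    img = (w[..., None] * colors[:, None, :]).sum(dim=0)       # accumulate, no blending
    return img.view(H, W, 3)
```

Because the sum is order-independent, there is no depth sorting at render time, which is a large part of why decoding stays fast regardless of the number of Gaussians per pixel.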
2403.08498 Report Gaussian Splatting in Style Abhishek Saroha, Mariia Gladkova, Cecilia Curreli, Tarun Yenamandra, Daniel Cremers Scene stylization extends the work of neural style transfer to three spatial dimensions. A vital challenge in this problem is to maintain the uniformity of the stylized appearance across a multi-view setting. A vast majority of the previous works achieve this by optimizing the scene with a specific style image. In contrast, we propose a novel architecture trained on a collection of style images that, at test time, produces high-quality stylized novel views. Our work builds upon the framework of 3D Gaussian splatting. For a given scene, we take the pretrained Gaussians and process them using a multi-resolution hash grid and a tiny MLP to obtain the conditional stylized views. The explicit nature of 3D Gaussians gives us inherent advantages over NeRF-based methods, including geometric consistency, along with a fast training and rendering regime. This enables our method to be useful for many practical use cases, such as augmented or virtual reality applications. Through our experiments, we show that our method achieves state-of-the-art performance with superior visual quality on various indoor and outdoor real-world data. Introduces Gaussian Splatting in Style (GSS), a novel method for real-time neural scene stylization based on 3D Gaussian splatting. Real-time scene stylization is crucial for applications like AR/VR, and existing methods often lack speed or consistency. GSS addresses this by leveraging the efficiency and explicit nature of 3D Gaussian representations. GSS uses pre-trained 3D Gaussians and a 2D stylization module (AdaIN). It learns a mapping from Gaussian positions and style image latents to stylized RGB colors using a multi-resolution hash grid and a tiny MLP. This allows for view-dependent color prediction without sacrificing rendering speed. GSS achieves state-of-the-art performance in short-term and long-term view consistency, outperforming NeRF-based methods. Qualitative results show GSS excels in preserving content details and faithfully transferring style features, surpassing baselines in accuracy and visual quality. GSS renders stylized novel views at approximately 157 FPS, significantly faster than other methods due to its efficient 3DGS backbone. The current method relies on pre-trained 3D Gaussians, limiting its application to scenes with available 3DGS models. Further exploration of alternative 2D stylization techniques or incorporating semantic information could enhance stylization quality. scene stylization, gaussian splatting, 3dgs, real-time rendering, novel view synthesis
2403.08436 Report PFStorer: Personalized Face Restoration and Super-Resolution Tuomas Varanka, Tapani Toivonen, Soumya Tripathy, Guoying Zhao, Erman Acar Recent developments in face restoration have achieved remarkable results in producing high-quality and lifelike outputs. The stunning results, however, often fail to be faithful with respect to the identity of the person, as the models lack the necessary context. In this paper, we explore the potential of personalized face restoration with diffusion models. In our approach a restoration model is personalized using a few images of the identity, leading to tailored restoration with respect to the identity while retaining fine-grained details. By using independent trainable blocks for personalization, the rich prior of a base restoration model can be exploited to its fullest. To avoid the model relying on parts of identity left in the conditioning low-quality images, a generative regularizer is employed. With a learnable parameter, the model learns to balance between the details generated based on the input image and the degree of personalization. Moreover, we improve the training pipeline of face restoration models to enable an alignment-free approach. We showcase the robust capabilities of our approach in several real-world scenarios with multiple identities, demonstrating our method's ability to generate fine-grained details with faithful restoration. In the user study, we evaluate the perceptual quality and faithfulness of the generated details, with our method being voted best 61% of the time compared to the second best with 25% of the votes. This paper proposes PFStorer, a personalized face restoration method using diffusion models that leverages a few high-quality reference images to restore low-quality face images while preserving identity. Face restoration is ill-posed, with multiple plausible solutions. Existing methods often fail to retain the identity or generate fine-grained details, especially in challenging real-world scenarios. PFStorer personalizes a pre-trained face restoration diffusion model by fine-tuning it with reference images. It utilizes independent trainable blocks for personalization, preserving the base model's priors. A generative regularizer forces the model to learn a robust identity representation solely from reference images. Additionally, the base model training pipeline is improved with an alignment-free approach and robust noise generation. PFStorer outperforms state-of-the-art methods in preserving identity features, evidenced by quantitative metrics and a user study. The method demonstrates robustness in handling real-world degradations, variations in pose and illumination. Learnable adapters and the generative regularizer are crucial for balancing personalization and retaining restoration quality. The output is limited by the quality and appearance variations present in the provided reference images. PFStorer inherits limitations of diffusion models, such as slow sampling speed and occasional artifacts. face restoration, diffusion models, personalization, generative regularization, alignment-free
2403.08381 Report Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models Pengze Zhang, Hubery Yin, Chen Li, Xiaohua Xie Most diffusion models assume that the reverse process adheres to a Gaussian distribution. However, this approximation has not been rigorously validated, especially at singularities, where t=0 and t=1. Improperly dealing with such singularities leads to an average brightness issue in applications, and limits the generation of images with extreme brightness or darkness. We primarily focus on tackling singularities from both theoretical and practical perspectives. Initially, we establish the error bounds for the reverse process approximation, and showcase its Gaussian characteristics at singularity time steps. Based on this theoretical insight, we confirm that the singularity at t=1 is conditionally removable, while the one at t=0 is an inherent property. Building on these conclusions, we propose a novel plug-and-play method SingDiffusion to address the initial singular time step sampling, which not only effectively resolves the average brightness issue for a wide range of diffusion models without extra training efforts, but also enhances their generation capability, achieving notably lower FID scores. The paper proposes SingDiffusion, a plug-and-play method to address the singularity issue at the initial time step in diffusion models, which leads to an average brightness issue in generated images. Most diffusion models ignore singularities at t=0 and t=1, resulting in an inability to generate images with extreme brightness and darkness. Existing solutions require model-specific retraining, limiting their practicality. The authors prove the approximate Gaussian characteristics of the reverse diffusion process at all time steps. They analyze the singularities and devise SingDiffusion, which trains a separate model for the initial sampling step (t=1) using x-prediction and seamlessly integrates with existing pre-trained diffusion models for subsequent steps. SingDiffusion effectively resolves the average brightness issue, allowing for the generation of both bright and dark images. SingDiffusion improves the FID scores of existing diffusion models, demonstrating enhanced image quality. SingDiffusion is a once-trained, plug-and-play module compatible with a wide range of pre-trained models and plugins like ControlNet. The current training data only includes image-prompt pairs, limiting its application to other domains like audio generation. The normalization operation for classifier-free guidance at the initial time step could be further improved. diffusion models, singularity, average brightness, image generation, plug-and-play
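The plug-and-play hand-off can be pictured as follows: a separately trained x-prediction network handles only the singular initial step at t=1, after which any pre-trained model takes over at a regular time step. The helper names (sing_model, pretrained_sampler, alpha_bar) and the hand-off time below are assumptions for illustration; only the control flow of the hand-off is sketched.

```python
# Hedged sketch of handing off from a SingDiffusion-style initial step to a standard
# pre-trained sampler. All callables are placeholders supplied by the caller.
import torch

@torch.no_grad()
def sample(sing_model, pretrained_sampler, alpha_bar, prompt, shape, t_init=0.999):
    x1 = torch.randn(shape)                        # at t=1 the marginal is pure noise
    x0_hat = sing_model(x1, prompt)                # x-prediction: no division by sqrt(alpha_bar) -> 0,
                                                   # which is what makes epsilon-prediction ill-posed here
    a = torch.as_tensor(alpha_bar(t_init))         # jump to a regular, non-singular time step
    x_t = a.sqrt() * x0_hat + (1.0 - a).sqrt() * torch.randn(shape)
    return pretrained_sampler(x_t, prompt, t_start=t_init)   # unchanged pre-trained model from here on
```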
2403.08277 Report VIGFace: Virtual Identity Generation Model for Face Image Synthesis Minsoo Kim, Min-Cheol Sagong, Gi Pyo Nam, Junghyun Cho, Ig-Jae Kim Deep learning-based face recognition continues to face challenges due to its reliance on huge datasets obtained from web crawling, which can be costly to gather and raise significant real-world privacy concerns. To address this issue, we propose VIGFace, a novel framework capable of generating synthetic facial images. Initially, we train the face recognition model using a real face dataset and create a feature space for both real and virtual IDs where virtual prototypes are orthogonal to other prototypes. Subsequently, we generate synthetic images by using the diffusion model based on the feature space. Our proposed framework provides two significant benefits. Firstly, it allows for creating virtual facial images without concerns about portrait rights, guaranteeing that the generated virtual face images are clearly differentiated from existing individuals. Secondly, it serves as an effective augmentation method by incorporating real existing images. Further experiments demonstrate the efficacy of our framework, achieving state-of-the-art results from both perspectives without any external data. Presents VIGFace, a novel framework for generating synthetic facial images of virtual identities for face recognition, addressing privacy concerns and data scarcity. Real face datasets for face recognition raise privacy concerns, are costly to obtain, and can contain label inaccuracies and biases. Trains a face recognition model with real data, incorporates virtual identity prototypes, and utilizes a diffusion model to generate synthetic images based on the feature space of the trained model. Generated virtual face images demonstrate high intra-class variance and inter-class diversity. FR model trained solely on VIGFace virtual images achieves state-of-the-art performance, comparable to models trained on real datasets. Using VIGFace for data augmentation, by combining its generated images with real data, further improves FR model performance. The paper focuses on frontal face images, and extending the approach to handle pose variations could be explored. Investigating the generalization capability of FR models trained on VIGFace to other downstream tasks or datasets is a potential future direction. face recognition, diffusion model, image generation, data augmentation, synthetic data
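One simple way to realize "virtual prototypes orthogonal to other prototypes" is to project random directions onto the orthogonal complement of the real identity prototypes, as sketched below. This construction is an assumption about the mechanism (the paper only states the orthogonality property), and the real prototypes would come from the trained FR classifier rather than random vectors.

```python
# Rough sketch: build virtual-identity prototypes orthogonal to the real ones by
# projecting random vectors onto the orthogonal complement of the real prototypes.
import torch

def make_virtual_prototypes(real_protos: torch.Tensor, n_virtual: int) -> torch.Tensor:
    """real_protos: (n_real, d) L2-normalized classifier prototypes."""
    q, _ = torch.linalg.qr(real_protos.T)            # orthonormal basis of the real-ID subspace, (d, n_real)
    rand = torch.randn(n_virtual, real_protos.shape[1])
    rand = rand - (rand @ q) @ q.T                   # remove every component lying in the real-ID subspace
    return torch.nn.functional.normalize(rand, dim=1)

protos = torch.nn.functional.normalize(torch.randn(100, 512), dim=1)
virtual = make_virtual_prototypes(protos, n_virtual=50)
print((virtual @ protos.T).abs().max())              # ~0: every virtual ID is orthogonal to every real ID
```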
2403.08268 Report Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, Qifeng Chen Despite recent advances in image-to-video generation, better controllability and local animation are less explored. Most existing image-to-video methods are not locally aware and tend to move the entire scene. However, human artists may need to control the movement of different objects or regions. Additionally, current I2V methods require users not only to describe the target motion but also to provide redundant detailed descriptions of frame contents. These two issues hinder the practical utilization of current I2V tools. In this paper, we propose a practical framework, named Follow-Your-Click, to achieve image animation with a simple user click (for specifying what to move) and a short motion prompt (for specifying how to move). Technically, we propose the first-frame masking strategy, which significantly improves the video generation quality, and a motion-augmented module equipped with a short motion prompt dataset to improve the short prompt following abilities of our model. To further control the motion speed, we propose flow-based motion magnitude control to control the speed of target movement more precisely. Our framework has simpler yet precise user control and better generation performance than previous methods. Extensive experiments compared with 7 baselines, including both commercial tools and research methods on 8 metrics, suggest the superiority of our approach. Project Page: https://follow-your-click.github.io/ This paper introduces Follow-Your-Click, a novel framework for open-domain regional image animation controlled by a user click (specifying the region to animate) and a short motion prompt (describing the desired motion). Current image-to-video generation methods lack local animation control, requiring detailed scene descriptions and struggling to follow short motion prompts. This limits their practical use for animators who need precise control over object motion. The framework leverages a pre-trained image LDM and incorporates several key components: a user click converted to a mask using SAM, first-frame masking training for improved temporal consistency, a motion-augmented module trained on a short prompt dataset (WebVid-Motion) for enhanced prompt following, and flow-based motion magnitude control for accurate motion speed adjustment. Follow-Your-Click demonstrates superior regional animation capabilities compared to existing open-sourced and commercial baselines, as shown in qualitative comparisons and quantitative evaluations using metrics like FVD, temporal consistency, and text alignment. Ablation studies validate the effectiveness of each proposed component, such as first-frame masking for enhanced temporal coherence and the motion-augmented module for improved short prompt following. The framework shows potential for applications like multi-region animation and integration with ControlNet for precise motion control using pose conditioning. The approach still faces limitations in generating complex and large human motions, potentially due to dataset bias and the complexity of such movements. Future work could explore incorporating more sophisticated motion control mechanisms and expanding the diversity of motion in the training dataset. image animation, text-to-video generation, diffusion models, regional control, short prompt following
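The flow-based motion-magnitude control can be pictured as conditioning on a single scalar per frame pair, e.g. the mean optical-flow magnitude. The sketch below uses OpenCV's Farneback flow as a stand-in estimator; the paper does not specify which flow method it uses, so treat the details as illustrative.

```python
# Illustrative flow-based motion-magnitude signal: average optical-flow magnitude per
# consecutive frame pair, usable as a motion-speed conditioning scalar during training.
import cv2
import numpy as np

def motion_magnitude(frames):
    """frames: list of HxW uint8 grayscale frames; returns mean flow magnitude per pair."""
    mags = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(float(np.linalg.norm(flow, axis=-1).mean()))
    return mags

frames = [np.random.randint(0, 255, (64, 64), np.uint8) for _ in range(8)]
print(motion_magnitude(frames))   # one scalar per consecutive frame pair
```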
2403.08255 Report Make Me Happier: Evoking Emotions Through Image Diffusion Models Qing Lin, Jingfeng Zhang, Yew Soon Ong, Mengmi Zhang Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. For the first time, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce four new evaluation metrics to systematically benchmark all the methods. Experimental results demonstrate that our method surpasses all competitive baselines. Our diffusion model is capable of identifying emotional cues from original images, editing images that elicit desired emotions, and meanwhile, preserving the semantic structure of the original images. All code, model, and data will be made public. This paper introduces the novel problem of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while preserving the semantics and structures of original scenes. Emotion-evoked image editing has applications in various fields, including treatment of psychological disorders, product commercialization, and artistic design. This paper proposes EmoEditor, a novel diffusion model with a dual-branch architecture that integrates emotion-conditioned global context and local emotional cues from source images. It also introduces a new dataset EmoPair, consisting of 340,000 image pairs with emotion annotations. EmoEditor outperforms existing image editing methods in human psychophysics experiments, successfully evoking desired emotions in viewers. The proposed method preserves structural coherence and semantic consistency with source images while effectively manipulating emotional content. EmoEditor generalizes to challenging scenarios, including within-valence emotion editing and transforming emotionally neutral images. The model faces limitations in accurately handling fine-grained details on small faces within crowded scenes. Generating emotion-evoked images without exacerbating semantic and structural disparities between source and target images remains a challenge. image generation, emotion ai, diffusion models, image editing, computer vision
2403.08108 Report TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Fei Wen, Hugo Latapie, Mohsen Imani Task-oriented object detection aims to find objects suitable for accomplishing specific tasks. As a challenging task, it requires simultaneous visual data processing and reasoning under ambiguous semantics. Recent solutions are mainly all-in-one models. However, the object detection backbones are pre-trained without text supervision. Thus, to incorporate task requirements, their intricate models undergo extensive learning on a highly imbalanced and scarce dataset, resulting in capped performance, laborious training, and poor generalizability. In contrast, we propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection. Particularly for the latter, we resort to the recently successful large Vision-Language Models (VLMs) as our backbone, which provides rich semantic knowledge and a uniform embedding space for images and texts. Nevertheless, the naive application of VLMs leads to sub-optimal quality, due to the misalignment between embeddings of object images and their visual attributes, which are mainly adjective phrases. To this end, we design a transformer-based aligner after the pre-trained VLMs to re-calibrate both embeddings. Finally, we employ a trainable score function to post-process the VLM matching results for object selection. Experimental results demonstrate that our TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by 3.5% and only requires a single NVIDIA RTX 4090 for both training and inference. TaskCLIP, a novel two-stage framework for task-oriented object detection that leverages pre-trained Vision-Language Models (VLMs) for efficient and effective object selection. Existing all-in-one models for task-oriented object detection suffer from data scarcity and imbalance, leading to capped performance, laborious training, and poor generalizability. TaskCLIP first performs general object detection. Then, it leverages pre-trained VLMs like CLIP to match image patches with task-relevant visual attributes, generated by an LLM. A transformer-based aligner recalibrates the embedding space, and a score function guides object selection. TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by 3.5% mAP@0.5 on the COCO-Tasks dataset. It requires only a single NVIDIA RTX 4090 GPU for both training and inference, demonstrating higher training efficiency. A select-by-grouping mechanism effectively mitigates the high false negative rate caused by imbalanced training samples. TaskCLIP can be sensitive to the quality of bounding boxes generated by the object detection network. The model might misidentify objects with misleading appearances even after embedding recalibration. task-oriented object detection, vision-language models, clip, transformer, coco-tasks
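The task-guided selection stage can be approximated with off-the-shelf CLIP: embed each detected object crop and each LLM-generated attribute phrase, then keep crops whose similarity to the attributes is high. The attribute list, threshold, and ViT-B/32 checkpoint below are illustrative assumptions, and the paper's transformer-based aligner and learned score function are omitted.

```python
# Hedged sketch of TaskCLIP's second stage: score detector crops against LLM-generated
# visual-attribute phrases in CLIP's joint embedding space.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

attributes = ["a rigid flat object", "an object with a sharp edge"]   # hypothetical attributes for one task
text = clip.tokenize(attributes).to(device)

def select_objects(crops, threshold=0.25):
    """crops: list of PIL.Image object crops from a generic detector."""
    images = torch.stack([preprocess(c) for c in crops]).to(device)
    with torch.no_grad():
        img_f = torch.nn.functional.normalize(model.encode_image(images), dim=-1)
        txt_f = torch.nn.functional.normalize(model.encode_text(text), dim=-1)
    scores = (img_f @ txt_f.T).mean(dim=-1)          # mean similarity over all attribute phrases
    return [i for i, s in enumerate(scores) if s > threshold]

print(select_objects([Image.new("RGB", (224, 224)) for _ in range(3)]))
```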
2403.07874 Report Beyond Text: Frozen Large Language Models in Visual Signal Comprehension Lei Zhu, Fangyun Wei, Yanye Lu In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity, and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion; crucially, without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at https://github.com/zh460045050/V2L-Tokenizer. This paper introduces the Vision-to-Language Tokenizer (V2L Tokenizer), a novel approach that enables a frozen large language model (LLM) to comprehend and process visual signals directly without requiring fine-tuning on multi-modal datasets. This method is crucial for expanding the capabilities of LLMs to encompass visual comprehension and generation without the need for resource-intensive fine-tuning. The V2L Tokenizer translates images into a set of discrete tokens drawn from the LLM's vocabulary, viewing images as a "foreign language." It employs an encoder-decoder structure with two quantizers and leverages the LLM's vocabulary and CLIP for semantic mapping. The V2L Tokenizer outperforms previous methods in few-shot image classification tasks, demonstrating its ability to enable LLMs to understand visual concepts. It excels in image denoising tasks like inpainting and deblurring, showcasing its capacity to generate high-quality visual content. The approach effectively bridges the gap between visual and language modalities, allowing LLMs to perform tasks like image captioning and visual question answering. The performance of image generation tasks, while promising, can be further enhanced, potentially through LLM fine-tuning or alternative optimization strategies. The reliance on a pre-trained CLIP model introduces a dependency on external resources, and exploring CLIP-free alternatives could be a future direction. large language models, vision-to-language, image understanding, image denoising, tokenization
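The core trick of turning an image into tokens the frozen LLM already knows reduces to nearest-neighbour quantization against the LLM's embedding table, sketched below. The random encoder features and embedding table are placeholders; the actual V2L Tokenizer trains an encoder-decoder with two quantizers and uses CLIP to restrict the vocabulary to semantically meaningful entries.

```python
# Minimal sketch of quantizing visual features onto a frozen LLM's token-embedding table,
# so an image becomes a sequence of ordinary vocabulary tokens ("a foreign language").
import torch

def image_to_tokens(features: torch.Tensor, llm_embeddings: torch.Tensor) -> torch.Tensor:
    """features: (num_patches, d) projected visual features; llm_embeddings: (vocab, d)."""
    f = torch.nn.functional.normalize(features, dim=-1)
    e = torch.nn.functional.normalize(llm_embeddings, dim=-1)
    sim = f @ e.T                       # cosine similarity to every vocabulary embedding
    return sim.argmax(dim=-1)           # one LLM token id per image patch

vocab = torch.randn(32000, 4096)        # stand-in for a frozen LLM embedding table
patches = torch.randn(256, 4096)        # stand-in for projected encoder outputs
token_ids = image_to_tokens(patches, vocab)
print(token_ids.shape)                  # torch.Size([256]): the image expressed as vocabulary tokens
```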
2403.07860 Report Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation Shihao Zhao, Shaozhe Hao, Bojia Zi, Huaizhe Xu, Kwan-Yee K. Wong Text-to-image generation has made significant advancements with the introduction of text-to-image diffusion models. These models typically consist of a language model that interprets user prompts and a vision model that generates corresponding images. As language and vision models continue to progress in their respective domains, there is a great potential in exploring the replacement of components in text-to-image diffusion models with more advanced counterparts. A broader research objective would therefore be to investigate the integration of any two unrelated language and generative vision models for text-to-image generation. In this paper, we explore this objective and propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and plug-and-play approach without requiring modifications to the original weights of the language and vision models. Our pipeline is compatible with various language models and generative vision models, accommodating different structures. Within this framework, we demonstrate that incorporating superior modules, such as more advanced language models or generative vision models, results in notable improvements in capabilities like text alignment or image quality. Extensive evaluations have been conducted to verify the effectiveness of LaVi-Bridge. Code is available at https://github.com/ShihaoZhaoZSH/LaVi-Bridge. This paper introduces LaVi-Bridge, a flexible pipeline for text-to-image generation that allows seamless integration of diverse pre-trained language and generative vision models. The rapid progress in deep language and vision models poses a challenge for text-to-image generation in terms of integrating advanced models into existing text-to-image diffusion models. This paper bridges this gap by providing a framework for integrating any two unrelated language and vision models. LaVi-Bridge leverages LoRA and adapters to establish connections between pre-trained language and vision models without modifying their original weights. This allows for a plug-and-play approach where different models can be easily swapped and tested. Integrating superior models under LaVi-Bridge leads to improved performance, such as enhanced semantic understanding with advanced language models (e.g., Llama-2) or improved image quality with more powerful generative vision models (e.g., PixArt's transformer). The study demonstrated that LaVi-Bridge is compatible with various language model structures (encoder-only, encoder-decoder, decoder-only) and generative vision model structures (U-Net-based and Transformer-based). LaVi-Bridge requires only a relatively small dataset for fine-tuning the LoRA and adapter components, making it efficient in terms of training data and computational resources. Training with LaVi-Bridge on the same models and weights as an existing text-to-image diffusion model may not lead to significant improvements and might even slightly decrease performance. The paper primarily focuses on combining existing models and does not delve into the exploration of novel language or vision models specifically designed for text-to-image generation. text-to-image generation, diffusion models, language models, generative vision models, lora
2403.07764 Report Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model Yuxuan Zhang, Lifu Wei, Qing Zhang, Yiren Song, Jiaming Liu, Huaxia Li, Xu Tang, Yao Hu, Haibo Zhao Current makeup transfer methods are limited to simple makeup styles, making them difficult to apply in real-world scenarios. In this paper, we introduce Stable-Makeup, a novel diffusion-based makeup transfer method capable of robustly transferring a wide range of real-world makeup onto user-provided faces. Stable-Makeup is based on a pre-trained diffusion model and utilizes a Detail-Preserving (D-P) makeup encoder to encode makeup details. It also employs content and structural control modules to preserve the content and structural information of the source image. With the aid of our newly added makeup cross-attention layers in U-Net, we can accurately transfer the detailed makeup to the corresponding position in the source image. After content-structure decoupling training, Stable-Makeup can maintain content and the facial structure of the source image. Moreover, our method has demonstrated strong robustness and generalizability, making it applicable to various tasks such as cross-domain makeup transfer, makeup-guided text-to-image generation and so on. Extensive experiments have demonstrated that our approach delivers state-of-the-art (SOTA) results among existing makeup transfer methods and exhibits highly promising potential for broad applications in various related fields. Stable-Makeup, a novel diffusion-based makeup transfer method that robustly transfers diverse real-world makeup styles onto user-provided faces, addressing limitations of existing GAN-based methods in handling high-detail and creative cosmetics. Existing makeup transfer methods struggle with complex real-world makeup styles, limiting their practicality for diverse and intricate designs. This work aims to overcome this limitation and enable high-quality makeup transfer for a broader range of styles. Stable-Makeup leverages a pre-trained diffusion model and introduces: 1) Detail-Preserving Makeup Encoder to capture intricate makeup details, 2) Makeup Cross-attention Layers to align makeup features with facial regions, 3) Content and Structural Control Modules to maintain source image fidelity. The method is trained on a newly created dataset of 20k paired images with diverse makeup styles. Stable-Makeup demonstrates state-of-the-art performance, outperforming existing methods in transferring both light and heavy makeup with superior detail preservation. Quantitative evaluations using CLIP-I, DINO-I, SSIM, and L2-M metrics confirm superior makeup transfer capability and content-structure preservation. User studies validate the perceptual quality of Stable-Makeup, highlighting its ability to generate realistic and aesthetically pleasing makeup transfer results. Potential inconsistencies in facial structure within the training dataset, arising from limitations of text-based editing methods, might impact the model's performance. Future work includes refining data selection and exploring 3D makeup transfer to further enhance the method's capabilities. makeup transfer, diffusion models, detail preservation, content-structure control, real-world makeup
2403.07711 Report SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, Yutaka Matsuo Given the remarkable achievements in image generation through diffusion models, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features. However, attention layers are limited by their memory consumption, which increases quadratically with the length of the sequence. This limitation presents significant challenges when attempting to generate longer video sequences using diffusion models. To overcome this challenge, we propose leveraging state-space models (SSMs). SSMs have recently gained attention as viable alternatives due to their linear memory consumption relative to sequence length. In the experiments, we first evaluate our SSM-based model with UCF101, a standard benchmark of video generation. In addition, to investigate the potential of SSMs for longer video generation, we perform an experiment using the MineRL Navigate dataset, varying the number of frames to 64, 200, and 400. In these settings, our SSM-based model can considerably save memory consumption for longer sequences, while maintaining competitive FVD scores to the attention-based models. Our codes are available at https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models. This paper introduces a novel temporal state-space model (SSM) layer to replace the memory-intensive attention mechanism in video diffusion models (VDMs) for efficient video generation. Existing VDMs heavily rely on attention layers for capturing temporal features, leading to quadratic memory consumption with sequence length, hindering longer video generation. The proposed temporal SSM layer leverages bidirectional SSMs to capture comprehensive temporal dynamics, augmented by a multi-layer perceptron (MLP) to enhance information integration across dimensions. The SSM-based VDM achieves competitive or superior video generation quality (FVD score) compared to attention-based models on UCF101. The SSM-based model demonstrates superior memory efficiency, enabling training on 400-frame MineRL Navigate videos, while attention-based methods fail due to memory limitations. Ablation studies highlight the critical role of bidirectional SSMs and MLPs in achieving high-quality video generation. The study primarily focuses on unconditional video generation, leaving extensions to conditional generation for future work. Exploring alternative SSM architectures and their impact on long-term video generation is a promising research direction. video generation, diffusion models, state-space models, attention mechanism, memory efficiency
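A toy version of the temporal SSM layer is sketched below: a diagonal linear recurrence scanned forward and backward over the frame axis, followed by an MLP, so memory grows linearly with the number of frames instead of quadratically as in temporal attention. This is a plain (non-selective) state-space toy, not the Mamba-style selective scan the paper builds on.

```python
# Toy bidirectional temporal state-space block: a forward and a backward linear scan over
# the frame axis, fused by an MLP. Memory grows with T, not T^2.
import torch
import torch.nn as nn

class BidirectionalTemporalSSM(nn.Module):
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.rand(dim, state) * -1.0)   # negative -> stable exponential decay
        self.B = nn.Parameter(torch.randn(dim, state) * 0.1)
        self.C = nn.Parameter(torch.randn(dim, state) * 0.1)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def scan(self, x):                       # x: (B, T, D)
        decay = torch.exp(self.A)            # (D, S), entries in (0, 1)
        h = torch.zeros(x.shape[0], x.shape[2], self.A.shape[1], device=x.device)
        outs = []
        for t in range(x.shape[1]):          # sequential scan: state is reused, memory is O(T)
            h = decay * h + self.B * x[:, t, :, None]
            outs.append((h * self.C).sum(-1))
        return torch.stack(outs, dim=1)

    def forward(self, x):                    # x: (B, T, D) temporal tokens at one spatial location
        fwd = self.scan(x)
        bwd = self.scan(x.flip(1)).flip(1)   # backward pass supplies future context as well
        return x + self.mlp(torch.cat([fwd, bwd], dim=-1))

layer = BidirectionalTemporalSSM(dim=64)
print(layer(torch.randn(2, 400, 64)).shape)  # comfortably handles long frame counts, e.g. 400
```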
2403.07605 Report Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image Generation Michael Ogezi, Ning Shi In text-to-image generation, using negative prompts, which describe undesirable image characteristics, can significantly boost image quality. However, producing good negative prompts is manual and tedious. To address this, we propose NegOpt, a novel method for optimizing negative prompt generation toward enhanced image generation, using supervised fine-tuning and reinforcement learning. Our combined approach results in a substantial increase of 25% in Inception Score compared to other approaches and surpasses ground-truth negative prompts from the test set. Furthermore, with NegOpt we can preferentially optimize the metrics most important to us. Finally, we construct Negative Prompts DB, a dataset of negative prompts. NegOpt, a novel method for optimizing negative prompts in text-to-image generation, aiming to improve image quality by guiding the model away from undesirable characteristics. Generating high-quality negative prompts manually is tedious and challenging. This method automates the process and leads to significant improvements in image aesthetics and fidelity. A two-step approach: 1) Fine-tuning a sequence-to-sequence model on a dataset of normal and corresponding negative prompts, 2) Employing reinforcement learning to further optimize the model based on a reward function that considers aesthetics, prompt alignment, and image fidelity. Achieves a 25% increase in Inception Score compared to other methods, indicating improved image quality and diversity. Outperforms ground-truth negative prompts from the test set, demonstrating the model's ability to learn effective patterns. Allows for preferential optimization of specific image qualities, such as aesthetics, by adjusting the weights in the reward function. The dataset used to train the model may contain inherent biases, potentially leading to biased image generation. There is a risk of misuse, where the method could be exploited to generate harmful or misleading content. text-to-image generation, negative prompts, prompt optimization, reinforcement learning, image quality
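The reinforcement-learning stage optimizes a weighted combination of image metrics; a minimal sketch of such a reward is shown below. The specific scorers and weights are placeholders; the point is that re-weighting them lets one preferentially optimize, say, aesthetics over fidelity.

```python
# Hedged sketch of a NegOpt-style reward: generate with the candidate negative prompt, then
# combine aesthetics, prompt alignment, and fidelity scores with tunable weights.
def negopt_reward(prompt, negative_prompt, scorers,
                  w_aesthetic=0.5, w_alignment=0.3, w_fidelity=0.2):
    """scorers: dict of callables; "generate" runs the frozen T2I model, the rest score the image."""
    image = scorers["generate"](prompt, negative_prompt)
    return (w_aesthetic * scorers["aesthetic"](image)
            + w_alignment * scorers["clip_alignment"](image, prompt)
            + w_fidelity * scorers["fidelity"](image))
```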
2403.07589 Report PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution Honghao Chen, Xiangxiang Chu, Yongjian Ren, Xin Zhao, Kaiqi Huang Recently, some large kernel convnets strike back with appealing performance and efficiency. However, given the quadratic complexity of convolution, scaling up kernels can bring about an enormous number of parameters, and the proliferated parameters can induce severe optimization problems. Due to these issues, current CNNs compromise to scale up to 51x51 in the form of stripe convolution (i.e., 51x5 + 5x51) and start to saturate as the kernel size continues growing. In this paper, we delve into addressing these vital issues and explore whether we can continue scaling up kernels for more performance gains. Inspired by human vision, we propose a human-like peripheral convolution that efficiently reduces over 90% of the parameter count of dense grid convolution through parameter sharing, and manages to scale the kernel size up to extremely large values. Our peripheral convolution behaves highly similarly to human vision, reducing the complexity of convolution from O(K^2) to O(logK) without degrading performance. Built on this, we propose Parameter-efficient Large Kernel Network (PeLK). Our PeLK outperforms modern vision Transformers and ConvNet architectures like Swin, ConvNeXt, RepLKNet and SLaK on various vision tasks including ImageNet classification, semantic segmentation on ADE20K and object detection on MS COCO. For the first time, we successfully scale up the kernel size of CNNs to an unprecedented 101x101 and demonstrate consistent improvements. Proposes Peripheral Convolution, a new convolution form inspired by human peripheral vision, to reduce parameter complexity in large kernel CNNs, enabling extremely large kernels (e.g., 101x101). Large kernel CNNs are effective but suffer from quadratic parameter complexity, limiting their scalability. Peripheral convolution addresses this by efficiently reducing parameters while maintaining performance. Peripheral convolution uses parameter sharing in the kernel's peripheral regions with exponentially increasing granularity, mimicking human vision. It also incorporates kernel-wise positional embedding to compensate for detail blurring caused by sharing. Dense grid convolution consistently outperforms stripe convolution across different kernel sizes. PeLK, built on peripheral convolution, achieves state-of-the-art performance on ADE20K, MS COCO, and ImageNet, surpassing Swin Transformer and ConvNeXt. Peripheral convolution enables scaling kernel size to 101x101 with consistent performance gains, demonstrating its effectiveness. Exploring even larger kernel sizes and input resolutions might be computationally expensive. The optimal kernel size configuration might need adjustments based on specific tasks and datasets. convolutional neural networks, large kernel convolution, peripheral vision, parameter efficiency, effective receptive field
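The parameter-sharing idea can be illustrated with a toy kernel builder: per-position weights in a small central region, and one shared weight per exponentially wider ring toward the periphery, so the number of distinct parameters grows roughly with log K rather than K^2. The exact partition, per-channel handling, and kernel-wise positional embedding of PeLK are not reproduced here.

```python
# Toy sketch of peripheral-style parameter sharing: assemble a dense K x K kernel from a
# small bank of shared weights, fine near the center and exponentially coarser toward the edge.
import torch

def bucket(offset: int, fine: int = 2) -> int:
    """Map a signed offset from the kernel center to a shared-parameter index along one axis."""
    a = abs(offset)
    if a <= fine:
        return offset + fine                        # dense per-position weights near the center
    level = a.bit_length() - fine.bit_length()      # ring index grows with log(|offset|)
    return 2 * fine + 1 + 2 * level + (1 if offset < 0 else 0)

def build_kernel(K: int = 31):
    idx = torch.tensor([[bucket(i - K // 2) * 64 + bucket(j - K // 2)
                         for j in range(K)] for i in range(K)])
    unique, remap = torch.unique(idx, return_inverse=True)
    bank = torch.nn.Parameter(torch.randn(len(unique)))   # the only learnable weights
    return bank, remap.view(K, K)

bank, remap = build_kernel(31)
kernel = bank[remap]                                       # full 31x31 kernel built from the bank
print(len(bank), kernel.shape)                             # 121 shared weights vs 961 dense weights
```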
2403.07547 Report SMURF: Continuous Dynamics for Motion-Deblurring Radiance Fields Jungho Lee, Dogyoon Lee, Minhyeok Lee, Donghyung Kim, Sangyoun Lee Neural radiance fields (NeRF) have attracted considerable attention for their exceptional ability to synthesize novel views with high fidelity. However, the presence of motion blur, resulting from slight camera movements during extended shutter exposures, poses a significant challenge, potentially compromising the quality of the reconstructed 3D scenes. While recent studies have addressed this issue, they do not consider the continuous dynamics of camera movements during image acquisition, leading to inaccurate scene reconstruction. Additionally, these methods are plagued by slow training and rendering speeds. To effectively handle these issues, we propose sequential motion understanding radiance fields (SMURF), a novel approach that employs a neural ordinary differential equation (Neural-ODE) to model continuous camera motion and leverages the explicit volumetric representation method for faster training and robustness to motion-blurred input images. The core idea of SMURF is the continuous motion blurring kernel (CMBK), a unique module designed to model continuous camera movements for processing blurry inputs. Our model, rigorously evaluated against benchmark datasets, demonstrates state-of-the-art performance both quantitatively and qualitatively. This paper introduces SMURF, a novel method leveraging continuous dynamics for reconstructing sharp 3D scenes from motion-blurred images using neural radiance fields. Existing methods for handling motion blur in NeRF either neglect the continuous nature of camera motion or suffer from slow training and rendering speeds. SMURF addresses both limitations. SMURF employs a continuous motion blur kernel (CMBK) based on Neural-ODEs to model camera motion as a continuous function. It utilizes a tensor factorization-based representation (TensoRF) for faster training and robustness to blur. Two regularization techniques ensure accurate ray warping. SMURF achieves state-of-the-art quantitative results on synthetic and real-world datasets, outperforming previous methods in PSNR, SSIM, and LPIPS. The method significantly reduces training and rendering time compared to existing techniques. Qualitative evaluation through novel view rendering demonstrates SMURF's ability to reconstruct detailed 3D scenes and accurately restore sharp features. The current backbone, while faster than some, could be further sped up using newer rasterization-based methods like 3D Gaussian Splatting. Future work could explore extending the continuous dynamics approach to handle object motion blur in addition to camera motion blur. neural radiance fields, motion deblurring, continuous dynamics, neural odes, view synthesis
2403.07508 Report MoAI: Mixture of All Intelligence for Large Language and Vision Models Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence (1) visual features, (2) auxiliary features from the external CV models, and (3) language features by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR without enlarging the model size or curating extra visual instruction tuning datasets. Introduces MoAI, a new large language and vision model that leverages auxiliary visual information from external CV models and blends three types of intelligence: visual features, auxiliary features, and language features. Current LLVMs overlook the detailed real-world scene understanding offered by specialized CV models. MoAI aims to address this by incorporating these models to enhance visual perception capabilities in VL tasks. MoAI utilizes a MoAI-Compressor to process and condense verbalized outputs from external CV models (segmentation, detection, SGG, OCR). A MoAI-Mixer, inspired by MoE, then blends these auxiliary features with visual and language features from the backbone MLM. MoAI significantly outperforms open-source and closed-source LLVMs in zero-shot VL tasks, particularly those requiring real-world scene understanding. It achieves this without increasing model size or curating additional visual instruction tuning datasets. Ablation studies confirm the importance of each external CV model and the effectiveness of the MoAI-Compressor and MoAI-Mixer. MoAI is currently tailored for real-world scene understanding and could be extended to incorporate more CV models for broader capabilities. Future work includes incorporating robust, unbiased, and explainable CV models for more precise and reliable outputs. large language and vision models, mixture of experts, computer vision, real-world scene understanding, visual perception
2403.07500 Report Block-wise LoRA: Revisiting Fine-grained LoRA for Effective Personalization and Stylization in Text-to-Image Generation Likun Li, Haoqi Zeng, Changpeng Yang, Haozhe Jia, Di Xu The objective of personalization and stylization in text-to-image is to instruct a pre-trained diffusion model to analyze new concepts introduced by users and incorporate them into expected styles. Recently, parameter-efficient fine-tuning (PEFT) approaches have been widely adopted to address this task and have greatly propelled the development of this field. Despite their popularity, existing efficient fine-tuning methods still struggle to achieve effective personalization and stylization in T2I generation. To address this issue, we propose block-wise Low-Rank Adaptation (LoRA) to perform fine-grained fine-tuning for different blocks of SD, which can generate images faithful to input prompts and target identity and also with desired style. Extensive experiments demonstrate the effectiveness of the proposed method. This paper proposes block-wise Low-Rank Adaptation (LoRA) for Stable Diffusion, which selectively fine-tunes specific blocks of the model for improved personalization and stylization in text-to-image generation. Existing efficient fine-tuning methods, particularly LoRA, struggle to effectively combine personalization (e.g., a specific person's face) and stylization (e.g., a cartoon style) in generated images. The authors divide the Stable Diffusion U-Net into blocks (in-blocks, mid-block, out-blocks) and selectively apply LoRA fine-tuning to different blocks, exploring which block combinations yield the best results for combining character identity and artistic style. Block-wise LoRA outperforms standard LoRA and LoCon in generating images with consistent personalized identities and stylized appearances. Fine-tuning the top input and output blocks of the U-Net with style LoRA, while using full-block LoRA for character identity, achieved the best balance between personalization and stylization. The study provides insights into the roles of different U-Net blocks in the image generation process, showing that bottom blocks are less important for preserving target information. The work primarily focuses on LoRA and could explore applying block-wise fine-tuning to other PEFT methods. The impact of different block combinations on generation quality needs further investigation to develop a more principled approach for block selection. text-to-image generation, stable diffusion, personalization, stylization, low-rank adaptation (lora)
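Block-wise fine-tuning can be sketched as injecting LoRA adapters only into layers whose module path falls inside chosen U-Net blocks. The `down_blocks`/`up_blocks` prefixes follow the diffusers naming convention and the exact block choice below is illustrative; the paper's best setting applies the style LoRA to the top input and output blocks while the identity LoRA covers all blocks.

```python
# Hedged sketch of block-wise LoRA: a generic low-rank adapter wrapped only around Linear
# layers inside selected U-Net blocks, leaving the rest of the network untouched.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base                                          # frozen pre-trained projection
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                            # start as an identity update
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def apply_blockwise_lora(unet: nn.Module, block_prefixes=("down_blocks.0", "up_blocks.3")):
    for name, module in list(unet.named_modules()):
        if not isinstance(module, nn.Linear):
            continue
        if not any(name.startswith(p) for p in block_prefixes):
            continue                                              # skip layers outside the chosen blocks
        parent = unet.get_submodule(name.rsplit(".", 1)[0])
        setattr(parent, name.rsplit(".", 1)[1], LoRALinear(module))
    return unet
```

Only the `down`/`up` matrices of the wrapped layers are trained, so swapping the prefix set is enough to study which blocks carry style versus identity.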
2403.07494 Report SemGauss-SLAM: Dense Semantic Gaussian Splatting SLAM Siting Zhu, Renjie Qin, Guangming Wang, Jiuming Liu, Hesheng Wang We propose SemGauss-SLAM, the first semantic SLAM system utilizing 3D Gaussian representation, which enables accurate 3D semantic mapping, robust camera tracking, and high-quality rendering in real-time. In this system, we incorporate semantic feature embedding into 3D Gaussian representation, which effectively encodes semantic information within the spatial layout of the environment for precise semantic scene representation. Furthermore, we propose feature-level loss for updating 3D Gaussian representation, enabling higher-level guidance for 3D Gaussian optimization. In addition, to reduce cumulative drift and improve reconstruction accuracy, we introduce semantic-informed bundle adjustment leveraging semantic associations for joint optimization of 3D Gaussian representation and camera poses, leading to more robust tracking and consistent mapping. Our SemGauss-SLAM method demonstrates superior performance over existing dense semantic SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets, while also showing excellent capabilities in novel-view semantic synthesis and 3D semantic mapping. Presents SemGauss-SLAM, a dense semantic SLAM system based on 3D Gaussian splatting that enables accurate semantic mapping, robust camera tracking, and high-quality rendering in real time. The work addresses the limitations of existing dense SLAM methods by introducing semantic information to improve accuracy and efficiency in 3D reconstruction and semantic mapping. SemGauss-SLAM leverages a 3D Gaussian scene representation and incorporates semantic information into the bundle adjustment process for joint optimization of camera poses and scene representation. The method achieves state-of-the-art performance on Replica and ScanNet datasets, demonstrating significant improvement in tracking accuracy and semantic segmentation compared to existing methods. It maintains a fast runtime, outperforming other radiance field-based SLAM methods while providing semantic mapping capabilities. The proposed approach achieves high-quality reconstruction, capturing fine details and exhibiting smoother surfaces compared to baselines. The authors acknowledge that the reliance on a limited number of semantic categories poses a constraint on the system's applicability to more diverse environments. Future work will focus on incorporating object-level semantic understanding and exploring dynamic scene reconstruction. slam, semantic slam, gaussian splatting, 3d reconstruction, semantic mapping
2403.07487 Report Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM Zeyu Zhang, Akide Liu, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang Human motion generation stands as a significant pursuit in generative computer vision, while achieving long-sequence and efficient motion generation remains challenging. Recent advancements in state space models (SSMs), notably Mamba, have showcased considerable promise in long sequence modeling with an efficient hardware-aware design, which makes them a promising foundation for motion generation models. Nevertheless, adapting SSMs to motion generation faces hurdles due to the lack of a specialized architecture for modeling motion sequences. To address these challenges, we propose Motion Mamba, a simple and efficient approach that presents a pioneering motion generation model built upon SSMs. Specifically, we design a Hierarchical Temporal Mamba (HTM) block to process temporal data by ensembling varying numbers of isolated SSM modules across a symmetric U-Net architecture aimed at preserving motion consistency between frames. We also design a Bidirectional Spatial Mamba (BSM) block to process latent poses bidirectionally, enhancing accurate motion generation within a temporal frame. Our proposed method achieves up to a 50% FID improvement and up to 4 times faster inference on the HumanML3D and KIT-ML datasets compared to the previous best diffusion-based method, which demonstrates strong capabilities in high-quality long-sequence motion modeling and real-time human motion generation. See project website https://steve-zeyu-zhang.github.io/MotionMamba/ This paper presents Motion Mamba, a novel framework for efficient and long-sequence human motion generation using selective state space models (SSMs). Existing motion generation models, particularly diffusion-based ones, struggle with long-range sequence generation and suffer from slow inference speeds. Motion Mamba addresses these limitations. The model utilizes a U-Net architecture with novel Hierarchical Temporal Mamba (HTM) blocks for temporal modeling and Bidirectional Spatial Mamba (BSM) blocks for enhanced spatial representation learning. It leverages the efficiency of SSMs for long-sequence modeling and fast inference. Motion Mamba achieves up to 50% improvement in FID scores compared to previous state-of-the-art methods. It demonstrates significantly faster inference speeds, being up to 4 times faster than prior approaches. The effectiveness of the proposed framework is validated through comprehensive experiments and user studies on benchmark datasets like HumanML3D and KIT-ML. The model's performance could be further investigated under more complex and diverse motion generation scenarios. Exploring the integration of additional modalities, such as audio or visual cues, could enhance the model's generative capabilities. human motion generation, selective state space models, latent diffusion models, long-sequence modeling, efficient inference
2403.07392 Report ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, Yifeng Shi Although Vision Transformer (ViT) has achieved significant success in computer vision, it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems, which introduce additional pre-training costs. Therefore, we present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive field convolutional features into the ViT architecture, which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which is beneficial for handling dense prediction tasks. (3) We evaluate the performance of ViT-CoMer across various dense prediction tasks, different frameworks, and multiple advanced pre-training. Notably, our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and 62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks to facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer. This paper presents ViT-CoMer, a plain, pre-training-free, feature-enhanced ViT backbone for dense prediction tasks by facilitating bidirectional interaction between CNN and transformer. ViT doesn't perform well on dense prediction tasks due to the lack of inner-patch information interaction and limited feature scale diversity. Existing solutions introduce extra pre-training costs. ViT-CoMer integrates a multi-scale convolutional feature interaction module, including MRFP to provide multi-scale spatial information and CTI for bidirectional multi-scale feature fusion between CNN and Transformer. ViT-CoMer outperforms existing ViT-based methods and achieves comparable results to state-of-the-art methods on object detection, instance segmentation, and semantic segmentation. It effectively leverages various open-source pre-trained ViT weights for improved performance. ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data and 62.1% mIoU on ADE20K val, comparable to SOTA methods. The improvement from integrating the approach with hierarchical vision transformers like Swin is less significant compared to plain ViT. Future work could explore more efficient interaction mechanisms between CNN and Transformer. vision transformer, dense prediction, object detection, instance segmentation, semantic segmentation
2403.07371 Report Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models Phuong Dam, Jihoon Jeong, Anh Tran, Daeyoung Kim This study discusses the critical issues of Virtual Try-On in contemporary e-commerce and the prospective metaverse, emphasizing the challenges of preserving intricate texture details and distinctive features of the target person and the clothes in various scenarios, such as clothing texture and identity characteristics like tattoos or accessories. In addition to the fidelity of the synthesized images, the efficiency of the synthesis process presents a significant hurdle. Various existing approaches are explored, highlighting the limitations and unresolved aspects, e.g., identity information omission, uncontrollable artifacts, and low synthesis speed. It then proposes a novel diffusion-based solution that addresses garment texture preservation and user identity retention during virtual try-on. The proposed network comprises two primary modules - a warping module aligning clothing with individual features and a try-on module refining the attire and generating missing parts integrated with a mask-aware post-processing technique ensuring the integrity of the individual's identity. It demonstrates impressive results, surpassing the state-of-the-art in speed by nearly 20 times during inference, with superior fidelity in qualitative assessments. Quantitative evaluations confirm comparable performance with the recent SOTA method on the VITON-HD and Dresscode datasets. This paper introduces a novel diffusion-based virtual try-on method that excels in preserving both garment texture and user identity while being significantly faster than previous state-of-the-art methods. Virtual Try-On is crucial for e-commerce and the metaverse, but existing solutions struggle to balance garment detail, user identity preservation, and synthesis speed. The proposed method utilizes a two-module architecture with a warping module for aligning garments and a try-on module for refinement and missing part generation. A mask-aware post-processing technique ensures identity preservation and artifact reduction. The method achieves state-of-the-art results in qualitative evaluations, demonstrating superior detail and identity preservation compared to previous methods. Quantitative results show comparable or better performance than existing methods on standard benchmarks (VITON-HD and DressCode). The proposed method is significantly faster than the current state-of-the-art, achieving an inference speed over 17 times faster. The method relies on a relatively complex post-processing step, which could be streamlined in future work. Future research could focus on generalizing the approach to a wider range of clothing styles and body types. virtual try-on, diffusion models, identity preservation, time efficiency, mask-aware post-processing
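The mask-aware post-processing amounts to compositing: keep the generator's output only inside the garment region and the person's original pixels everywhere else, with a softened mask boundary to hide seams. The sketch below shows this composite only; the actual pipeline's warping and try-on modules are not represented.

```python
# Simple sketch of mask-aware identity preservation: outside the garment region the diffusion
# output is replaced by the original pixels, so tattoos, accessories and other identity details
# remain untouched. A blurred mask edge avoids visible seams.
import numpy as np
import cv2

def mask_aware_composite(source, generated, garment_mask, feather=15):
    """source, generated: HxWx3 float images in [0,1]; garment_mask: HxW in {0,1}."""
    soft = cv2.GaussianBlur(garment_mask.astype(np.float32), (feather, feather), 0)[..., None]
    return soft * generated + (1.0 - soft) * source     # keep the person's own pixels elsewhere

src = np.random.rand(256, 192, 3)
gen = np.random.rand(256, 192, 3)
mask = np.zeros((256, 192)); mask[60:200, 40:150] = 1
print(mask_aware_composite(src, gen, mask).shape)
```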
2403.07304 Report Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang Large Multimodal Model (LMM) is a hot research topic in the computer vision area and has also demonstrated remarkable potential across multiple disciplinary fields. A recent trend is to further extend and enhance the perception capabilities of LMMs. The current methods follow the paradigm of adapting the visual task outputs to the format of the language model, which is the main component of an LMM. This adaptation leads to convenient development of such LMMs with minimal modifications; however, it overlooks the intrinsic characteristics of diverse visual tasks and hinders the learning of perception capabilities. To address this issue, we propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement. We decouple the LMM's learning of perception capabilities into task-agnostic and task-specific stages. Lumen first promotes fine-grained vision-language concept alignment, which is the fundamental capability for various visual tasks. Thus the output of the task-agnostic stage is a shared representation for all the tasks we address in this paper. Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders with negligible training efforts. Benefiting from such a decoupled design, our Lumen surpasses existing LMM-based approaches on the COCO detection benchmark with a clear margin and exhibits seamless scalability to additional visual tasks. Furthermore, we also conduct comprehensive ablation studies and generalization evaluations for deeper insights. The code will be released at https://github.com/SxJyJay/Lumen. This paper presents Lumen, a Large multimodal model that enhances the vision-centric capabilities of LMMs by decoupling task-agnostic and task-specific learning. Existing LMMs are limited in their ability to perform diverse vision-centric tasks due to their reliance on language-oriented output formats and lack of focus on intrinsic visual task characteristics. Lumen first performs task-agnostic vision-language dense alignment by matching instructions with image regions, generating a heatmap. Then, lightweight, task-specific decoders use this heatmap to generate final outputs for tasks like object detection, segmentation, and pose estimation. Lumen significantly outperforms existing LMM-based methods on object detection and achieves comparable results to specialist models on other tasks. It demonstrates strong generalization ability, performing well on unseen datasets and tasks like object counting. Ablation studies validate the importance of the multi-task training, dense alignment architecture, and input size choices. The convergence speed may be limited by the optimization difficulty of using a single special token for querying image regions. Future work could explore vision encoders that can handle high-resolution inputs while maintaining semantic coherence with language modalities. large multimodal models, vision-centric capabilities, object detection, instance segmentation, pose estimation
2403.07234 Report It's All About Your Sketch: Democratising Sketch Control in Diffusion Models Subhadeep Koley, Ayan Kumar Bhunia, Deeptanshu Sekhri, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song This paper unravels the potential of sketches for diffusion models, addressing the deceptive promise of direct sketch control in generative AI. We importantly democratise the process, enabling amateur sketches to generate precise images, living up to the commitment of "what you sketch is what you get". A pilot study underscores the necessity, revealing that deformities in existing models stem from spatial-conditioning. To rectify this, we propose an abstraction-aware framework, utilising a sketch adapter, adaptive time-step sampling, and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model, working synergistically to reinforce fine-grained sketch-photo association. Our approach operates seamlessly during inference without the need for textual prompts; a simple, rough sketch akin to what you and I can create suffices! We welcome everyone to examine results presented in the paper and its supplementary. Contributions include democratising sketch control, introducing an abstraction-aware framework, and leveraging discriminative guidance, validated through extensive experiments. This paper introduces an abstraction-aware framework for sketch-conditioned image generation using diffusion models. It enables accurate image generation from amateur sketches, moving beyond the limitations of existing methods that rely on precise edgemaps or textual prompts. Existing sketch-to-image diffusion models often produce deformed outputs from freehand sketches due to their reliance on spatial conditioning. They also heavily depend on textual prompts, which can be limiting and lead to trade-offs between text coherence and sketch fidelity. The proposed framework utilizes a sketch adapter to convert input sketches into equivalent textual embeddings, guiding the denoising process through cross-attention. An adaptive time-step sampling strategy caters to different sketch abstraction levels, and a discriminative guidance mechanism leverages a pre-trained fine-grained sketch-based image retrieval model to enhance sketch-photo association. The method successfully generates photorealistic images from amateur sketches without relying on textual prompts during inference. It outperforms existing sketch-to-image generation methods in terms of FID-C, FGM, and MOS, demonstrating superior generation quality and sketch fidelity. The framework shows strong generalization ability, successfully handling sketches from unseen datasets, diverse stroke styles, and partially complete sketches. The model may struggle with categorical ambiguity when similar-looking objects have abstract or deformed sketches. Future work could explore incorporating class labels or additional conditioning signals to mitigate this limitation. sketch-to-image generation, diffusion models, abstraction-aware, discriminative guidance, generative ai
2403.07071 Report LISO: Lidar-only Self-Supervised 3D Object Detection Stefan Baur, Frank Moosmann, Andreas Geiger 3D object detection is one of the most important components in any Self-Driving stack, but current state-of-the-art (SOTA) lidar object detectors require costly & slow manual annotation of 3D bounding boxes to perform well. Recently, several methods emerged to generate pseudo ground truth without human supervision, however, all of these methods have various drawbacks: Some methods require sensor rigs with full camera coverage and accurate calibration, partly supplemented by an auxiliary optical flow engine. Others require expensive high-precision localization to find objects that disappeared over multiple drives. We introduce a novel self-supervised method to train SOTA lidar object detection networks which works on unlabeled sequences of lidar point clouds only, which we call trajectory-regularized self-training. It utilizes a SOTA self-supervised lidar scene flow network under the hood to generate, track, and iteratively refine pseudo ground truth. We demonstrate the effectiveness of our approach for multiple SOTA object detection networks across multiple real-world datasets. Code will be released. This paper introduces LISO, a novel self-supervised learning method for 3D object detection using only LiDAR point cloud sequences. Current state-of-the-art LiDAR object detectors heavily rely on expensive and time-consuming manual annotations of 3D bounding boxes. LISO aims to overcome this limitation by providing a self-supervised training approach. LISO leverages a self-supervised LiDAR scene flow network to generate initial pseudo ground truth (pgt) of moving objects. This pgt is iteratively refined through a trajectory-regularized self-training process which trains a single-frame object detector. LISO outperforms existing self-supervised methods on four different autonomous driving datasets (Waymo Open Dataset, KITTI, Argoverse 2, and Nuscenes). The method demonstrates its ability to generalize from detecting moving objects to detecting movable objects. Ablation studies confirm the importance of motion cues from scene flow and trajectory-regularized self-training for achieving good performance. LISO currently lacks the ability to distinguish between different object classes. Future work could focus on generating class labels for the detected objects, potentially by incorporating motion or size characteristics. self-supervised learning, lidar, object detection, 3d object detection, autonomous driving
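The trajectory regularization in the entry above can be thought of as fitting a simple motion model to each tracked pseudo box and using the fit to clean the pseudo ground truth before self-training. The sketch below is only a toy stand-in for that step, assuming a constant-velocity model over tracked box centers and a hypothetical speed threshold for discarding near-static clutter.

```python
import numpy as np

def regularize_track(centers, timestamps):
    """Fit a constant-velocity model x(t) = x0 + v * t to a tracked object's
    box centers (N, 3) via least squares and return the smoothed centers.

    In LISO the pseudo boxes come from a self-supervised scene flow network;
    here we only illustrate the smoothing/filtering idea on raw centers.
    """
    t = np.asarray(timestamps, dtype=np.float64)
    A = np.stack([np.ones_like(t), t], axis=1)             # (N, 2) design matrix
    coeffs, *_ = np.linalg.lstsq(A, centers, rcond=None)   # (2, 3): rows are x0 and v
    smoothed = A @ coeffs                                   # (N, 3) regularized centers
    speed = np.linalg.norm(coeffs[1])                       # magnitude of fitted velocity
    return smoothed, speed

# toy track: an object moving along x with jittery detections
ts = np.arange(10, dtype=np.float64) * 0.1
gt = np.stack([2.0 * ts, np.zeros_like(ts), np.zeros_like(ts)], axis=1)
noisy = gt + 0.05 * np.random.randn(*gt.shape)
smooth, speed = regularize_track(noisy, ts)
keep_as_pseudo_label = speed > 0.5   # e.g. drop tracks that barely move
```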
2403.06977 Report VideoMamba: State Space Model for Efficient Video Understanding Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution long video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) Sensitivity for recognizing short-term actions even with fine-grained motion differences; (3) Superiority in long-term video understanding, showcasing significant advancements over traditional feature-based models; and (4) Compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark for video understanding, offering a scalable and efficient solution for comprehensive video understanding. All the code and models are available at https://github.com/OpenGVLab/VideoMamba. This paper proposes VideoMamba, a purely State Space Model (SSM)-based video understanding model inspired by Mamba for NLP, offering linear complexity for efficient long-term video modeling. Existing methods like 3D CNNs and video transformers struggle to address both local redundancy and global dependencies in video understanding, particularly for long, high-resolution videos. VideoMamba offers a more efficient and scalable solution. VideoMamba adapts the bidirectional Mamba block to process 3D video sequences, introducing a novel self-distillation technique to enhance scalability and exploring various spatiotemporal scan methods. VideoMamba achieves state-of-the-art results on ImageNet-1K with 84.0% top-1 accuracy, outperforming isotropic architectures by significant margins. It outperforms attention-based methods on Kinetics-400 and Something-Something V2, demonstrating effectiveness in both scene-related and temporal-related action recognition. VideoMamba shows significant superiority over feature-based methods on long-term video understanding benchmarks (Breakfast, COIN, LVU), achieving state-of-the-art performance with end-to-end training. Scalability of VideoMamba has not been fully explored, such as extending to larger model sizes and integrating with other modalities or large language models. Further validation is needed for hour-level video understanding tasks. video understanding, state space model, mamba, long-term video modeling, self-distillation
2403.06976 Report BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs). Despite these advancements, current DM adaptations for inpainting, which involve modifications to the sampling strategy or the development of inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. Addressing these challenges, our work introduces a novel paradigm: the division of masked image features and noisy latent into separate branches. This division dramatically diminishes the model's learning load, facilitating a nuanced incorporation of essential masked image information in a hierarchical fashion. Herein, we present BrushNet, a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM, guaranteeing coherent and enhanced image inpainting outcomes. Additionally, we introduce BrushData and BrushBench to facilitate segmentation-based inpainting training and performance assessment. Our extensive experimental analysis demonstrates BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence. This paper introduces BrushNet, a plug-and-play image inpainting model that leverages a dual-branch diffusion approach to enhance semantic consistency and image quality. Existing diffusion-based image inpainting methods often struggle with semantic mismatches and reduced image quality due to limitations in mask processing and information integration. BrushNet employs a dual-branch architecture: one branch processes noisy latent features, while the other extracts masked image features using a VAE encoder and a frozen pre-trained diffusion model without text cross-attention. These features are then hierarchically integrated into the main diffusion model for coherent inpainting. A blurred blending strategy is also introduced to improve the preservation of unmasked regions. BrushNet outperforms previous state-of-the-art methods on both random and segmentation-based inpainting tasks, as demonstrated by quantitative evaluations using Image Reward, HPS v2, Aesthetic Score, PSNR, LPIPS, MSE, and CLIP Similarity metrics. The dual-branch design allows for flexible control over the inpainting process, including the choice of base diffusion model and the level of unmasked region preservation. BrushNet demonstrates strong generalization across various image domains, including natural images, paintings, anime, and illustrations. The quality of inpainted images is dependent on the base diffusion model used. Unusually shaped masks or misaligned text prompts can still pose challenges for the model. image inpainting, diffusion models, image generation, plug-and-play, dual-branch diffusion
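The dual-branch design above boils down to injecting masked-image-branch features into the frozen diffusion branch at matching resolutions. A minimal sketch of one injection point is given below, assuming a zero-initialized 1x1 projection so the pre-trained branch is untouched at the start of training; the channel count and the single injection level are simplifications.

```python
import torch
import torch.nn as nn

class ZeroConvInjection(nn.Module):
    """Add features from a masked-image branch into the main diffusion branch.

    The extra branch's feature map is projected through a zero-initialized
    1x1 conv, so the injection strength is learned from zero and the frozen
    UNet's behavior is preserved initially.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, main_feat: torch.Tensor, branch_feat: torch.Tensor) -> torch.Tensor:
        return main_feat + self.proj(branch_feat)

# toy usage: one resolution level of the hierarchy
inject = ZeroConvInjection(channels=320)
main = torch.randn(2, 320, 32, 32)      # frozen UNet features (noisy latent branch)
branch = torch.randn(2, 320, 32, 32)    # masked-image branch features
fused = inject(main, branch)            # identical to `main` before any training
```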
2403.06973 Report Bayesian Diffusion Models for 3D Shape Reconstruction Haiyang Xu, Yu Lei, Zeyuan Chen, Xiang Zhang, Yue Zhao, Yilin Wang, Zhuowen Tu We present Bayesian Diffusion Models (BDM), a prediction algorithm that performs effective Bayesian inference by tightly coupling the top-down (prior) information with the bottom-up (data-driven) procedure via joint diffusion processes. We show the effectiveness of BDM on the 3D shape reconstruction task. Compared to prototypical deep learning data-driven approaches trained on paired (supervised) data-labels (e.g. image-point clouds) datasets, our BDM brings in rich prior information from standalone labels (e.g. point clouds) to improve the bottom-up 3D reconstruction. As opposed to the standard Bayesian frameworks where explicit prior and likelihood are required for the inference, BDM performs seamless information fusion via coupled diffusion processes with learned gradient computation networks. The specialty of our BDM lies in its capability to engage the active and effective information exchange and fusion of the top-down and bottom-up processes where each itself is a diffusion process. We demonstrate state-of-the-art results on both synthetic and real-world benchmarks for 3D shape reconstruction. Presents Bayesian Diffusion Models (BDM), a novel statistical inference algorithm that couples diffusion-based bottom-up (data-driven) and top-down (prior) processes for improved 3D shape reconstruction. Addresses the limitations of traditional Bayesian inference methods in leveraging large-scale datasets and complex deep learning models, particularly in scenarios with limited paired data-labels. Introduces two fusion strategies: BDM-M (Merging), a learnable paradigm that implicitly merges knowledge from prior and reconstruction models, and BDM-B (Blending), a training-free method that explicitly combines point clouds from both processes. Demonstrates state-of-the-art results on synthetic (ShapeNet-R2N2) and real-world (Pix3D) 3D shape reconstruction benchmarks. Shows significant improvement over baseline methods, particularly when training data for reconstruction is scarce. Ablation studies confirm the effectiveness of prior integration timing, duration, and ratio in enhancing reconstruction quality. BDM currently requires both prior and data-driven processes to be diffusion-based. The explicit point cloud representation used in BDM-B may limit its applicability to implicit representations. bayesian inference, diffusion models, 3d shape reconstruction, prior integration, deep learning
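The training-free blending variant (BDM-B) described above explicitly mixes points coming from the prior diffusion process into the data-driven reconstruction at a given step. Below is a toy sketch of that mixing; the uniform random substitution and the blend ratio are illustrative assumptions rather than the paper's schedule.

```python
import numpy as np

def blend_point_clouds(recon_pts, prior_pts, blend_ratio=0.25, rng=None):
    """Replace a fraction of the reconstruction (bottom-up) points with points
    drawn from the prior (top-down) process at the same diffusion step."""
    rng = np.random.default_rng() if rng is None else rng
    n = recon_pts.shape[0]
    k = int(blend_ratio * n)
    replace_idx = rng.choice(n, size=k, replace=False)           # slots to overwrite
    donor_idx = rng.choice(prior_pts.shape[0], size=k, replace=False)
    blended = recon_pts.copy()
    blended[replace_idx] = prior_pts[donor_idx]
    return blended

# toy usage with two 2048-point clouds
recon = np.random.randn(2048, 3)
prior = np.random.randn(2048, 3)
x_t = blend_point_clouds(recon, prior, blend_ratio=0.25)
```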
2403.06952 Report SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text inputs, such as incorrect spatial relationship or missing objects. In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging. First, SELMA leverages an LLM's in-context learning capability to generate multiple datasets of text prompts that can teach different skills, and then generates the images with a T2I model based on the prompts. Next, SELMA adapts the T2I model to the new skills by learning multiple single-skill LoRA (low-rank adaptation) experts followed by expert merging. Our independent expert fine-tuning specializes multiple models for different skills, and expert merging helps build a joint multi-skill T2I model that can generate faithful images given diverse text prompts, while mitigating the knowledge conflict from different datasets. We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (PickScore, ImageReward, and HPS), as well as human evaluation. Moreover, fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data. Lastly, we show that fine-tuning with images from a weaker T2I model can help improve the generation quality of a stronger T2I model, suggesting promising weak-to-strong generalization in T2I models. This paper introduces SELMA, a novel paradigm that leverages automatically generated, multi-skill image-text datasets to improve the faithfulness of text-to-image (T2I) generation models. Existing T2I models often struggle to generate images that precisely match the details of text inputs. SELMA addresses this by fine-tuning models with skill-specific expert learning and merging, enabling more accurate image generation. SELMA uses a four-stage pipeline: (1) Skill-specific prompt generation using an LLM, (2) Image generation from these prompts using a T2I model, (3) Fine-tuning the T2I model with skill-specific LoRA experts on these image-text pairs, and (4) Merging the LoRA experts to obtain a multi-skill T2I model. SELMA significantly improves the faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks and human preference metrics. Fine-tuning with SELMA's auto-collected image-text pairs shows comparable performance to fine-tuning with ground truth data. Fine-tuning with images from a weaker T2I model can enhance a stronger T2I model's generation quality, indicating weak-to-strong generalization potential. SELMA relies on a strong image generator and an instruction-following LLM. While SELMA enhances text-image alignment, it doesn't guarantee that the resulting model will follow every detail of the text prompts. text-to-image generation, faithfulness, lora, expert merging, synthetic data
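The "expert merging" step above combines several skill-specific LoRA adapters into one multi-skill adapter. A simplified sketch is shown below, assuming plain (weighted) averaging of the LoRA tensors; the exact merging rule and the parameter naming are assumptions for illustration.

```python
import torch

def merge_lora_experts(expert_state_dicts, weights=None):
    """Merge several skill-specific LoRA experts into one set of LoRA weights
    by (weighted) averaging each parameter tensor."""
    n = len(expert_state_dicts)
    weights = [1.0 / n] * n if weights is None else weights
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, expert_state_dicts))
    return merged

# toy usage: three "experts", each with one LoRA A/B pair for a single layer
def toy_expert(rank=4, dim=64, seed=0):
    g = torch.Generator().manual_seed(seed)
    return {
        "unet.attn.to_q.lora_A": torch.randn(rank, dim, generator=g),
        "unet.attn.to_q.lora_B": torch.randn(dim, rank, generator=g),
    }

experts = [toy_expert(seed=s) for s in range(3)]   # e.g. counting / spatial / text-rendering skills
multi_skill_lora = merge_lora_experts(experts)
```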
2403.06951 Report DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Jiawei Liu, Lang Chen, Qian He, Yongdong Zhang The diffusion-based text-to-image model harbors immense potential in transferring reference style. However, current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles. In this paper, we introduce DEADiff to address this issue using the following two strategies: 1) a mechanism to decouple the style and semantics of reference images. The decoupled feature representations are first extracted by Q-Formers which are instructed by different text descriptions. Then they are injected into mutually exclusive subsets of cross-attention layers for better disentanglement. 2) A non-reconstructive learning method. The Q-Formers are trained using paired images rather than the identical target, in which the reference image and the ground-truth image are with the same style or semantics. We show that DEADiff attains the best visual stylization results and optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image, as demonstrated both quantitatively and qualitatively. Our project page is https://tianhao-qi.github.io/DEADiff/. DEADiff is introduced, an encoder-based diffusion model for stylized image generation that maintains text controllability through style and semantic decoupling. Existing encoder-based methods for style transfer in diffusion models often compromise the model's ability to accurately follow text prompts due to semantic interference from the style image. DEADiff uses two Q-Formers with a non-reconstructive learning paradigm to extract disentangled style and content representations, injecting them into separate cross-attention layers of the diffusion U-Net. DEADiff successfully generates stylized images while remaining faithful to text prompts, surpassing previous methods in balancing style accuracy and text controllability. Quantitative and qualitative comparisons, including a user study, demonstrate DEADiff's superior performance in generating high-quality stylized images that adhere to text prompts. Ablation studies confirm the contribution of each component in DEADiff, highlighting the importance of style and semantic decoupling for effective stylized image generation with text control. Future work could focus on further improving style similarity to match the reference image more closely. Exploring the decoupling of more granular, instance-level semantic information is another promising direction. stylized image generation, text-to-image synthesis, diffusion models, style and semantic decoupling, text controllability
2403.06912 Report DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu Radiance fields have demonstrated impressive performance in synthesizing novel views from sparse input views, yet prevailing methods suffer from high training costs and slow inference speed. This paper introduces DNGaussian, a depth-regularized framework based on 3D Gaussian radiance fields, offering real-time and high-quality few-shot novel view synthesis at low costs. Our motivation stems from the highly efficient representation and surprising quality of the recent 3D Gaussian Splatting, although it encounters geometry degradation when input views decrease. In the Gaussian radiance fields, we find that this degradation in scene geometry is primarily linked to the positioning of Gaussian primitives and can be mitigated by a depth constraint. Consequently, we propose a Hard and Soft Depth Regularization to restore accurate scene geometry under coarse monocular depth supervision while maintaining a fine-grained color appearance. To further refine detailed geometry reshaping, we introduce Global-Local Depth Normalization, enhancing the focus on small local depth changes. Extensive experiments on LLFF, DTU, and Blender datasets demonstrate that DNGaussian outperforms state-of-the-art methods, achieving comparable or better results with significantly reduced memory cost, a 25x reduction in training time, and over 3000x faster rendering speed. This paper introduces DNGaussian, a novel view synthesis method using depth-regularized 3D Gaussian radiance fields for real-time, high-quality results with low training costs. Existing radiance field methods for novel view synthesis are computationally expensive and slow, while recent 3D Gaussian Splatting, though efficient, suffers geometry degradation with sparse input views. DNGaussian leverages monocular depth estimates to regularize the 3D Gaussian field using: (1) Hard and Soft Depth Regularization to refine Gaussian positions and opacities and (2) Global-Local Depth Normalization to prioritize small, local depth variations. DNGaussian achieves comparable or better novel view synthesis quality than state-of-the-art methods on LLFF, DTU, and Blender datasets. It significantly reduces memory cost and training time (25x faster) compared to existing techniques. DNGaussian achieves real-time rendering speeds exceeding 300 FPS. Performance degrades with increasing input views due to monocular depth errors. Challenges remain in representing solid color planes and specular regions. novel view synthesis, 3d gaussian radiance fields, depth regularization, few-shot learning, real-time rendering
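The global-local depth normalization above can be illustrated with plain tensor operations: the rendered depth and the monocular estimate are compared after a scale/shift-invariant normalization computed once over the whole map and once per local patch, so small local depth changes still contribute to the loss. The sketch below is a simplified single-loss version under assumed patch size and L1 comparison; the paper's hard/soft regularization additionally controls which Gaussian parameters receive gradients.

```python
import torch
import torch.nn.functional as F

def global_normalize(depth, eps=1e-6):
    """Scale/shift-invariant normalization over the whole depth map (B, H, W)."""
    mean = depth.mean(dim=(-2, -1), keepdim=True)
    std = depth.std(dim=(-2, -1), keepdim=True)
    return (depth - mean) / (std + eps)

def local_normalize(depth, patch=8, eps=1e-6):
    """Normalize depth inside non-overlapping patches so small local depth
    variations are not drowned out by the global depth range."""
    b, h, w = depth.shape
    d = depth.reshape(b, h // patch, patch, w // patch, patch)
    mean = d.mean(dim=(2, 4), keepdim=True)
    std = d.std(dim=(2, 4), keepdim=True)
    return ((d - mean) / (std + eps)).reshape(b, h, w)

def depth_regularization_loss(rendered, mono, patch=8):
    """Toy global-local depth loss against a monocular depth estimate."""
    g = F.l1_loss(global_normalize(rendered), global_normalize(mono))
    l = F.l1_loss(local_normalize(rendered, patch), local_normalize(mono, patch))
    return g + l

rendered = torch.rand(1, 64, 64)   # depth rendered from the Gaussian field
mono = torch.rand(1, 64, 64)       # coarse monocular depth prediction
loss = depth_regularization_loss(rendered, mono)
```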
2403.06908 Report FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, Eric Xing 3D Gaussian splatting has achieved very impressive performance in real-time novel view synthesis. However, it often suffers from over-reconstruction during Gaussian densification where high-variance image regions are covered by a few large Gaussians only, leading to blur and artifacts in the rendered images. We design a progressive frequency regularization (FreGS) technique to tackle the over-reconstruction issue within the frequency space. Specifically, FreGS performs coarse-to-fine Gaussian densification by exploiting low-to-high frequency components that can be easily extracted with low-pass and high-pass filters in the Fourier space. By minimizing the discrepancy between the frequency spectrum of the rendered image and the corresponding ground truth, it achieves high-quality Gaussian densification and alleviates the over-reconstruction of Gaussian splatting effectively. Experiments over multiple widely adopted benchmarks (e.g., Mip-NeRF360, Tanks-and-Temples and Deep Blending) show that FreGS achieves superior novel view synthesis and outperforms the state-of-the-art consistently. Presents FreGS, an innovative 3D Gaussian splatting technique that uses progressive frequency regularization to mitigate over-reconstruction during Gaussian densification, enhancing novel view synthesis. 3D Gaussian splatting, while offering real-time rendering for novel view synthesis, often suffers from over-reconstruction artifacts. This paper addresses this limitation for higher quality rendering. FreGS employs progressive frequency regularization using a frequency annealing technique. It extracts low and high-frequency components with filters in the Fourier space and minimizes discrepancies between the rendered and ground truth image spectra. This process progressively refines Gaussian densification. FreGS consistently outperforms state-of-the-art methods like 3D-GS and Mip-NeRF360 in quantitative metrics like PSNR, SSIM, and LPIPS. The method generates higher quality novel view synthesis with fewer artifacts and finer details compared to existing techniques. Ablation studies confirm the individual contribution of frequency regularization and frequency annealing to the overall performance gain. The current implementation of FreGS is focused on static scenes; handling dynamic scenes remains a challenge. Further investigation is needed to optimize the computational cost of frequency transformations for even faster rendering. novel view synthesis, 3d gaussian splatting, frequency regularization, frequency annealing, gaussian densification
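The frequency regularization above compares the spectra of the rendered image and the ground truth inside a pass band that is annealed from low to high frequencies over training. Below is a simplified single-band sketch using a circular low-pass mask in Fourier space; the mask shape, the L1 spectrum comparison, and the annealing schedule are illustrative assumptions.

```python
import torch

def frequency_loss(rendered, target, radius_frac):
    """Compare the low-frequency parts of two image spectra.

    radius_frac in (0, 1] is the pass-band radius; growing it during training
    moves supervision from coarse structure to fine detail.
    """
    b, c, h, w = rendered.shape
    fr = torch.fft.fftshift(torch.fft.fft2(rendered, norm="ortho"), dim=(-2, -1))
    ft = torch.fft.fftshift(torch.fft.fft2(target, norm="ortho"), dim=(-2, -1))
    yy = torch.arange(h, device=rendered.device, dtype=torch.float32) - h / 2
    xx = torch.arange(w, device=rendered.device, dtype=torch.float32) - w / 2
    yy, xx = torch.meshgrid(yy, xx, indexing="ij")
    radius = radius_frac * (min(h, w) / 2)
    lowpass = ((xx ** 2 + yy ** 2).sqrt() <= radius).to(rendered.dtype)  # (h, w) mask
    return (lowpass * (fr - ft).abs()).mean()

# anneal the pass band from low to high frequencies over training steps
rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
target = torch.rand(1, 3, 64, 64)
for frac in [0.1, 0.3, 0.6, 1.0]:
    loss = frequency_loss(rendered, target, radius_frac=frac)
loss.backward()
```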
2403.06866 Report QUASAR: QUality and Aesthetics Scoring with Advanced Representations Sergey Kastryulin, Denis Prokopenko, Artem Babenko, Dmitry V. Dylov This paper introduces a new data-driven, non-parametric method for image quality and aesthetics assessment, surpassing existing approaches and requiring no prompt engineering or fine-tuning. We eliminate the need for expressive textual embeddings by proposing efficient image anchors in the data. Through extensive evaluations of 7 state-of-the-art self-supervised models, our method demonstrates superior performance and robustness across various datasets and benchmarks. Notably, it achieves high agreement with human assessments even with limited data and shows high robustness to the nature of data and their pre-processing pipeline. Our contributions offer a streamlined solution for assessment of images while providing insights into the perception of visual information. Introduces QUASAR, a data-driven, non-parametric method for unified image quality and aesthetics assessment using image anchors and pre-trained self-supervised models, eliminating the need for prompt engineering or fine-tuning. Addresses the limitations of existing IQA and IAA methods, especially prompt-based approaches, by providing a more robust and generalizable solution that leverages the power of foundation models. 1. Employs image embeddings as anchors representing high and low quality/aesthetics. 2. Uses a pre-trained Image Encoder (explores various self-supervised models) to extract embeddings. 3. Applies an Aggregation Function to compute representative centroids from anchor embeddings. 4. Calculates a score based on cosine similarity between input image embedding and the centroids. QUASAR outperforms existing non-parametric IQA methods and achieves comparable performance to learning-based IAA methods. Demonstrates robustness to the choice of anchor data and pre-processing pipeline, unlike CLIP-IQA. Achieves high agreement with human assessments even with a limited number of anchor samples. Computational cost associated with anchor embedding generation for large datasets. Potential bias introduced by the choice of anchor data, necessitating careful selection and potential for future work in adaptive anchor selection. image quality assessment, image aesthetics assessment, foundation models, self-supervised learning, non-parametric methods
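The anchor-based scoring above needs no training at all: embeddings of high- and low-quality anchor images are aggregated into centroids, and a test image is scored by its relative similarity to the two. The sketch below assumes simple mean centroids and a softmax over cosine similarities; the actual aggregation function and the choice of self-supervised encoder (e.g. DINO) are up to the user.

```python
import torch
import torch.nn.functional as F

def quasar_style_score(image_emb, good_anchor_embs, bad_anchor_embs):
    """Non-parametric quality/aesthetics score from image anchors."""
    good_c = F.normalize(good_anchor_embs.mean(dim=0), dim=-1)   # centroid of good anchors
    bad_c = F.normalize(bad_anchor_embs.mean(dim=0), dim=-1)     # centroid of bad anchors
    e = F.normalize(image_emb, dim=-1)
    sim_good = (e * good_c).sum(dim=-1)
    sim_bad = (e * bad_c).sum(dim=-1)
    return torch.softmax(torch.stack([sim_good, sim_bad], dim=-1), dim=-1)[..., 0]

# toy usage: embeddings would come from a frozen self-supervised encoder
dim = 512
good = torch.randn(100, dim)   # embeddings of pristine / high-aesthetic anchors
bad = torch.randn(100, dim)    # embeddings of degraded / low-aesthetic anchors
test = torch.randn(8, dim)     # embeddings of images to score
scores = quasar_style_score(test, good, bad)   # (8,) values in (0, 1)
```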
2403.06793 Report Boosting Image Restoration via Priors from Pre-trained Models Xiaogang Xu, Shu Kong, Tao Hu, Zhe Liu, Hujun Bao Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet, their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM, with its compact size (<1M parameters), effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising. This paper proposes a novel Pre-Train-Guided Refinement Module (PTG-RM) that leverages off-the-shelf features (OSF) from pre-trained models like CLIP and Stable Diffusion to enhance image restoration networks. Existing image restoration networks struggle to achieve significant performance improvements by simply modifying network structures or increasing model parameters. This work explores a new approach of leveraging rich information contained within pre-trained models to enhance restoration quality. PTG-RM is a lightweight plugin module that refines the output of a target restoration network using OSF. It consists of two components: PTG-SVE (Spatial Varying Enhancement) which determines optimal short- and long-range operations based on OSF, and PTG-CSA (Channel-Spatial Attention) which enhances spatial-channel attention using OSF guidance. PTG-RM significantly improves the performance of various state-of-the-art restoration networks across different tasks, including low-light enhancement, deraining, deblurring, and denoising. The method demonstrates robust generalization ability, enhancing performance even when the refinement module is trained on a different dataset than the target restoration network. User studies confirm that PTG-RM leads to subjectively better restoration results compared to baseline methods. The extent of improvement provided by PTG-RM varies across different experiments and seems to depend on the target network's capacity and task complexity. Future work aims to explore more effective distillation frameworks for extracting refined restoration feature priors from pre-trained models to further improve performance. image restoration, pre-trained models, clip, stable diffusion, refinement module
2403.06775 Report FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation Pengchong Qiao, Lei Shang, Chang Liu, Baigui Sun, Xiangyang Ji, Jie Chen Subject-driven generation has garnered significant interest recently due to its ability to personalize text-to-image generation. Typical works focus on learning the new subject's private attributes. However, an important fact has not been taken seriously that a subject is not an isolated new concept but should be a specialization of a certain category in the pre-trained model. This results in the subject failing to comprehensively inherit the attributes in its category, causing poor attribute-related generations. In this paper, motivated by object-oriented programming, we model the subject as a derived class whose base class is its semantic category. This modeling enables the subject to inherit public attributes from its category while learning its private attributes from the user-provided example. Specifically, we propose a plug-and-play method, Subject-Derived regularization (SuDe). It constructs the base-derived class modeling by constraining the subject-driven generated images to semantically belong to the subject's category. Extensive experiments under three baselines and two backbones on various subjects show that our SuDe enables imaginative attribute-related generations while maintaining subject fidelity. Codes will be open sourced soon at FaceChain (https://github.com/modelscope/facechain). This paper presents a novel perspective for subject-driven generation by modeling a subject as a derived class of its semantic category, allowing it to inherit public attributes while learning private attributes from user-provided examples. One-shot subject-driven generation struggles to create imaginative images, especially for attribute-related prompts, due to the limited information available in a single example image. This paper addresses this challenge by leveraging the pre-trained model's knowledge of the subject's category. The paper proposes Subject Derivation regularization (SuDe), a plug-and-play method that constrains subject-driven generated images to semantically belong to the subject's category using the implicit classifier within the diffusion model. SuDe significantly improves attribute-related generations, enabling the generation of images that better align with attribute-related prompts. The method maintains subject fidelity, ensuring that the generated images still resemble the user-provided subject example. SuDe is effective when combined with different baselines and backbones, demonstrating its versatility and generalizability. The method inherits limitations of the pre-trained diffusion model, such as struggling with text characters on subjects. SuDe's performance may be limited for prompts that describe attributes indirectly related to the subject or its category. subject-driven generation, text-to-image synthesis, diffusion models, one-shot learning, attribute editing
2403.06764 Report An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly customizable and pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical values for deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV. This paper identifies inefficient visual attention in Large Vision-Language Models (LVLMs) and proposes FastV, a plug-and-play method to reduce inference budget without sacrificing performance. LVLMs are computationally expensive, and understanding how they process visual information is crucial for optimizing their efficiency. The paper analyzes attention patterns in LVLMs and finds that image tokens receive disproportionately low attention in deep layers. FastV leverages this by dynamically pruning less important image tokens based on attention scores. FastV significantly reduces computational cost (e.g., 45% reduction in FLOPs for LLaVA-1.5-13B) without performance loss on various vision-language tasks. FastV enables LVLMs to process higher resolution images with the same token budget, improving performance. FastV demonstrates superior performance-efficiency trade-off compared to training with fewer visual tokens. The theoretical FLOPs reduction may differ from actual inference budget due to factors like hardware and framework optimization. Further investigation is needed to understand the differences in how image and text tokens contribute to LLM processing. large vision-language models, inference optimization, attention mechanism, token pruning, computational efficiency
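The pruning step above ranks image tokens by the attention they receive and keeps only the top fraction for the remaining layers. Below is a simplified sketch of that ranking and selection on dummy tensors; in the real method this happens once after an early layer K inside the LLM, and text tokens are never pruned. Shapes and the keep ratio here are illustrative assumptions.

```python
import torch

def prune_visual_tokens(hidden, attn, image_start, image_len, keep_ratio=0.5):
    """Keep the most-attended image tokens and all non-image tokens.

    hidden: (B, N, D) hidden states; attn: (B, H, N, N) attention weights.
    """
    b, n, d = hidden.shape
    img_slice = slice(image_start, image_start + image_len)
    # attention received by each image token, averaged over heads and query positions
    received = attn.mean(dim=1).mean(dim=1)[:, img_slice]          # (B, image_len)
    k = max(1, int(keep_ratio * image_len))
    top = received.topk(k, dim=-1).indices + image_start           # (B, k) absolute positions
    keep = torch.ones(b, n, dtype=torch.bool)
    keep[:, img_slice] = False                                     # drop all image tokens...
    batch_idx = torch.arange(b).unsqueeze(1).expand_as(top)
    keep[batch_idx, top] = True                                    # ...then restore the top-k
    return torch.stack([hidden[i, keep[i]] for i in range(b)])     # same kept count per row

# toy usage: 8 text tokens followed by 16 image tokens
hidden = torch.randn(2, 24, 32)
attn = torch.softmax(torch.randn(2, 4, 24, 24), dim=-1)
pruned = prune_visual_tokens(hidden, attn, image_start=8, image_len=16, keep_ratio=0.5)
print(pruned.shape)   # (2, 16, 32): 8 text tokens + 8 kept image tokens
```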
2403.06738 Report V3D: Video Diffusion Models are Effective 3D Generators Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, Huaping Liu Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency. Our code is available at https://github.com/heheyas/V3D. V3D is a novel 3D generation framework leveraging the world simulation capacity of pre-trained video diffusion models for high-quality object and scene generation. Existing 3D generation methods suffer from limitations like slow optimization, limited model capacity, or reliance on 3D datasets. This work leverages pre-trained video diffusion models' ability to perceive the 3D world and generate consistent multi-view images, leading to high-quality 3D content creation. The method involves fine-tuning video diffusion models on 3D datasets with geometrical consistency priors. For object generation, it fine-tunes on 360° orbit videos. For scene-level synthesis, it integrates a PixelNeRF encoder to accommodate multiple input images and control camera poses. Reconstruction is done using tailored pipelines with space-carving initialization for 3D Gaussians or mesh extraction refined with image-level losses. V3D achieves state-of-the-art performance in both object-centric and scene-level 3D generation. It generates high-quality 3D objects within 3 minutes, outperforming existing methods in terms of fidelity and alignment. For novel view synthesis, V3D demonstrates superior multi-view consistency and reconstruction quality compared to previous methods. The method may struggle with complex objects or scenes, leading to inconsistencies or unreasonable geometries. Future work includes addressing failure cases and further improving the multi-view consistency of generated content. video diffusion models, single image to 3d, novel view synthesis, 3d generation, multi-view consistency
2403.06702 Report Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization Jinlu Zhang, Yiyi Zhou, Qiancheng Zheng, Xiaoxiong Du, Gen Luo, Jun Peng, Xiaoshuai Sun, Rongrong Ji Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging research hot spot in machine learning, which still suffers from low efficiency and poor quality. In this paper, we propose an End-to-End Efficient and Effective network for fast and accurate T3D face generation and manipulation, termed $E^3$-FaceNet. Different from existing complex generation paradigms, $E^3$-FaceNet resorts to a direct mapping from text instructions to 3D-aware visual space. We introduce a novel Style Code Enhancer to enhance cross-modal semantic alignment, alongside an innovative Geometric Regularization objective to maintain consistency across multi-view generations. Extensive experiments on three benchmark datasets demonstrate that $E^3$-FaceNet can not only achieve picture-like 3D face generation and manipulation, but also improve inference speed by orders of magnitude. For instance, compared with Latent3D, $E^3$-FaceNet speeds up the five-view generations by almost 470 times, while still exceeding it in generation quality. Our code is released at https://github.com/Aria-Zhangjl/E3-FaceNet. Proposes $E^3$-FaceNet, an end-to-end efficient and effective network for fast and accurate text-to-3D-aware face generation and manipulation. Existing methods suffer from low efficiency and poor quality, often relying on complex multi-stage pipelines and test-time tuning. Directly maps text instructions to 3D-aware visual space using a StyleNeRF-based architecture. Introduces a Style Code Enhancer for semantic alignment and a Geometric Regularization objective for multi-view consistency. Achieves state-of-the-art generation quality on three benchmark datasets, surpassing existing T3D face methods. Significantly faster inference speed compared to other T3D methods, up to 470 times faster than Latent3D. Enables accurate and efficient text-driven 3D face manipulation. Relies on a pre-trained StyleNeRF model, limiting its generalizability to unseen domains. The diversity of generated 3D faces can be further improved. generative model, cross-modal mapping, text-to-3d face generation, 3d face manipulation, geometric regularization
2403.06517 Report Active Generation for Image Classification Tao Huang, Jiaqi Liu, Shan You, Chang Xu Recently, the growing capabilities of deep generative models have underscored their potential in enhancing image classification accuracy. However, existing methods often demand the generation of a disproportionately large number of images compared to the original dataset, while having only marginal improvements in accuracy. This computationally expensive and time-consuming process hampers the practicality of such approaches. In this paper, we propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model. With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation. It aims to create images akin to the challenging or misclassified samples encountered by the current model and incorporates these generated images into the training set to augment model performance. ActGen introduces an attentive image guidance technique, using real images as guides during the denoising process of a diffusion model. The model's attention on class prompt is leveraged to ensure the preservation of similar foreground object while diversifying the background. Furthermore, we introduce a gradient-based generation guidance method, which employs two losses to generate more challenging samples and prevent the generated images from being too similar to previously generated ones. Experimental results on the CIFAR and ImageNet datasets demonstrate that our method achieves better performance with a significantly reduced number of generated images. This paper presents ActGen, a training-aware approach for enhancing image classification accuracy by actively generating images mimicking challenging or misclassified samples using diffusion models. Existing methods for augmenting image classification with synthetic data lack efficiency, often generating large amounts of redundant data for marginal improvements. ActGen identifies misclassified images as prototypes for hard samples and utilizes attentive image guidance and gradient-based guidance within the diffusion model to generate diverse, challenging augmentations. ActGen significantly improves classification accuracy on ImageNet and CIFAR datasets with a reduced number of generated images compared to previous methods. The attentive image guidance method, incorporating real image guidance and selective guidance with attention masks, ensures fidelity and background diversity in generated images. Gradient-based guidance, utilizing contrastive and adversarial losses, further enhances the diversity and classification difficulty of synthetic images. The computational cost of ActGen, while significantly lower than previous methods, remains higher than traditional training. Future research can explore extending ActGen to other domains beyond image classification. data augmentation, image classification, image generation, diffusion models, active learning
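The training-aware selection in ActGen starts from the images the current classifier gets wrong, which then serve as guidance prototypes for generation. A minimal sketch of that selection loop is shown below; the plain misclassification criterion, the cap on collected items, and the toy model/data are simplifying assumptions (confidence margins or per-sample loss could be used instead).

```python
import torch

@torch.no_grad()
def collect_hard_prototypes(model, loader, device="cpu", max_items=256):
    """Run the current classifier over a loader and keep the images it
    misclassifies, to be used as prototypes for guided image generation."""
    model.eval()
    hard_images, hard_labels = [], []
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=-1)
        wrong = preds != labels
        hard_images.append(images[wrong].cpu())
        hard_labels.append(labels[wrong].cpu())
        if sum(t.shape[0] for t in hard_images) >= max_items:
            break
    return torch.cat(hard_images)[:max_items], torch.cat(hard_labels)[:max_items]

# toy usage with a random "classifier" and random data
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
data = torch.utils.data.TensorDataset(torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,)))
loader = torch.utils.data.DataLoader(data, batch_size=32)
proto_imgs, proto_labels = collect_hard_prototypes(model, loader)
```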
2403.06505 Report Vosh: Voxel-Mesh Hybrid Representation for Real-Time View Synthesis Chenhao Zhang, Yongyang Zhou, Lei Zhang The neural radiance field (NeRF) has emerged as a prominent methodology for synthesizing realistic images of novel views. While neural radiance representations based on voxels or mesh individually offer distinct advantages, excelling in either rendering quality or speed, each has limitations in the other aspect. In response, we propose a pioneering hybrid representation named Vosh, seamlessly combining both voxel and mesh components in hybrid rendering for view synthesis. Vosh is meticulously crafted by optimizing the voxel grid of NeRF and strategically replacing selected voxels with mesh. Therefore, it excels at fast rendering of scenes with simple geometry and textures through its mesh component, while simultaneously enabling high-quality rendering in intricate regions by leveraging the voxel component. The flexibility of Vosh is showcased through adjustable hybrid ratios, giving users control over the balance between rendering quality and speed based on their needs. Experimental results demonstrate that our method achieves a commendable trade-off between rendering quality and speed, and notably delivers real-time performance on mobile devices. Presents Vosh, a novel hybrid representation combining voxels and meshes, for real-time view synthesis with Neural Radiance Fields (NeRF). Addresses limitations in existing NeRF methods that struggle to balance high-quality rendering with real-time performance on mobile devices. Constructs a hybrid representation by: 1) Training an initial high-resolution voxel grid. 2) Converting suitable voxels into a mesh using differentiable surface rendering. 3) Optimizing both voxel and mesh components via hybrid rendering and voxel adjustment. Achieves real-time rendering on mobile devices, including laptops and smartphones. Demonstrates superior rendering quality compared to mesh-based methods, particularly in representing complex scenes. Offers a controllable balance between rendering speed and quality through voxel adjustment and hybrid ratios. Shares limitations with SNeRG and MERF, such as challenges in modeling view-dependent colors for translucent objects. Potential degradation in mesh optimization quality can impact overall rendering quality. neural radiance field, view synthesis, real-time rendering, hybrid representation, mobile devices
2403.06403 Report PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models Qingdong He, Jinlong Peng, Zhengkai Jiang, Xiaobin Hu, Jiangning Zhang, Qiang Nie, Yabiao Wang, Chengjie Wang The recent success of vision foundation models has shown promising performance on 2D perception tasks. However, it is difficult to train a 3D foundation network directly due to limited datasets, and it remains underexplored whether existing foundation models can be lifted to 3D space seamlessly. In this paper, we present PointSeg, a novel training-free paradigm that leverages off-the-shelf vision foundation models to address 3D scene perception tasks. PointSeg can segment anything in a 3D scene by acquiring accurate 3D prompts to align their corresponding pixels across frames. Concretely, we design a two-branch prompts learning structure to construct 3D point-box prompt pairs, combined with a bidirectional matching strategy for accurate point and proposal prompt generation. Then, we adaptively perform iterative post-refinement in cooperation with different vision foundation models. Moreover, we design an affinity-aware merging algorithm to improve the final ensemble masks. PointSeg demonstrates impressive segmentation performance across various datasets, all without training. Specifically, our approach significantly surpasses the state-of-the-art specialist model by 13.4%, 11.3%, and 12% mAP on the ScanNet, ScanNet++, and KITTI-360 datasets, respectively. On top of that, PointSeg can be combined with various segmentation models and even surpasses supervised methods. PointSeg, a novel training-free paradigm leveraging off-the-shelf vision foundation models for 3D scene segmentation. Addresses the limitations of training 3D foundation models due to limited datasets and explores the potential of applying existing VFMs to 3D tasks. Utilizes a two-branch prompts learning structure to generate 3D point-box prompt pairs, refined by bidirectional matching. Employs iterative post-refinement on 2D masks and affinity-aware merging for accurate 3D segmentation. Significantly outperforms state-of-the-art specialist models on ScanNet, ScanNet++, and KITTI-360 datasets (11.3%-13.4% mAP improvement). Demonstrates robust generalization ability across diverse indoor and outdoor 3D scenarios. Effectively incorporates and benefits from various segmentation foundation models, showing improvement transfer from 2D to 3D. Performance can be affected by the accuracy of the underlying 2D foundation models. Future work includes exploring more 3D tasks using foundation models. 3d scene segmentation, foundation models, zero-shot learning, vision foundation models (vfms), point cloud segmentation
2403.06400 Report DivCon: Divide and Conquer for Progressive Text-to-Image Generation Yuhao Jia, Wenhan Tan Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. To further improve T2I models' capability in numerical and spatial reasoning, the layout is employed as an intermedium to bridge large language models and layout-based diffusion models. However, these methods still struggle with generating images from textual prompts with multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks. Our approach divides the layout prediction stage into numerical and spatial reasoning and bounding box prediction. Then, the layout-to-image generation stage is conducted in an iterative manner to reconstruct objects from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K benchmarks and our approach outperforms previous state-of-the-art models with notable margins. In addition, visual results demonstrate that our approach significantly improves the controllability and consistency in generating multiple objects from complex textual prompts. This paper proposes DivCon, a novel divide-and-conquer approach for text-to-image generation that enhances numerical and spatial reasoning capabilities by dividing the task into simpler subtasks. Current text-to-image generation models struggle to accurately generate images from text prompts with multiple objects and complex spatial relationships. DivCon addresses this challenge by decomposing the task, leading to improved accuracy and fidelity in image generation. DivCon divides layout prediction into two steps: (1) numerical and spatial reasoning using LLMs and (2) bounding box prediction. Layout-to-image generation is also a two-step iterative process: (1) initial image synthesis and consistency evaluation and (2) refinement focusing on low-fidelity objects. DivCon significantly outperforms previous state-of-the-art models in numerical and spatial accuracy on HRS and NSR-1K benchmarks. DivCon generates more accurate layouts with less object overlap compared to baselines. Qualitative results showcase DivCon's superior performance in handling complex prompts with multiple objects and intricate spatial arrangements. DivCon still faces challenges in generating objects from certain pattern layouts, particularly those involving significant object overlap. Future work could focus on developing more sophisticated layout-conditioned image generation models to better handle overlapping bounding boxes. text-to-image generation, large language models, diffusion models, divide and conquer, layout-based generation
2403.06381 Report Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, Tiviatis Sim, Kenji Kawaguchi Recent advancements in diffusion models have notably improved the perceptual quality of generated images in text-to-image synthesis tasks. However, diffusion models often struggle to produce images that accurately reflect the intended semantics of the associated text prompts. We examine cross-attention layers in diffusion models and observe a propensity for these layers to disproportionately focus on certain tokens during the generation process, thereby undermining semantic fidelity. To address the issue of dominant attention, we introduce attention regulation, a computation-efficient on-the-fly optimization approach at inference time to align attention maps with the input text prompt. Notably, our method requires no additional training or fine-tuning and serves as a plug-in module on a model. Hence, the generation capacity of the original model is fully preserved. We compare our approach with alternative approaches across various datasets, evaluation metrics, and diffusion models. Experiment results show that our method consistently outperforms other baselines, yielding images that more faithfully reflect the desired concepts with reduced computation overhead. Code is available at https://github.com/YaNgZhAnG-V5/attention_regulation. The paper introduces 'attention regulation,' a method to improve the semantic fidelity of text-to-image synthesis in diffusion models by adjusting attention maps during inference. Diffusion models, while good at generating high-quality images, often struggle to accurately represent the semantics of the input text prompt, leading to missing or misrepresented objects. The method formulates attention map editing as a constrained optimization problem, minimizing the difference between edited and original maps while promoting attention to target tokens. Attention regulation improves semantic alignment, as evidenced by higher CLIP scores and object detection success rates compared to baseline methods. The method is computationally efficient, adding only a 48% overhead to inference time, significantly less than other approaches. Attention regulation maintains its effectiveness across various diffusion models (Stable Diffusion 1.4, 1.5, 2, and 2.1) and datasets. The method may generate images that deviate from human knowledge or fuse concepts in undesired ways due to limitations in the diffusion model's learned features. Future work could explore methods to align the model's understanding of features with human knowledge to further improve semantic fidelity. diffusion models, text-to-image synthesis, semantic fidelity, attention mechanism, constrained optimization
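The attention regulation above is an inference-time optimization over cross-attention maps: boost the total attention assigned to target tokens while staying close to the original map. The toy version below runs a few gradient steps on an additive edit to the attention logits of a single map; the paper formulates this as a constrained optimization inside the diffusion model's cross-attention layers, so the loss weights and step counts here are illustrative assumptions.

```python
import torch

def regulate_attention(attn, target_idx, strength=1.0, steps=10, lr=0.1):
    """Edit a (queries x text-tokens) attention map so target tokens receive
    more mass, while penalizing deviation from the original map."""
    edit = torch.zeros_like(attn, requires_grad=True)
    opt = torch.optim.Adam([edit], lr=lr)
    for _ in range(steps):
        edited = torch.softmax(torch.log(attn.clamp_min(1e-8)) + edit, dim=-1)
        fidelity = (edited - attn).pow(2).mean()                  # stay close to the original
        boost = edited[..., target_idx].sum(dim=-1).mean()        # attention on target tokens
        loss = fidelity - strength * boost
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.softmax(torch.log(attn.clamp_min(1e-8)) + edit, dim=-1)

# toy usage: 64 query positions over 8 text tokens, boost tokens 2 and 5
attn = torch.softmax(torch.randn(64, 8), dim=-1)
regulated = regulate_attention(attn, target_idx=[2, 5])
print(regulated[..., [2, 5]].sum(-1).mean() > attn[..., [2, 5]].sum(-1).mean())  # expected: tensor(True)
```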
2403.06356 Report Video Generation with Consistency Tuning Chaoyi Wang, Yaozhe Song, Yafeng Zhang, Jun Pei, Lijie Xia, Jianpo Liu Currently, various studies have explored the generation of long videos. However, the generated frames in these videos often exhibit jitter and noise. Therefore, in order to generate videos without such noise, we propose a novel framework composed of four modules: separate tuning module, average fusion module, combined tuning module, and inter-frame consistency module. By applying our newly proposed modules sequentially, the consistency of the background and foreground in each video frame is optimized. In addition, experimental results demonstrate that videos generated by our method exhibit higher quality than those of state-of-the-art methods. This paper introduces a novel framework for generating long videos with enhanced consistency and reduced noise, addressing the issue of jitter and noise in existing video generation methods. Generating high-quality long videos is a challenging task with limitations in existing methods. This work aims to improve the consistency and quality of generated video frames. The framework consists of four key modules: 1) Separate Tuning Module for extracting foreground and background, 2) Average Fusion Module for optimizing consistency, 3) Combined Tuning Module for fine-tuning with a focus on foreground and background, and 4) Inter-frame Consistency Module for ensuring temporal smoothness. Initial experiments utilizing the first two modules demonstrate promising results in generating videos with improved consistency. Visual comparisons with state-of-the-art methods highlight the effectiveness of the proposed approach. Further experiments incorporating the remaining modules are underway to showcase the full potential of the framework. Currently, only the first two modules have been experimentally validated. Quantitative evaluation metrics for video quality are not yet provided. video generation, diffusion models, consistency tuning, long videos, deep learning
2403.06269 Report FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing Youyuan Zhang, Xuan Ju, James J. Clark Diffusion models have demonstrated remarkable capabilities in text-to-image and text-to-video generation, opening up possibilities for video editing based on textual input. However, the computational cost associated with sequential sampling in diffusion models poses challenges for efficient video editing. Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion, making real-time applications impractical. In this work, we propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs). By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time. Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule. This results in improved speed advantages, as fewer sampling steps can be used while maintaining comparable generation quality. Experimental results validate the state-of-the-art performance and speed advantages of FastVideoEdit across evaluation metrics encompassing editing speed, temporal consistency, and text-video alignment. This paper introduces FastVideoEdit, an efficient and zero-shot video editing approach based on consistency models, for high-quality text-driven video editing. Existing text-driven video editing methods relying on diffusion models often suffer from high computational costs due to sequential sampling or additional condition extraction steps, making them impractical for real-time applications. FastVideoEdit tackles this challenge by leveraging the efficiency and content-preserving nature of consistency models. FastVideoEdit utilizes the self-consistency property of consistency models to allow direct mapping between source and target videos without DDIM inversion. It introduces a special variance schedule and incorporates techniques like Batch Attention Control, background preservation via latent replacement, and TokenFlow for enhanced temporal consistency and background preservation. FastVideoEdit achieves state-of-the-art performance on the TGVE 2023 dataset across metrics including temporal consistency, text-video alignment, and editing speed. The method significantly reduces editing time compared to previous approaches by eliminating the need for DDIM inversion and additional condition extraction. FastVideoEdit demonstrates superior background preservation capabilities compared to existing methods, particularly when editing foreground object attributes. The performance of FastVideoEdit may require fine-tuning of hyperparameters for each specific video editing task. While generally effective, there is no guarantee of successful editing for every case, as performance can be influenced by factors like input data quality and the complexity of the edit. video editing, diffusion models, consistency models, text-to-video editing, zero-shot learning
2403.06243 Report BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering Xinmin Qiu, Congying Han, Zicheng Zhang, Bonan Li, Tiande Guo, Pingyu Wang, Xuecheng Nie Developing blind video deflickering (BVD) algorithms to enhance video temporal consistency is gaining importance amid the flourishing of image processing and video generation. However, the intricate nature of video data complicates the training of deep learning methods, leading to high resource consumption and instability, notably under severe lighting flicker. This underscores the critical need for a compact representation beyond pixel values to advance BVD research and applications. Inspired by the classic scale-time equalization (STE), our work introduces the histogram-assisted solution, called BlazeBVD, for high-fidelity and rapid BVD. Compared with STE, which directly corrects pixel values by temporally smoothing color histograms, BlazeBVD leverages smoothed illumination histograms within STE filtering to ease the challenge of learning temporal data using neural networks. Technically, BlazeBVD begins by condensing pixel values into illumination histograms that precisely capture flickering and local exposure variations. These histograms are then smoothed to produce a singular frame set, filtered illumination maps, and exposure maps. Resorting to these deflickering priors, BlazeBVD utilizes a 2D network to restore faithful and consistent texture impacted by lighting changes or localized exposure issues. BlazeBVD also incorporates a lightweight 3D network to amend slight temporal inconsistencies, avoiding the resource consumption issue. Comprehensive experiments on synthetic, real-world, and generated videos showcase the superior qualitative and quantitative results of BlazeBVD, achieving inference speeds up to 10x faster than state-of-the-art methods. Proposes BlazeBVD, a histogram-assisted blind video deflickering method that uses deflickering priors from Scale-Time Equalization (STE) to simplify the complexity and resource demands of deflickering. Existing deep learning methods for blind video deflickering (BVD) are computationally expensive and struggle with severe lighting flicker, demanding a more compact representation than pixel values. BlazeBVD prepares deflickering priors (filtered illumination map, singular frame set, exposure maps) from STE. It then uses a Global Flicker Removal Module (GFRM) guided by the filtered illumination map and a Local Flicker Removal Module (LFRM) based on optical flow warping and exposure maps. Finally, a lightweight spatio-temporal network enhances temporal consistency. BlazeBVD achieves superior qualitative and quantitative results on synthetic, real-world, and generated videos, outperforming state-of-the-art methods. It effectively tackles both illumination fluctuations and over-/under-exposure challenges, preserving texture details. BlazeBVD achieves inference speeds up to 10x faster than previous methods due to its efficient histogram-based representation and modular design. Inaccurate optical flow estimation in LFRM can lead to minor artifacts. Balancing faithfulness and coherence in generated videos requires further investigation. video deflickering, histogram, temporal consistency, scale-time equalization, exposure correction
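Since the entry hinges on replacing raw pixel values with temporally smoothed illumination histograms, the NumPy sketch below shows that prior in isolation. It is a simplified stand-in for the paper's STE-based preprocessing; the luma weighting, bin count, and Gaussian temporal filter are assumptions, not the authors' exact choices.

```python
import numpy as np

def illumination_histograms(frames, bins=64):
    """frames: (T, H, W, 3) uint8 video; returns (T, bins) normalized luminance histograms."""
    # Rec. 601 luma as a simple illumination proxy
    luma = 0.299 * frames[..., 0] + 0.587 * frames[..., 1] + 0.114 * frames[..., 2]
    hists = np.stack([np.histogram(f, bins=bins, range=(0, 255))[0] for f in luma])
    return hists / hists.sum(axis=1, keepdims=True)

def temporal_smooth(hists, sigma=2.0, radius=5):
    """Gaussian smoothing of histograms along the time axis (a crude STE-style prior)."""
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(hists, ((radius, radius), (0, 0)), mode="edge")
    return np.stack([
        (padded[i:i + 2 * radius + 1] * kernel[:, None]).sum(axis=0)
        for i in range(hists.shape[0])
    ])

# toy usage on random frames
video = np.random.randint(0, 256, size=(30, 64, 64, 3), dtype=np.uint8)
smoothed = temporal_smooth(illumination_histograms(video))
print(smoothed.shape)  # (30, 64)
```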
2403.06213 Report $V_kD:$ Improving Knowledge Distillation using Orthogonal Projections Roy Miles, Ismail Elezi, Jiankang Deng Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. This method is derived from a small set of core principles, which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components, our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method, we apply it to object detection and image generation, whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available: https://github.com/roymiles/vkd The paper presents $V_kD$, a novel knowledge distillation method using orthogonal projections to maximize knowledge transfer by preserving intra-batch feature similarity. Existing knowledge distillation methods often rely on heuristics, lack adaptability to diverse tasks, and introduce significant computational overhead. The method utilizes an orthogonal projection layer, derived from the principle of preserving feature similarity, and efficiently implemented via projection onto the Stiefel manifold. It also introduces task-specific normalization to improve performance in both discriminative and generative tasks. $V_kD$ achieves state-of-the-art performance on ImageNet, outperforming previous methods by up to 4.4%. It demonstrates consistent improvements in object detection tasks using ViDT architecture. For data-limited image generation, $V_kD$ with feature whitening outperforms KD-DLGAN without needing auxiliary diversity losses. The paper mainly evaluates the method on visual tasks; further exploration in other domains is needed. Investigating the impact of different kernel choices for the similarity preservation constraint could be beneficial. knowledge distillation, orthogonal projection, feature similarity, task-specific normalization, vision transformers
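The entry's central component is an orthogonal projection between student and teacher feature spaces. Below is a minimal PyTorch sketch, assuming the projector is a bias-free linear layer kept (semi-)orthogonal with torch.nn.utils.parametrizations.orthogonal and trained with a plain feature-matching loss; the task-specific normalisation described in the paper is omitted, and this is not the released code.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class OrthogonalProjector(nn.Module):
    """Maps student features to the teacher dimension with a (semi-)orthogonal matrix."""
    def __init__(self, d_student, d_teacher):
        super().__init__()
        # the parametrization keeps the weight on the Stiefel manifold during training
        self.proj = orthogonal(nn.Linear(d_student, d_teacher, bias=False))

    def forward(self, feats):
        return self.proj(feats)

def distillation_loss(student_feats, teacher_feats, projector):
    # simple feature-matching objective; task-specific normalisation omitted
    return (projector(student_feats) - teacher_feats).pow(2).mean()

# toy usage: batch of 8, student dim 384, teacher dim 768
projector = OrthogonalProjector(384, 768)
s = torch.randn(8, 384)
t = torch.randn(8, 768)
loss = distillation_loss(s, t, projector)
loss.backward()
print(loss.item())
```

The design intuition reported in the entry is that an orthogonal map preserves intra-batch feature similarity, so the projector cannot "cheat" by collapsing or rescaling the student's representation.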
2403.06168 Report DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation Xiaobin Hu, Xu Peng, Donghao Luo, Xiaozhong Ji, Jinlong Peng, Zhengkai Jiang, Jiangning Zhang, Taisong Jin, Chengjie Wang, Rongrong Ji Due to the difficulty and labor-intensive nature of obtaining highly accurate matting annotations, only a limited amount of such labels is publicly available. To tackle this challenge, we propose DiffuMatting, which inherits the strong "generate everything" ability of diffusion and endows it with the power of "matting anything". Our DiffuMatting can 1) act as an anything-matting factory with highly accurate annotations and 2) work well with community LoRAs or various conditional control approaches to achieve community-friendly art design and controllable generation. Specifically, inspired by green-screen-matting, we aim to teach the diffusion model to paint on a fixed green screen canvas. To this end, a large-scale greenscreen dataset (Green100K) is collected as a training dataset for DiffuMatting. Secondly, a green background control loss is proposed to keep the drawing board as a pure green color to distinguish the foreground and background. To ensure the synthesized object has more edge details, a detailed-enhancement loss on the transition boundary is proposed as a guideline to generate objects with more complicated edge structures. Aiming to simultaneously generate the object and its matting annotation, we build a matting head to perform green-color removal in the latent space of the VAE decoder. Our DiffuMatting shows several potential applications (e.g., matting-data generator, community-friendly art design and controllable generation). As a matting-data generator, DiffuMatting synthesizes general object and portrait matting sets, effectively reducing the relative MSE error by 15.4% in General Object Matting and 11.4% in Portrait Matting tasks. This paper introduces DiffuMatting, a novel diffusion-based model that generates arbitrary objects with accompanying high-quality matting-level annotations. Creating matting-level annotations is labor-intensive and existing datasets are limited. DiffuMatting addresses this by acting as a data factory for high-quality synthetic matting data, benefiting downstream tasks like image composition and matting algorithm training. The model is trained on a newly created Green100K dataset, containing images with green-screen backgrounds and accurate matting annotations. It leverages a green-background control loss for background consistency and a detailed-enhancement loss for fine edge details. A dedicated matting head in the VAE decoder extracts matting masks, further refined by a GreenPost process. DiffuMatting outperforms existing methods in generating clean green-screen objects. Synthetic data generated by DiffuMatting improves the performance of general object and portrait matting tasks, reducing MSE errors by 15.4% and 11.4% respectively. DiffuMatting is versatile and compatible with LoRA models and ControlNet for customized styles and controllable image editing. Currently limited to generating matting annotations for green-screen images, requiring further exploration for general backgrounds. Potential for misuse in illicit industries, necessitating explicit markings on generated content. matting generation, diffusion models, synthetic data, controllable generation, image composition
2403.06135 Report MACE: Mass Concept Erasure in Diffusion Models Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, Adams Wai-Kin Kong The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper, we introduce MACE, a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast, MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning, collectively eliminating the information of undesirable concepts. Furthermore, MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure, celebrity erasure, explicit content erasure, and artistic style erasure. Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at https://github.com/Shilin-LU/MACE. MACE is a finetuning framework for Mass Concept Erasure in text-to-image diffusion models, capable of removing a large number of concepts (up to 100) while maintaining a balance between generality and specificity. Concept erasure is crucial for mitigating risks associated with large-scale T2I models, such as generating harmful, copyrighted, or offensive content, which current methods struggle to handle effectively. MACE leverages closed-form cross-attention refinement to remove residual information of target concepts and employs LoRA finetuning with concept-focal importance sampling to erase intrinsic concept information. It also integrates multiple LoRAs to prevent interference and catastrophic forgetting. MACE outperforms SOTA methods in erasing objects, celebrities, explicit content, and artistic styles while preserving unrelated concepts. It effectively removes concepts even when prompted with synonyms, demonstrating strong generality. MACE scales well to erasing a large number of concepts (100) with minimal impact on the generation of unrelated concepts. Performance slightly declines when scaling from 10 to 100 erased concepts. Further research is needed to enhance scalability for erasing thousands of concepts in future models. concept erasure, text-to-image synthesis, diffusion models, ethical ai, lora
2403.06098 Report VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models Wenhao Wang, Yi Yang The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advancements in video generation and potential applications. However, Sora, along with other text-to-video diffusion models, is highly reliant on prompts, and there is no publicly available dataset that features a study of text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 Million unique text-to-Video Prompts from real users. Additionally, this dataset includes 6.69 million videos generated by four state-of-the-art diffusion models, alongside some related data. We initially discuss the curation of this large-scale dataset, a process that is both time-consuming and costly. Subsequently, we underscore the need for a new prompt dataset specifically designed for text-to-video generation by illustrating how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Our extensive and diverse dataset also opens up many exciting new research areas. For instance, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models to develop better, more efficient, and safer models. The project (including the collected dataset VidProM and related code) is publicly available at https://vidprom.github.io under the CC-BY-NC 4.0 License. Introduces VidProM, the first large-scale dataset of 1.67 million unique text-to-video prompts and 6.69 million corresponding videos generated using four state-of-the-art diffusion models. Addresses the lack of publicly available datasets for studying text-to-video prompts, crucial for advancing text-to-video generation models like Sora. Collects prompts from Pika Discord channels, generates videos using Pika, Text2Video-Zero, VideoCraft2, and ModelScope. Embeds prompts using OpenAI's text-embedding-3-large and assigns NSFW probabilities using Detoxify. VidProM contains 1.67M unique prompts and 6.69M videos, significantly more diverse than existing text-to-image prompt datasets. Analysis reveals text-to-video prompts are more dynamic, complex, and longer than text-to-image prompts, highlighting the need for a dedicated dataset. Benchmarks show existing fake image detection methods generalize poorly to fake videos, demonstrating the dataset's value for developing specialized detectors. Current generated videos are short and may not reflect the highest quality possible. Dataset currently lacks videos generated by advanced models like Sora, planned for future updates. text-to-video generation, diffusion models, prompt engineering, dataset, fake video detection
2403.06092 Report Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis? Hanxin Zhu, Tianyu He, Xin Li, Bingchen Li, Zhibo Chen Neural Radiance Field (NeRF) has achieved superior performance for novel view synthesis by modeling the scene with a Multi-Layer Perceptron (MLP) and a volume rendering procedure; however, when fewer known views are given (i.e., few-shot view synthesis), the model is prone to overfit the given views. To handle this issue, previous efforts have been made towards leveraging learned priors or introducing additional regularizations. In contrast, in this paper, we for the first time provide an orthogonal method from the perspective of network structure. Given the observation that trivially reducing the number of model parameters alleviates the overfitting issue, but at the cost of missing details, we propose the multi-input MLP (mi-MLP) that incorporates the inputs (i.e., location and viewing direction) of the vanilla MLP into each layer to prevent the overfitting issue without harming detailed synthesis. To further reduce the artifacts, we propose to model colors and volume density separately and present two regularization terms. Extensive experiments on multiple datasets demonstrate that: 1) although the proposed mi-MLP is easy to implement, it is surprisingly effective as it boosts the PSNR of the baseline from 14.73 to 24.23. 2) the overall framework achieves state-of-the-art results on a wide range of benchmarks. We will release the code upon publication. This paper introduces mi-MLP, a multi-input MLP designed to address overfitting in few-shot novel view synthesis by incorporating location and viewing direction inputs into each layer, enhancing flexibility without sacrificing model capacity. Few-shot novel view synthesis with NeRF suffers from overfitting due to limited training views, resulting in poor generalization and artifacts. This work explores network structure modification as an alternative solution. The paper proposes mi-MLP, incorporating inputs into every MLP layer. Additionally, it proposes separate modeling of color and volume density with different positional encoding frequencies. Two regularization techniques are introduced: background regularization for object-centric scenes and sampling annealing for near-field artifacts. mi-MLP significantly improves PSNR compared to the baseline (e.g., 14.73 to 24.23 on Blender). The proposed method achieves state-of-the-art results on Blender, LLFF, and Shiny datasets. Ablation studies confirm the effectiveness of mi-MLP, separate modeling, and regularization techniques. Consistency for complex textures or thin structures is limited due to no constraints on unknown views. Future work includes exploring additional regularizations and priors for improved novel view synthesis. novel view synthesis, neural radiance fields (nerf), few-shot learning, multi-layer perceptron (mlp), overfitting
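The core structural change described above, re-injecting the encoded location and viewing direction into every MLP layer, is simple to write down. The sketch below is a minimal illustration assuming concatenation at each hidden layer and a single head for color and density; the paper's separate color/density branches and regularization terms are omitted.

```python
import torch
import torch.nn as nn

class MiMLP(nn.Module):
    """MLP where the (encoded) location and view direction are re-injected at every layer."""
    def __init__(self, in_dim, hidden=256, depth=8, out_dim=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(in_dim, hidden)] +
            [nn.Linear(hidden + in_dim, hidden) for _ in range(depth - 1)]
        )
        self.head = nn.Linear(hidden, out_dim)   # e.g. RGB + density

    def forward(self, x):
        h = torch.relu(self.layers[0](x))
        for layer in self.layers[1:]:
            h = torch.relu(layer(torch.cat([h, x], dim=-1)))  # inputs skip into each layer
        return self.head(h)

# toy usage: 1024 samples with a 63-dim positional encoding
model = MiMLP(in_dim=63)
print(model(torch.randn(1024, 63)).shape)  # (1024, 4)
```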
2403.05907 Report Lightning NeRF: Efficient Hybrid Scene Representation for Autonomous Driving Junyi Cao, Zhichao Li, Naiyan Wang, Chao Ma Recent studies have highlighted the promising application of NeRF in autonomous driving contexts. However, the complexity of outdoor environments, combined with the restricted viewpoints in driving scenarios, complicates the task of precisely reconstructing scene geometry. Such challenges often lead to diminished quality in reconstructions and extended durations for both training and rendering. To tackle these challenges, we present Lightning NeRF. It uses an efficient hybrid scene representation that effectively utilizes the geometry prior from LiDAR in autonomous driving scenarios. Lightning NeRF significantly improves the novel view synthesis performance of NeRF and reduces computational overheads. Through evaluations on real-world datasets, such as KITTI-360, Argoverse2, and our private dataset, we demonstrate that our approach not only exceeds the current state-of-the-art in novel view synthesis quality but also achieves a five-fold increase in training speed and a ten-fold improvement in rendering speed. Codes are available at https://github.com/VISION-SJTU/Lightning-NeRF . This paper introduces Lightning-NeRF, an efficient novel view synthesis framework for large-scale outdoor scenes that leverages point clouds and images in autonomous driving scenarios. Existing NeRF methods struggle to balance high-fidelity reconstruction with computational efficiency, especially in outdoor driving scenarios where scenes are vast and computationally expensive to process. The proposed method employs a hybrid scene representation that explicitly models density with a voxel grid initialized by LiDAR point clouds and implicitly models color with shallow MLPs. It also incorporates efficient background modeling and color decomposition to enhance rendering quality and extrapolation ability. Lightning-NeRF outperforms state-of-the-art methods in novel view synthesis quality on KITTI-360 and Argoverse2 datasets. It achieves a five-fold improvement in training speed and a ten-fold improvement in rendering speed compared to previous methods. The method demonstrates superior extrapolation capabilities, vital for simulating novel views in autonomous driving scenarios. The method assumes the availability of LiDAR data, which might not be universally applicable. Future work could explore dynamically adjusting the resolution of the hybrid representation for better efficiency. neural radiance fields, novel view synthesis, autonomous driving, lidar, hybrid scene representation
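To make the hybrid representation concrete, here is a generic PyTorch sketch (not the released code) of an explicit density voxel grid, which in the paper's setting could be seeded from LiDAR occupancy, combined with a shallow MLP color head. The background model, color decomposition, and grid resolution are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridField(nn.Module):
    """Explicit density voxel grid + shallow MLP color head (generic sketch)."""
    def __init__(self, res=128, feat_dim=8):
        super().__init__()
        self.density = nn.Parameter(torch.zeros(1, 1, res, res, res))         # could be seeded from LiDAR occupancy
        self.features = nn.Parameter(torch.zeros(1, feat_dim, res, res, res))  # per-voxel appearance features
        self.color_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, xyz, viewdir):
        # xyz in [-1, 1]^3; grid_sample expects sample coordinates shaped (N, 1, 1, P, 3)
        grid = xyz.view(1, 1, 1, -1, 3)
        sigma = F.grid_sample(self.density, grid, align_corners=True).view(-1, 1)
        feat = F.grid_sample(self.features, grid, align_corners=True).view(self.features.shape[1], -1).t()
        rgb = torch.sigmoid(self.color_mlp(torch.cat([feat, viewdir], dim=-1)))
        return F.softplus(sigma), rgb

field = HybridField()
sigma, rgb = field(torch.rand(4096, 3) * 2 - 1, F.normalize(torch.randn(4096, 3), dim=-1))
print(sigma.shape, rgb.shape)  # (4096, 1) (4096, 3)
```

The split reflects the entry's design choice: density lives in a cheap explicit grid that benefits from a geometry prior, while only color needs a small learned network.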
2403.05846 Report Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the process by which the encoder produces the text representation is unknown. We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the Diffusion Lens, we perform an extensive analysis of two recent T2I models. Exploring compound prompts, we find that complex scenes describing multiple objects are composed progressively and more slowly compared to simple scenes; Exploring knowledge retrieval, we find that representation of uncommon concepts requires further computation compared to common concepts, and that knowledge retrieval is gradual across layers. Overall, our findings provide valuable insights into the text encoder component in T2I pipelines. The paper introduces Diffusion Lens, a novel method for analyzing the internal workings of text encoders in text-to-image diffusion models by generating images from intermediate layers of the encoder. The text encoder is a key component of text-to-image generation, yet its internal mechanisms are poorly understood. This work provides a new tool to analyze how these encoders represent and process language. The method extracts the hidden state representations from different layers of the text encoder, passes them through the final layer norm, and feeds them to the diffusion model to generate images. These images provide a visual representation of how the text is encoded at each layer. Complex concepts are composed gradually, with simpler concepts emerging in earlier layers and relationships between concepts solidifying in later layers. Common concepts are retrieved earlier in the network compared to uncommon concepts, suggesting gradual knowledge retrieval. Different text encoders (T5 vs. CLIP) exhibit different representation building patterns, potentially influenced by training data and objectives. The study primarily relies on automatically generated prompts, which might not fully represent the complexity of human language. The method requires manual analysis of generated images to derive insights, limiting the scale of analysis. text-to-image generation, diffusion models, text encoder, interpretability, conceptual combination
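The procedure described above (take an intermediate hidden state of the text encoder, apply the final layer norm, and decode it with the unchanged diffusion model) can be approximated with Hugging Face diffusers, as in the hedged sketch below. The checkpoint name, the choice of layer 6, and the use of prompt_embeds are illustrative assumptions rather than the authors' exact setup, and a CUDA device is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# assumption: a standard CLIP-based Stable Diffusion checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

prompt = "a blue teapot next to a red apple"
tokens = pipe.tokenizer(prompt, padding="max_length",
                        max_length=pipe.tokenizer.model_max_length,
                        truncation=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    out = pipe.text_encoder(**tokens, output_hidden_states=True)

layer = 6  # intermediate encoder layer to visualize
hidden = out.hidden_states[layer]
# apply the encoder's final layer norm so the embedding matches what the UNet expects
embeds = pipe.text_encoder.text_model.final_layer_norm(hidden)

image = pipe(prompt_embeds=embeds, num_inference_steps=30).images[0]
image.save(f"diffusion_lens_layer_{layer}.png")
```

Sweeping the layer index and comparing the resulting images is the kind of analysis the entry describes, e.g. watching when an uncommon concept first becomes recognizable.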
2403.05726 Report Augmentations vs Algorithms: What Works in Self-Supervised Learning Warren Morningstar, Alex Bijamov, Chris Duvarney, Luke Friedman, Neha Kalibhat, Luyang Liu, Philip Mansfield, Renan Rojas-Gomez, Karan Singhal, Bradley Green, Sushant Prakash We study the relative effects of data augmentations, pretraining algorithms, and model architectures in Self-Supervised Learning (SSL). While the recent literature in this space leaves the impression that the pretraining algorithm is of critical importance to performance, understanding its effect is complicated by the difficulty in making objective and direct comparisons between methods. We propose a new framework which unifies many seemingly disparate SSL methods into a single shared template. Using this framework, we identify aspects in which methods differ and observe that in addition to changing the pretraining algorithm, many works also use new data augmentations or more powerful model architectures. We compare several popular SSL methods using our framework and find that many algorithmic additions, such as prediction networks or new losses, have a minor impact on downstream task performance (often less than $1\%$), while enhanced augmentation techniques offer more significant performance improvements ($2-4\%$). Our findings challenge the premise that SSL is being driven primarily by algorithmic improvements, and suggest instead a bitter lesson for SSL: that augmentation diversity and data / model scale are more critical contributors to recent advances in self-supervised learning. This paper investigates the relative contributions of data augmentations, pretraining algorithms, and model architectures to the performance of self-supervised learning (SSL), demonstrating that data augmentation diversity and model scale are more impactful than algorithmic innovations. The importance of this study lies in clarifying the key drivers of SSL performance, which has been often attributed to algorithmic improvements, and providing insights for future research directions. The authors propose a unified framework encompassing popular SSL methods and conduct experiments comparing SimCLR, BYOL, SwAV, MoCo v2, DINO, and MoCo v3 with varying augmentations, algorithms, and architectures. Increasing augmentation diversity significantly improves downstream task performance across all methods, contributing to a substantial portion of performance gains in recent SSL advances. Algorithmic enhancements, such as momentum encoders and prediction networks, show a smaller performance impact than augmentations, with their effects varying across different methods. Increasing model size, specifically switching from ResNet-50 to ViT-B, leads to a notable performance improvement, supporting the significance of model scale in SSL. The study primarily focuses on instance-based joint embedding methods, excluding other SSL paradigms such as generative models. While the paper demonstrates the importance of augmentations, further investigation is needed to understand the interplay between specific augmentations and SSL algorithms, especially in the context of increasingly diverse augmentations. self-supervised learning, data augmentation, pretraining algorithms, model architectures, representation learning
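Because the entry's main claim is that augmentation diversity drives most of the gains, the torchvision sketch below contrasts a basic SimCLR-style view-generation stack with a more diverse one that adds blur and solarization. These stacks are representative of the families discussed, not the paper's exact augmentation sets or parameters.

```python
from torchvision import transforms

# basic SimCLR-style view generation
basic = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

# a more diverse stack (blur, solarization) in the spirit of BYOL/DINO-style recipes
diverse = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomSolarize(threshold=128, p=0.2),
    transforms.ToTensor(),
])

# two random "views" of the same PIL image feed the joint-embedding loss, e.g.:
# view1, view2 = diverse(img), diverse(img)
```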
2403.05438 Report VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Xiangyang Ji, Wangmeng Zuo Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. On the contrary, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict less noisy latent, adding more photo-realistic details. We have conducted experiments in extensive prompts under the combination of various T2V and T2I. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at https://github.com/YBYBZhang/VideoElevator. VideoElevator is a training-free and plug-and-play method that enhances the quality of text-to-video diffusion models (T2V) by integrating them with various text-to-image diffusion models (T2I). Existing T2V models often produce videos with lower quality and fidelity than T2I models due to the limitations of training video datasets. VideoElevator leverages the superior capabilities of T2I models to improve the quality of T2V generated videos. VideoElevator decomposes each sampling step into temporal motion refining and spatial quality elevating. Temporal motion refining enhances motion consistency using a low-pass filter and T2V-based SDEdit. Spatial quality elevating employs an inflated T2I to add high-quality details. To ensure interaction between models, VideoElevator projects noise latents to clean latents using DDIM inversion. VideoElevator significantly improves the performance of T2V baselines in terms of frame quality, text alignment, and aesthetic style when integrated with either foundational or personalized T2I. Human evaluation shows a strong preference for videos generated by VideoElevator-enhanced T2V models. VideoElevator is compatible with personalized Stable Diffusion XL (SDXL) models, including those fine-tuned with LoRA and Diffusion-DPO. The paper focuses on improving quality and doesn't explicitly address aspects like video length or computational efficiency. Further exploration is needed to optimize the trade-off between quality improvement and computational cost. video generation, text-to-video synthesis, diffusion models, text-to-image diffusion, video quality enhancement
2403.05239 Report Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation Junyan Wang, Zhenhong Sun, Zhiyu Tan, Xuanbai Chen, Weihua Chen, Hao Li, Cheng Zhang, Yang Song Vanilla text-to-image diffusion models struggle with generating accurate human images, commonly resulting in imperfect anatomies such as unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls -- human-centric priors such as pose or depth maps -- during the image generation phase. This paper explores the integration of these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, according to an in-depth analysis of the cross-attention layer. Extensive experiments show that our method largely improves over state-of-the-art text-to-image models to synthesize high-quality human images based on user-written prompts. Project page: https://hcplayercvpr2024.github.io. This paper proposes a novel Human-centric Prior (HcP) layer to enhance the accuracy of human image generation in text-to-image diffusion models without requiring additional conditions during inference. Generating accurate human images from text descriptions is crucial for various applications, but vanilla diffusion models often struggle with this task, resulting in anatomical imperfections. The HcP layer is trained with a human-centric alignment loss to better align cross-attention maps with human-centric textual information. This approach incorporates human-centric prior knowledge, such as pose images, directly into the model fine-tuning stage. The HcP layer significantly improves the structural accuracy of generated human images, particularly in depicting complex poses and proportions. The proposed method preserves the original generative capabilities and style of the pre-trained diffusion model, unlike methods like LoRA that might alter the model's expressiveness. The HcP layer is a plug-and-play module compatible with other controllable text-to-image diffusion models like ControlNet, further enhancing their capabilities. The model currently relies on a single type of human-centric prior information (e.g., pose). There is room for improvement in handling highly complex scenes with multiple interacting individuals. text-to-image generation, diffusion models, human image synthesis, cross-attention, human-centric priors
2403.05231 Report Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, Haibin Ling Motivated by Parameter-Efficient Fine-Tuning (PEFT) in large language models, we propose LoRAT, a method that unveils the power of larger Vision Transformers (ViT) for tracking within laboratory-level resources. The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency, to the domain of visual tracking. However, unique challenges and potential domain gaps make this transfer less straightforward than it might first appear. Firstly, a transformer-based tracker constructs unshared position embeddings for the template and search images. This poses a challenge for the transfer of LoRA, usually requiring consistency in the design when applied to the pre-trained backbone, to downstream tasks. Secondly, the inductive bias inherent in convolutional heads diminishes the effectiveness of parameter-efficient fine-tuning in tracking models. To overcome these limitations, we first decouple the position embeddings in transformer-based trackers into shared spatial ones and independent type ones. The shared embeddings, which describe the absolute coordinates of multi-resolution images (namely, the template and search images), are inherited from the pre-trained backbones. In contrast, the independent embeddings indicate the sources of each token and are learned from scratch. Furthermore, we design an anchor-free head solely based on a multilayer perceptron (MLP) to adapt PETR, enabling better performance with less computational overhead. With our design, 1) it becomes practical to train trackers with the ViT-g backbone on GPUs with only 25.8 GB of memory (batch size of 16); 2) we reduce the training time of the L-224 variant from 35.0 to 10.8 GPU hours; 3) we improve the LaSOT SUC score from 0.703 to 0.743 with the L-224 variant; 4) we increase the inference speed of the L-224 variant from 52 to 119 FPS. Code and models will be released. Proposes LoRAT, a novel visual tracking method leveraging Low-Rank Adaptation (LoRA) within a one-stream tracking framework for efficient fine-tuning of large Vision Transformers, making them more accessible for resource-constrained researchers. Large Vision Transformers, while powerful for visual tracking, demand significant computational resources, making their training impractical for most researchers. Adapts LoRA to a one-stream tracking architecture with two key designs: 1) a decoupled input embedding with shared spatial and independent type embeddings for preserving the pre-trained ViT structure; 2) an MLP-only head network to mitigate inductive biases from convolutional heads. Achieves state-of-the-art performance on multiple benchmarks, setting a new record on LaSOT with 0.762 SUC score using ViT-g backbone. Significantly reduces training time and memory requirements compared to full fine-tuning, enabling training of large models with limited resources. Demonstrates the feasibility of training advanced tracking models with manageable resources, making cutting-edge research accessible to a wider community. Limited exploration of LoRA rank variation's impact on different ViT backbones. Future work could explore combining LoRAT with other PEFT techniques for further efficiency. visual object tracking, lora, parameter-efficient fine-tuning, vision transformer, one-stream tracking
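The entry builds on LoRA, which the abstract describes as fine-tuning a small subset of parameters without adding inference latency. The generic sketch below (not the LoRAT tracker itself) shows a frozen linear layer with a trainable low-rank update and a merge step that folds the update back into the base weight, which is why no extra inference cost remains; the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (generic sketch)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # backbone stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

    @torch.no_grad()
    def merge(self):
        """Fold the low-rank update into the base weight; inference then costs nothing extra."""
        self.base.weight += self.scale * (self.B @ self.A)

layer = LoRALinear(nn.Linear(768, 768))
x = torch.randn(4, 196, 768)
out = layer(x)              # training-time path: frozen base + low-rank update
layer.merge()               # fold the update into the base weight
print(torch.allclose(out, layer.base(x), atol=1e-5))  # plain base(x) now matches
```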
2403.05154 Report GSEdit: Efficient Text-Guided Editing of 3D Objects via Gaussian Splatting Francesco Palandra, Andrea Sanchietti, Daniele Baieri, Emanuele Rodolà We present GSEdit, a pipeline for text-guided 3D object editing based on Gaussian Splatting models. Our method enables the editing of the style and appearance of 3D objects without altering their main details, all in a matter of minutes on consumer hardware. We tackle the problem by leveraging Gaussian splatting to represent 3D scenes, and we optimize the model while progressively varying the image supervision by means of a pretrained image-based diffusion model. The input object may be given as a 3D triangular mesh, or directly provided as Gaussians from a generative model such as DreamGaussian. GSEdit ensures consistency across different viewpoints, maintaining the integrity of the original object's information. Compared to previously proposed methods relying on NeRF-like MLP models, GSEdit stands out for its efficiency, making 3D editing tasks much faster. Our editing process is refined via the application of the SDS loss, ensuring that our edits are both precise and accurate. Our comprehensive evaluation demonstrates that GSEdit effectively alters object shape and appearance following the given textual instructions while preserving their coherence and detail. Introduces GSEdit, a pipeline for efficient text-guided 3D object editing using Gaussian Splatting models and image diffusion models. Empowers 3D artists with fast and automated editing capabilities, enhancing workflow in creative and industrial pipelines. Leverages Gaussian Splatting for scene representation and optimizes it by progressively modifying image supervision via a pretrained image-based diffusion model (Instruct-Pix2Pix). Employs SDS loss for accurate editing and supports both mesh and point cloud inputs. Achieves significant object shape and appearance modifications based on textual prompts. Preserves object coherence and detail during editing. Demonstrates superior efficiency compared to NeRF-based methods, enabling editing within minutes on consumer hardware. Editing scope limited by Instruct-Pix2Pix capabilities, hindering significant spatial transformations (e.g., pose alteration). Perspective bias in Instruct-Pix2Pix can introduce artifacts, impacting the consistency and quality of edits. gaussian splatting, radiance fields, 3d object editing, text-guided editing, diffusion models
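The entry mentions refining edits with the SDS loss. The following is a schematic, heavily simplified sketch of Score Distillation Sampling with a placeholder noise predictor standing in for an instruction-conditioned diffusion model such as Instruct-Pix2Pix; the noise schedule, the weighting w(t), and the function names are illustrative assumptions rather than the paper's implementation.

```python
import torch

def sds_loss(latents, t, noise_pred_fn, alphas_cumprod, guidance_weight=1.0):
    """Score Distillation Sampling, schematic version.

    noise_pred_fn(noisy_latents, t) is a placeholder for a frozen diffusion model's
    epsilon prediction; alphas_cumprod is its cumulative noise schedule.
    """
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t]
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise    # forward diffusion to step t
    with torch.no_grad():
        eps_pred = noise_pred_fn(noisy, t)
    w = 1 - a_t                                                # one common weighting choice
    grad = guidance_weight * w * (eps_pred - noise)            # SDS gradient w.r.t. latents
    # standard SDS trick: apply grad directly, skipping the diffusion model's Jacobian
    return (latents * grad.detach()).sum()

# toy usage with a dummy noise predictor and a toy schedule
alphas = torch.linspace(0.999, 0.01, 1000)
latents = torch.randn(1, 4, 64, 64, requires_grad=True)
loss = sds_loss(latents, t=500, noise_pred_fn=lambda x, t: torch.randn_like(x), alphas_cumprod=alphas)
loss.backward()
print(latents.grad.abs().mean())
```

In a pipeline like the one described, the "latents" would instead be renderings of the Gaussian scene, so the SDS gradient flows back through the differentiable rasterizer into the Gaussian parameters.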
2403.05139 Report Improving Diffusion Models for Virtual Try-on Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, Jinwoo Shin This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available in our project page: https://idm-vton.github.io This paper introduces IDM-VTON, a novel diffusion model for authentic virtual try-on that improves garment fidelity by using two modules to encode garment semantics: an image prompt adapter for high-level semantics and a UNet encoder (GarmentNet) for low-level features. Existing diffusion-based virtual try-on methods struggle to preserve fine-grained details of garments, hindering their real-world applicability. This method aims to address this limitation and generate more realistic and detailed try-on images. The model consists of a base UNet (TryonNet) for the person image, an image prompt adapter for garment semantics, and GarmentNet for detailed garment features. They leverage Stable Diffusion XL and incorporate detailed garment captions to enhance the model's understanding. Additionally, they propose a customization method using a single pair of garment and person images for better adaptation to real-world scenarios. IDM-VTON outperforms previous diffusion-based and GAN-based methods in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. The use of GarmentNet significantly improves the preservation of fine-grained garment details compared to using only the image prompt adapter. The proposed customization method significantly enhances the visual quality and garment fidelity, especially in challenging, real-world scenarios. The model may not perfectly preserve human attributes on masked regions like tattoos or skin moles. Future work includes exploring broader applications like controlling garment generation through textual prompts. virtual try-on, diffusion models, image generation, garment fidelity, customization
2403.05135 Report ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships. This paper introduces ELLA, a lightweight approach that equips existing CLIP-based text-to-image diffusion models with Large Language Models (LLMs) to enhance their ability to understand and generate images from dense prompts, without requiring training of the LLM or the diffusion model's U-Net. Existing text-to-image models often struggle with dense prompts that describe multiple objects, detailed attributes, and complex relationships. ELLA addresses this limitation by incorporating the superior language understanding of LLMs. ELLA uses a pre-trained LLM as the text encoder and introduces a novel Timestep-Aware Semantic Connector (TSC). TSC dynamically extracts timestep-dependent semantic features from the LLM, effectively guiding the frozen U-Net at different stages of the image generation process. ELLA significantly outperforms CLIP-based diffusion models on dense prompt following benchmarks. ELLA demonstrates strong compatibility with community models and downstream tools like LoRA and ControlNet, enhancing their prompt-following capabilities. User studies confirm that ELLA leads to improved text-image alignment while maintaining competitive aesthetic quality. The training captions, synthesized by MLLM, might not be entirely reliable in terms of shape and spatial relationship understanding. The aesthetic quality of generated images might be limited by the use of a frozen U-Net. text-to-image generation, diffusion models, large language models, semantic alignment, dense prompts
2403.05131 Report Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation Joseph Cho, Fachrina Dewi Puspitasari, Sheng Zheng, Jingyao Zheng, Lik-Hang Lee, Tae-Ho Kim, Choong Seon Hong, Chaoning Zhang Text-to-video generation marks a significant frontier in the rapidly evolving domain of generative AI, integrating advancements in text-to-image synthesis, video captioning, and text-guided editing. This survey critically examines the progression of text-to-video technologies, focusing on the shift from traditional generative models to the cutting-edge Sora model, highlighting developments in scalability and generalizability. Distinguishing our analysis from prior works, we offer an in-depth exploration of the technological frameworks and evolutionary pathways of these models. Additionally, we delve into practical applications and address ethical and technological challenges such as the inability to perform multiple entity handling, comprehend causal-effect learning, understand physical interaction, perceive object scaling and proportioning, and combat object hallucination which is also a long-standing problem in generative models. Our comprehensive discussion covers the topic of enablement of text-to-video generation models as human-assistive tools and world models, as well as eliciting model's shortcomings and summarizing future improvement direction that mainly centers around training datasets and evaluation metrics (both automatic and human-centered). Aimed at both newcomers and seasoned researchers, this survey seeks to catalyze further innovation and discussion in the growing field of text-to-video generation, paving the way for more reliable and practical generative artificial intelligence technologies. This paper presents a comprehensive survey of text-to-video generation models, focusing on their evolution from traditional methods to the advanced Sora model by OpenAI. Text-to-video generation is a significant frontier in generative AI, with potential to revolutionize content creation across various fields like entertainment, education, and marketing. The authors chronologically review key technologies, model architectures (GAN, autoregressive, diffusion), evaluation metrics, applications, and limitations of these models. They delve into Sora's capabilities as a potential 'world model' and discuss its human-centered design. Diffusion-based models, including Sora, have become the dominant approach in text-to-video generation due to their ability to generate high-quality and coherent videos. Despite advancements, challenges remain in areas like handling multiple entities, causal-effect learning, physical interaction simulation, and object scaling. There's a need for larger, more diverse text-video datasets and more sophisticated evaluation metrics that go beyond just visual quality. The paper focuses heavily on Sora, which, despite its significance, limits the depth of analysis on other models. The ethical considerations, while mentioned, could be explored in more detail, especially regarding potential misuse and bias. text-to-video generation, generative ai, sora model, world models, ai ethics
2403.05125 Report Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis Muxi Chen, Yi Liu, Jian Yi, Changran Xu, Qiuxia Lai, Hongliang Wang, Tsung-Yi Ho, Qiang Xu In this paper, we present an empirical study introducing a nuanced evaluation framework for text-to-image (T2I) generative models, applied to human image synthesis. Our framework categorizes evaluations into two distinct groups: first, focusing on image qualities such as aesthetics and realism, and second, examining text conditions through concept coverage and fairness. We introduce an innovative aesthetic score prediction model that assesses the visual appeal of generated images and unveils the first dataset marked with low-quality regions in generated human images to facilitate automatic defect detection. Our exploration into concept coverage probes the model's effectiveness in interpreting and rendering text-based concepts accurately, while our analysis of fairness reveals biases in model outputs, with an emphasis on gender, race, and age. While our study is grounded in human imagery, this dual-faceted approach is designed with the flexibility to be applicable to other forms of image generation, enhancing our understanding of generative models and paving the way to the next generation of more sophisticated, contextually aware, and ethically attuned generative models. We will release our code, the data used for evaluating generative models and the dataset annotated with defective areas soon. This paper presents an empirical study with a new evaluation framework for text-to-image (T2I) generative models, specifically for human image synthesis. Existing evaluation metrics are insufficient to fully capture model performance, especially in terms of realism, adherence to text prompts, and potential biases. The framework uses two approaches: 1) Image Quality: A new aesthetic score prediction model (CAN) and a dataset with annotated defects in generated human images are introduced. 2) Text Condition: Concept coverage is assessed using VQA-based metrics, and fairness is analyzed by identifying potential biases in gender, race, and age. Midjourney generates images with higher aesthetic scores and lower defect rates compared to Stable Diffusion models. Stable Diffusion models have shown improvements in aesthetics and realism with each update (SD1.5 to SDXL). All evaluated models exhibit significant fairness issues, often generating biased images based on gender, race, and age despite no explicit prompt specification. The current defect identification model requires further improvement. The concept coverage evaluation currently focuses on single concepts and needs to be expanded to address multiple concepts in a single prompt. text-to-image synthesis, generative models, human image generation, evaluation framework, bias detection
2403.05121 Report CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while utilizing only 1/10 of SDXL's inference time. This paper introduces CogView3, a novel text-to-image generation system leveraging relay diffusion to enhance efficiency and detail refinement. Existing single-stage text-to-image diffusion models are computationally expensive and struggle with detailed refinement, prompting the need for more efficient and effective approaches. CogView3 employs a cascaded framework, generating low-resolution images before applying relay-based super-resolution for refinement. This approach, implemented in the latent image space with a linear blurring schedule, reduces training and inference costs while maintaining output quality. Notably, it uses a pretrained T5-XXL text encoder and a variational KL-regularized autoencoder for latent representation. CogView3 outperforms SDXL in human evaluations by 77.0% while halving inference time. The distilled CogView3 achieves comparable performance to SDXL using only 1/10 of the inference time. Prompt expansion techniques significantly improve CogView3's instruction following capabilities. Exploring the generation of even higher resolution images (e.g., 4096x4096) using tiled diffusion methods is a potential future direction. Further investigation into optimizing the trade-off between generation quality and inference speed during distillation is warranted. text-to-image generation, diffusion models, relay diffusion, cascaded framework, super-resolution
2403.05094 Report Face2Diffusion for Fast and Editable Face Personalization Kaede Shiohara, Toshihiko Yamasaki Face personalization aims to insert specific faces, taken from images, into pretrained text-to-image diffusion models. However, it is still challenging for previous methods to preserve both the identity similarity and editability due to overfitting to training samples. In this paper, we propose Face2Diffusion (F2D) for high-editability face personalization. The core idea behind F2D is that removing identity-irrelevant information from the training pipeline prevents the overfitting problem and improves editability of encoded faces. F2D consists of the following three novel components: 1) Multi-scale identity encoder provides well-disentangled identity features while keeping the benefits of multi-scale information, which improves the diversity of camera poses. 2) Expression guidance disentangles face expressions from identities and improves the controllability of face expressions. 3) Class-guided denoising regularization encourages models to learn how faces should be denoised, which boosts the text-alignment of backgrounds. Extensive experiments on the FaceForensics++ dataset and diverse prompts demonstrate our method greatly improves the trade-off between the identity- and text-fidelity compared to previous state-of-the-art methods. This paper introduces Face2Diffusion (F2D), a novel method for face personalization in text-to-image diffusion models that enhances editability while preserving identity similarity. Existing face personalization techniques often lead to overfitting on training samples, compromising the model's ability to generate diverse images that adhere to different text prompts while maintaining the subject's identity. F2D tackles the overfitting problem through three key innovations: 1) a multi-scale identity encoder for disentangling camera poses, 2) expression guidance for separating expressions from identity features, and 3) class-guided denoising regularization to enhance text-alignment in backgrounds. F2D outperforms nine state-of-the-art methods in balancing identity preservation and text alignment, evidenced by achieving the best scores in combined identity-text metrics. The multi-scale identity encoder successfully disentangles camera poses, leading to improved editability compared to using only the deepest layer features. Class-guided denoising regularization effectively reduces overfitting to background information without compromising identity similarity, unlike techniques like DSC. The reliance on the class word "a person" in CGDR makes the model susceptible to biases inherent in the base T2I model's representation of that concept. Future work can focus on mitigating the potential misuse of face personalization for creating misleading content, such as by contributing generated images to image forensic research. face personalization, text-to-image synthesis, diffusion models, overfitting, disentanglement
2403.05087 Report SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, Zeyu Wang We present SplattingAvatar, a hybrid 3D representation of photorealistic human avatars with Gaussian Splatting embedded on a triangle mesh, which renders over 300 FPS on a modern GPU and 30 FPS on a mobile device. We disentangle the motion and appearance of a virtual human with explicit mesh geometry and implicit appearance modeling with Gaussian Splatting. The Gaussians are defined by barycentric coordinates and displacement on a triangle mesh as Phong surfaces. We extend lifted optimization to simultaneously optimize the parameters of the Gaussians while walking on the triangle mesh. SplattingAvatar is a hybrid representation of virtual humans where the mesh represents low-frequency motion and surface deformation, while the Gaussians take over the high-frequency geometry and detailed appearance. Unlike existing deformation methods that rely on an MLP-based linear blend skinning (LBS) field for motion, we control the rotation and translation of the Gaussians directly by mesh, which empowers its compatibility with various animation techniques, e.g., skeletal animation, blend shapes, and mesh editing. Trainable from monocular videos for both full-body and head avatars, SplattingAvatar shows state-of-the-art rendering quality across multiple datasets. SplattingAvatar: a hybrid 3D representation for photorealistic human avatars with Gaussian Splatting embedded on a triangle mesh for real-time rendering. Addresses limitations of NeRF and MLP-based motion control in capturing high-frequency details and surface deformations for real-time realistic avatar rendering. Combines Gaussian Splatting for high-frequency details with mesh representation for low-frequency motion and deformation. Uses lifted optimization for joint optimization of Gaussian parameters and mesh embeddings, enabling explicit motion control of Gaussians by the mesh. Achieves state-of-the-art rendering quality for both head and full-body avatars from monocular videos. Demonstrates efficient real-time rendering capabilities in Unity, achieving over 300 FPS on an NVIDIA RTX 3090 GPU and 30 FPS on an iPhone 13. Outperforms existing methods in terms of photometric quality with improved details and handling of thin structures, as evidenced by quantitative metrics like PSNR, SSIM, and LPIPS. Performance depends on the motion representation capability of the driving mesh, limited by current FLAME and SMPL-X models. Lacks separate motion representation for clothes and hair. human avatar, gaussian splatting, real-time rendering, mesh embedding, lifted optimization
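A minimal NumPy sketch of the mesh-embedded placement described in the SplattingAvatar entry above: a splat center is the barycentric interpolation of a triangle's vertices plus a displacement along the Phong-interpolated normal, so re-posing the mesh automatically moves the Gaussian. Function and variable names are illustrative assumptions, not the authors' code, and the real method additionally lets each Gaussian "walk" across triangles during lifted optimization.

    import numpy as np

    def splat_center(tri_verts, tri_normals, bary, displacement):
        # Barycentric interpolation of the triangle's vertices gives a surface point;
        # the Phong-interpolated normal gives the offset direction for the splat center.
        p = bary @ tri_verts
        n = bary @ tri_normals
        n = n / np.linalg.norm(n)
        return p + displacement * n

    # Toy usage: one triangle with upward-facing normals, a splat 0.01 units above it.
    verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    normals = np.tile([0.0, 0.0, 1.0], (3, 1))
    bary = np.array([0.2, 0.3, 0.5])
    print(splat_center(verts, normals, bary, displacement=0.01))

When the mesh deforms (skeletal animation, blend shapes, editing), the same barycentric coordinates and displacement re-locate the Gaussian, which is how mesh-driven motion control works in this kind of hybrid representation.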
2403.05056 Report Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation Yifan Mao, Jian Liu, Xianming Liu Monocular depth estimation is a crucial task in computer vision. While existing methods have shown impressive results under standard conditions, they often face challenges in reliably performing in scenarios such as low-light or rainy conditions due to the absence of diverse training data. This paper introduces a novel approach named Stealing Stable Diffusion (SSD) prior for robust monocular depth estimation. The approach addresses this limitation by utilizing stable diffusion to generate synthetic images that mimic challenging conditions. Additionally, a self-training mechanism is introduced to enhance the model's depth estimation capability in such challenging environments. To enhance the utilization of the stable diffusion prior further, the DINOv2 encoder is integrated into the depth model architecture, enabling the model to leverage rich semantic priors and improve its scene understanding. Furthermore, a teacher loss is introduced to guide the student models in acquiring meaningful knowledge independently, thus reducing their dependency on the teacher models. The effectiveness of the approach is evaluated on nuScenes and Oxford RobotCar, two challenging public datasets, with the results showing the efficacy of the method. Source code and weights are available at: https://github.com/hitcslj/SSD. This paper introduces Stealing Stable Diffusion (SSD), a novel approach that leverages stable diffusion priors for robust monocular depth estimation in challenging conditions like low-light and rain. Existing monocular depth estimation methods struggle in challenging conditions due to the lack of diverse training data and the limitations of existing data augmentation techniques. SSD utilizes a generative diffusion model-based translation (GDT) model to generate synthetic images mimicking challenging conditions, employs a self-training mechanism with a teacher-student network architecture, and incorporates a novel teacher loss and semantic loss for improved knowledge distillation. SSD outperforms existing methods on nuScenes and RobotCar datasets, achieving state-of-the-art performance in challenging conditions. The GDT model effectively generates diverse and realistic images of challenging conditions, surpassing GAN-based methods. The proposed teacher loss and semantic loss contribute to improved depth estimation accuracy by facilitating effective knowledge transfer and semantic feature alignment. The performance of SSD relies on the quality and diversity of the generated synthetic images, which can be further improved with advancements in generative diffusion models. The computational cost of SSD is higher than some existing methods due to the use of multiple large pre-trained models. monocular depth estimation, robustness, stable diffusion, self-training, generative diffusion models
2403.05053 Report PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin Image composition involves seamlessly integrating given objects into a specific visual context. The current training-free methods rely on composing attention weights from several samplers to guide the generator. However, since these weights are derived from disparate contexts, their combination leads to coherence confusion in synthesis and loss of appearance information. These issues worsen with their excessive focus on background generation, even when unnecessary in this task. This not only slows down inference but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, utilizing its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. Besides, we introduce a Region-constrained Cross-Attention to confine the impact of specific subject-related words to desired regions, addressing the unwanted artifacts shown in the prior method, thereby further improving the coherence in the transition area. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively. Proposes PrimeComposer, a faster training-free diffusion model for image composition that leverages attention steering to seamlessly integrate objects while preserving their appearance and ensuring natural coherence. Current training-free image composition methods struggle to maintain object appearance and coherent integration, especially across different visual domains. They also suffer from slow inference due to unnecessary background generation. Formulates composition as a local editing task focused on the foreground. Employs a Correlation Diffuser to generate attention weights capturing object appearance and coherence information, which are then used to guide the main diffusion model (LDM). Introduces Region-constrained Cross-Attention (RCA) to restrict the impact of object-specific words to desired regions, further enhancing coherence. Extends classifier-free guidance to reinforce the steering effect. Outperforms state-of-the-art methods qualitatively and quantitatively in cross-domain image composition. Exhibits significantly faster inference speed compared to the previous best training-free method. Receives favorable feedback in user studies across various domains, demonstrating its effectiveness in preserving object appearance, background consistency, and seamless composition. Limited control over object viewpoint. Current methodology cannot seamlessly integrate multiple objects simultaneously. image composition, diffusion models, attention steering, local image editing, training-free
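A minimal PyTorch sketch of the per-step composition described in the PrimeComposer entry above: the synthesized foreground latent is kept inside the subject mask while the clean background latent is re-noised to the current timestep and pasted outside it. The DDPM-style forward noising, the binary mask, and all names are assumptions about the general recipe, not the paper's exact code.

    import torch

    def compose_step_latent(fg_latent_t, bg_latent_0, mask, alphas_cumprod, t):
        # Re-noise the clean background latent to timestep t (DDPM forward process),
        # then keep the edited foreground inside the mask and the noisy background outside it.
        a_t = alphas_cumprod[t]
        noise = torch.randn_like(bg_latent_0)
        bg_latent_t = a_t.sqrt() * bg_latent_0 + (1.0 - a_t).sqrt() * noise
        return mask * fg_latent_t + (1.0 - mask) * bg_latent_t

    # Toy usage with 4-channel 64x64 latents and a box-shaped foreground mask.
    alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)
    z_fg_t = torch.randn(1, 4, 64, 64)
    z_bg_0 = torch.randn(1, 4, 64, 64)
    mask = torch.zeros(1, 1, 64, 64)
    mask[..., 16:48, 16:48] = 1.0
    z_t = compose_step_latent(z_fg_t, z_bg_0, mask, alphas_cumprod, t=500)

Because only the foreground is actually synthesized, the background never needs to be regenerated, which is where the claimed inference savings come from.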
2403.05018 Report InstructGIE: Towards Generalizable Image Editing Zichong Meng, Changdi Yang, Jun Liu, Hao Tang, Pu Zhao, Yanzhi Wang Recent advances in image editing have been driven by the development of denoising diffusion models, marking a significant leap forward in this field. Despite these advances, the generalization capabilities of recent image editing approaches remain constrained. In response to this challenge, our study introduces a novel image editing framework with enhanced generalization robustness by boosting in-context learning capability and unifying language instruction. This framework incorporates a module specifically optimized for image editing tasks, leveraging the VMamba Block and an editing-shift matching strategy to augment in-context learning. Furthermore, we unveil a selective area-matching technique specifically engineered to address and rectify corrupted details in generated images, such as human facial features, to further improve the quality. Another key innovation of our approach is the integration of a language unification technique, which aligns language embeddings with editing semantics to elevate the quality of image editing. Moreover, we compile the first dataset for image editing with visual prompts and editing instructions that could be used to enhance in-context capability. Trained on this dataset, our methodology not only achieves superior synthesis quality for trained tasks, but also demonstrates robust generalization capability across unseen vision tasks through tailored prompts. This paper introduces InstructGIE, a novel image editing framework that improves generalization robustness in image editing by enhancing in-context learning and unifying language instructions. Existing image editing methods struggle to generalize to unseen editing tasks due to limitations in understanding complex visual and textual instructions. The proposed InstructGIE framework utilizes a VMamba-based module and an editing-shift matching strategy to enhance in-context learning. It also employs a language unification technique to align language embeddings with editing semantics. Additionally, a selective area-matching method refines details in generated images. InstructGIE demonstrates superior synthesis quality for trained image editing tasks. The framework exhibits robust generalization capabilities across unseen vision tasks through tailored prompts. Quantitative and qualitative evaluations demonstrate significant improvements in FID and CLIP directional Similarity scores compared to existing methods. The dependence on pre-trained models like CLIP and Mask2Former introduces potential biases. Further exploration of more complex and nuanced editing instructions is an area for future research. image editing, in-context learning, diffusion model, generalization, visual prompting
2403.04993 Report PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts Zewen Chen, Haina Qin, Juan Wang, Chunfeng Yuan, Bing Li, Weiming Hu, Liang Wang Due to the diversity of assessment requirements in various application scenarios for the IQA task, existing IQA methods struggle to directly adapt to these varied requirements after training. Thus, when facing new requirements, a typical approach is fine-tuning these models on datasets specifically created for those requirements. However, it is time-consuming to establish IQA datasets. In this work, we propose a Prompt-based IQA (PromptIQA) that can directly adapt to new requirements without fine-tuning after training. On one hand, it utilizes a short sequence of Image-Score Pairs (ISP) as prompts for targeted predictions, which significantly reduces the dependency on the data requirements. On the other hand, PromptIQA is trained on a mixed dataset with two proposed data augmentation strategies to learn diverse requirements, thus enabling it to effectively adapt to new requirements. Experiments indicate that the PromptIQA outperforms SOTA methods with higher performance and better generalization. The code will be available. This paper introduces PromptIQA, a novel No-Reference Image Quality Assessment (NR-IQA) framework that adapts to new assessment requirements using a small set of image-score pairs as prompts, eliminating the need for fine-tuning. Existing IQA methods struggle to adapt to diverse assessment requirements across different applications. Fine-tuning on new datasets is a common approach but is time-consuming and impractical for every new requirement. PromptIQA leverages Image-Score Pair Prompts (ISPPs) to represent specific assessment requirements. It's trained on a mixed dataset using data augmentation (random scaling and flipping) to learn diverse requirements, enabling adaptation to new ones without fine-tuning. PromptIQA outperforms state-of-the-art IQA methods, especially on authentic distortion, face, AI-generated, and underwater IQA tasks. It demonstrates superior generalization ability on new assessment requirements simulated by FR-IQA models compared to models trained on specific datasets or with fine-tuning. Ablation studies confirm the effectiveness of the proposed components, including mixed training, prompts, and data augmentation strategies. Performance on synthetic distortion datasets (LIVE, CSIQ) needs improvement, potentially due to differences in label distribution compared to other datasets. Future work can explore alternative prompt selection strategies and investigate the impact of prompt size more comprehensively. nr-iqa, image quality assessment, prompts, generalization, data augmentation
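A small PyTorch sketch of the score-side augmentation idea mentioned in the PromptIQA entry above: randomly rescaling and flipping the scores of an image-score-pair prompt makes one labeled set behave like many different assessment requirements. The [0, 1] normalization, the scaling range, and the exact flip rule are assumptions for illustration, not the paper's definition.

    import torch

    def augment_isp_scores(scores, p_flip=0.5):
        # scores: quality labels (assumed normalized to [0, 1]) paired with the prompt images.
        scale = torch.empty(1).uniform_(0.5, 1.0)
        scores = scores * scale                 # random scaling of the requirement
        if torch.rand(1).item() < p_flip:
            scores = scale - scores             # random flipping: high quality becomes low, and vice versa
        return scores

    prompt_scores = torch.tensor([0.9, 0.4, 0.7, 0.1])
    print(augment_isp_scores(prompt_scores))

At inference time, a handful of such image-score pairs would be fed alongside the test image so the model can infer which "requirement" the scores encode, which is what removes the need for fine-tuning.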
2403.04965 Report StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models Lezhong Wang, Jeppe Revall Frisvad, Mark Bo Jensen, Siavash Arjomand Bigdeli The demand for stereo images increases as manufacturers launch more XR devices. To meet this demand, we introduce StereoDiffusion, a method that, unlike traditional inpainting pipelines, is training free, remarkably straightforward to use, and seamlessly integrates into the original Stable Diffusion model. Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs, without the need for fine-tuning model weights or any post-processing of images. Using the original input to generate a left image and estimate a disparity map for it, we generate the latent vector for the right image through Stereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking Denoise and Self-Attention Layers Modification methods to align the right-side image with the left-side image. Moreover, our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations. This paper presents StereoDiffusion, a training-free method that adapts the original Stable Diffusion model to generate stereo image pairs by modifying latent variables, without fine-tuning model weights or post-processing. The growing number of XR devices increases the demand for stereo content, and traditional inpainting-based pipelines are heavier and less convenient than an end-to-end, lightweight approach. The method generates a left image from the original input and estimates a disparity map for it, then derives the right-view latent through Stereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking Denoise and Self-Attention Layers Modification to keep the right image aligned with the left. The approach maintains a high standard of image quality throughout the stereo generation process. It achieves state-of-the-art scores in various quantitative evaluations. stereo image generation, latent diffusion models, training-free, disparity, stable diffusion
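A rough PyTorch sketch of a disparity-driven pixel shift on a latent tensor, the kind of warp a Stereo Pixel Shift operation performs before masked denoising and self-attention modification fill in occlusions. Whether the actual method warps forward or backward and how it treats holes is not stated here; this backward-warping variant with border clamping, and all names in it, are assumptions for illustration.

    import torch

    def stereo_pixel_shift(latent, disparity):
        # Backward warp: target pixel (y, x) samples the left-view latent at column x + d(y, x).
        # disparity is given at latent resolution; out-of-range samples clamp to the border.
        b, c, h, w = latent.shape
        xs = torch.arange(w).view(1, 1, w).expand(b, h, w)
        src_x = (xs + disparity.round().long()).clamp(0, w - 1)
        idx = src_x.unsqueeze(1).expand(b, c, h, w)
        return torch.gather(latent, dim=3, index=idx)

    # Toy usage: shift a 4-channel 32x32 left-view latent by a uniform 3-pixel disparity.
    z_left = torch.randn(1, 4, 32, 32)
    disparity = torch.full((1, 32, 32), 3.0)
    z_right = stereo_pixel_shift(z_left, disparity)

Because the shift happens in latent space, the right view can be decoded with the same Stable Diffusion decoder, which is what keeps the pipeline training-free.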
2403.04926 Report BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling Cheng Peng, Yutao Tang, Yifan Zhou, Nengyu Wang, Xijun Liu, Deming Li, Rama Chellappa Recent efforts in using 3D Gaussians for scene reconstruction and novel view synthesis can achieve impressive results on curated benchmarks; however, images captured in real life are often blurry. In this work, we analyze the robustness of Gaussian-Splatting-based methods against various image blur, such as motion blur, defocus blur, downscaling blur, etc. Under these degradations, Gaussian-Splatting-based methods tend to overfit and produce worse results than Neural-Radiance-Field-based methods. To address this issue, we propose Blur Agnostic Gaussian Splatting (BAGS). BAGS introduces additional 2D modeling capacities such that a 3D-consistent and high-quality scene can be reconstructed despite image-wise blur. Specifically, we model blur by estimating per-pixel convolution kernels from a Blur Proposal Network (BPN). BPN is designed to consider spatial, color, and depth variations of the scene to maximize modeling capacity. Additionally, BPN also proposes a quality-assessing mask, which indicates regions where blur occurs. Finally, we introduce a coarse-to-fine kernel optimization scheme; this optimization scheme is fast and avoids sub-optimal solutions due to a sparse point cloud initialization, which often occurs when we apply Structure-from-Motion on blurry images. We demonstrate that BAGS achieves photorealistic renderings under various challenging blur conditions and imaging geometry, while significantly improving upon existing approaches. This paper introduces Blur Agnostic Gaussian Splatting (BAGS), a novel method addressing the sensitivity of Gaussian Splatting-based scene reconstruction to blurry images. Gaussian Splatting, while efficient, struggles with real-world blurry images. BAGS improves robustness by separating multi-view consistent scenes from inconsistent degradations. BAGS employs a Blur Proposal Network (BPN) to estimate per-pixel convolution kernels and masks, considering spatial, color, and depth variations. A coarse-to-fine optimization scheme gradually increases image resolution and kernel size for stability. BAGS achieves state-of-the-art performance on scenes with camera motion blur, outperforming NeRF-based deblurring methods. It demonstrates significant visual improvements on defocus blur and handles mixed-resolution inputs effectively. The generated masks and kernels provide interpretable insights into degradation types and regions. The added computational complexity of BPN requires further optimization. Future work includes exploring dynamic kernel capacity adjustment based on degradation levels. scene reconstruction, gaussian splatting, deblurring, novel view synthesis, multi-scale optimization
2403.04880 Report An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control Aosong Feng, Weikang Qiu, Jinbin Bai, Kaicheng Zhou, Zhen Dong, Xiao Zhang, Rex Ying, Leandros Tassiulas Building on the success of text-to-image diffusion models (DPMs), image editing is an important application to enable human interaction with AI-generated content. Among various editing methods, editing within the prompt space gains more attention due to its capacity and simplicity of controlling semantics. However, since diffusion models are commonly pretrained on descriptive text captions, direct editing of words in text prompts usually leads to completely different generated images, violating the requirements for image editing. On the other hand, existing editing methods usually consider introducing spatial masks to preserve the identity of unedited regions, which are usually ignored by DPMs and therefore lead to inharmonic editing results. Targeting these two challenges, in this work, we propose to disentangle the comprehensive image-prompt interaction into several item-prompt interactions, with each item linked to a special learned prompt. The resulting framework, named D-Edit, is based on pretrained diffusion models with cross-attention layers disentangled and adopts a two-step optimization to build item-prompt associations. Versatile image editing can then be applied to specific items by manipulating the corresponding prompts. We demonstrate state-of-the-art results in four types of editing operations including image-based, text-based, mask-based editing, and item removal, covering most types of editing applications, all within a single unified framework. Notably, D-Edit is the first framework that can (1) achieve item editing through mask editing and (2) combine image and text-based editing. We demonstrate the quality and versatility of the editing results for a diverse collection of images through both qualitative and quantitative evaluations. This paper introduces D-Edit, a versatile image editing framework for diffusion models that disentangles image-prompt interactions into item-prompt associations for item-level editing. Existing diffusion model editing methods struggle to preserve original image information and maintain consistency with editing guidance. D-Edit addresses these challenges by disentangling control and leveraging unique item prompts. D-Edit utilizes a two-step finetuning process: first, optimizing text encoder embeddings for item prompts (special tokens), then fine-tuning UNet weights with grouped cross-attention to disentangle item-prompt interactions. This allows editing by manipulating prompts, masks, and item-prompt associations. D-Edit enables item-level text-guided editing, surpassing null-text inversion with better detail preservation and natural transitions. It supports image-guided editing, outperforming baselines by seamlessly composing objects while retaining their identities. D-Edit allows mask-based editing (moving, reshaping, resizing, refining) and item removal, leading to natural and reasonable results. The quality of editing relies on the accuracy of the segmentation model. Further exploration of different segmentation methods and their impact on editing is needed. image editing, diffusion models, text-to-image, disentangled representation, item-prompt association
2403.04692 Report PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the "weaker" baseline to a "stronger" model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming. This paper introduces PixArt-Σ, a Diffusion Transformer model capable of directly generating 4K resolution images with high fidelity and improved text-prompt alignment. Existing high-quality text-to-image models require substantial resources for training, hindering innovation. This paper explores efficient methods to integrate new datasets and algorithms into pre-trained models, enabling the development of more powerful models with limited resources. The paper leverages the pre-trained PixArt-α model and introduces two key advancements: (1) Training with a higher-quality dataset containing high-resolution images and detailed captions. (2) Implementing an efficient token compression method within the DiT framework to reduce computational demands for high-resolution generation. PixArt-Σ achieves superior image quality and text-prompt alignment compared to its predecessor, PixArt-α, with minimal additional training cost. The model produces high-fidelity 4K images with a smaller model size (0.6B parameters) compared to existing models like SDXL (2.6B) and SD Cascade (5.1B). Human and AI preference studies demonstrate that PixArt-Σ generates high-quality images that closely adhere to user instructions, outperforming or rivaling other open-source and commercial T2I models. The model still exhibits limitations in generating specific scenes, objects (text and hands), and perfectly aligning complex prompts. Future research should focus on data quality, model scaling, mitigating potential biases, and addressing ethical concerns. text-to-image synthesis, diffusion models, diffusion transformer, high-resolution image generation, efficient ai
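A compact PyTorch sketch of key/value token compression inside self-attention, the general idea behind the efficiency claim in the PixArt-Σ entry above: queries stay at full resolution while K and V are spatially downsampled before the attention product, shrinking the quadratic cost for large token grids. Average pooling, the compression ratio, and the module layout are placeholder assumptions (the paper trains its own compression operator), and scaled_dot_product_attention requires PyTorch >= 2.0.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class KVCompressedSelfAttention(nn.Module):
        def __init__(self, dim, heads=8, ratio=2):
            super().__init__()
            self.heads, self.ratio = heads, ratio
            self.qkv = nn.Linear(dim, 3 * dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x, hw):
            b, n, d = x.shape
            h, w = hw
            q, k, v = self.qkv(x).chunk(3, dim=-1)

            def compress(t):
                # Pool K/V over the 2-D token grid to reduce the attention's key length.
                t = t.transpose(1, 2).reshape(b, d, h, w)
                t = F.avg_pool2d(t, self.ratio)
                return t.flatten(2).transpose(1, 2)

            def split(t):
                return t.reshape(b, -1, self.heads, d // self.heads).transpose(1, 2)

            k, v = compress(k), compress(v)
            out = F.scaled_dot_product_attention(split(q), split(k), split(v))
            out = out.transpose(1, 2).reshape(b, n, d)
            return self.proj(out)

    attn = KVCompressedSelfAttention(dim=64)
    tokens = torch.randn(2, 32 * 32, 64)   # a 32x32 latent token grid
    y = attn(tokens, hw=(32, 32))

With a ratio of 2 the key/value length drops by 4x, which is the kind of saving that makes direct 4K-token attention feasible.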
2403.04690 Report Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level Ali Hassani, Wen-Mei Hwu, Humphrey Shi Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision latency compared to existing naive kernels for 1-D and 2-D neighborhood attention respectively. We find certain inherent inefficiencies in all unfused neighborhood attention kernels that bound their performance and lower-precision scalability. We also developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision latency. We observe that our fused kernels successfully circumvent some of the unavoidable inefficiencies in unfused implementations. While our unfused GEMM-based kernels only improve half precision performance compared to naive kernels by an average of 496% and 113% in 1-D and 2-D problems respectively, our fused kernels improve naive kernels by an average of 1607% and 581% in 1-D and 2-D problems respectively. This paper introduces two new classes of CUDA kernels for neighborhood attention, significantly improving performance over existing implementations. Neighborhood attention reduces the quadratic complexity of self-attention to linear complexity, but efficient implementations have been challenging, limiting its practicality. The authors formulate neighborhood attention as a batched GEMM problem with space-aware tiling and gather/scatter fusion, enabling efficient implementation using optimized GEMM kernels. They further propose fused neighborhood attention, extending the logic to fused attention kernels for further latency and memory footprint reduction. GEMM-based kernels achieve up to 9x speedup over naive implementations in full precision and outperform them in most benchmarks. Fused kernels consistently outperform both naive and GEMM-based kernels, with up to 16x speedup in half precision while reducing memory footprint. Model-level benchmarks demonstrate significant throughput improvements in image classification and image generation tasks using the proposed kernels. Current implementation lacks support for backpropagation in fused kernels. Higher-rank implementations (2-D, 3-D) in fused kernels introduce some unavoidable overhead compared to single-rank (1-D) and self-attention. neighborhood attention, self attention, cuda, gemm, fused kernel
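A naive PyTorch reference for 1-D neighborhood attention, included only to pin down the semantics that the GEMM-based and fused kernels in the entry above accelerate: each token attends to a fixed-size window of its nearest neighbors, optionally on a dilated grid. Border handling is simplified here (indices are clamped, whereas neighborhood attention shifts the window near edges), and a real kernel would never materialize per-token neighbor copies.

    import torch

    def neighborhood_attention_1d(q, k, v, window=7, dilation=1):
        # q, k, v: (batch, tokens, dim). Each token attends to `window` neighbors spaced by `dilation`.
        b, n, d = q.shape
        half = window // 2
        offsets = torch.arange(-half, half + 1) * dilation
        centers = torch.arange(n).unsqueeze(1)
        idx = (centers + offsets).clamp(0, n - 1)          # (n, window) neighbor indices
        k_nb = k[:, idx]                                   # (b, n, window, d)
        v_nb = v[:, idx]
        logits = torch.einsum('bnd,bnwd->bnw', q, k_nb) / d ** 0.5
        attn = logits.softmax(dim=-1)
        return torch.einsum('bnw,bnwd->bnd', attn, v_nb)

    q = k = v = torch.randn(2, 128, 32)
    out = neighborhood_attention_1d(q, k, v, window=7, dilation=2)

The cost is linear in the token count for a fixed window, which is the property the paper's custom CUDA kernels preserve while removing the memory and latency overhead of gather-based implementations like this one.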
2403.04634 Report Pix2Gif: Motion-Guided Diffusion for GIF Generation Hitesh Kandala, Jianfeng Gao, Jianwei Yang We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We tackle this problem differently by formulating the task as an image translation problem steered by text and motion magnitude prompts, as shown in teaser fig. To ensure that the model adheres to motion guidance, we propose a new motion-guided warping module to spatially transform the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss to ensure the transformed feature map remains within the same space as the target image, ensuring content consistency and coherence. In preparation for the model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model -- it not only captures the semantic prompt from text but also the spatial ones from motion guidance. We train all our models using a single node of 16xV100 GPUs. Code, dataset and models are made public at: https://hiteshk03.github.io/Pix2Gif/. Presents Pix2Gif, a motion-guided diffusion model for generating GIFs from a single image using text and motion magnitude prompts, framing the task as an image translation problem. Addresses limitations in existing video generation models that compromise resolution and fine-grained temporal control by enabling high-resolution GIF generation with precise motion guidance. Leverages latent diffusion models (LDMs) and introduces a motion-guided warping module to transform source image features based on motion prompts, ensuring consistency with perceptual loss and training on a curated TGIF dataset. Pix2Gif generates GIFs with superior temporal coherence compared to state-of-the-art methods. The model demonstrates enhanced controllability over motion dynamics in generated GIFs. Pix2Gif exhibits emergent capabilities for combining different actions based on complex text prompts. Limited resolution (256x256 pixels) of generated frames. Training dataset size is limited due to computational constraints, potentially affecting model performance. gif generation, motion-guided diffusion, image-to-image translation, temporal coherence, video generation
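A short PyTorch sketch of flow-based feature warping with bilinear sampling, the basic operation a motion-guided warping module performs on source-image features in the Pix2Gif entry above. How the flow is predicted from the text and motion-magnitude prompts is not shown, and the function and variable names are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def warp_features(feat, flow):
        # feat: (b, c, h, w) source features; flow: (b, 2, h, w) displacement in pixels.
        b, c, h, w = feat.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
        base = torch.stack((xs, ys), dim=-1).float()            # (h, w, 2) pixel coordinates
        coords = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)   # add the predicted displacement
        # Normalize to [-1, 1] as required by grid_sample.
        coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1
        coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1
        return F.grid_sample(feat, coords, mode='bilinear', align_corners=True)

    feat = torch.randn(1, 64, 32, 32)
    flow = torch.zeros(1, 2, 32, 32)   # zero motion reproduces the input features
    warped = warp_features(feat, flow)

A perceptual loss between warped and target features, as the entry describes, keeps such a warp from drifting out of the feature space the diffusion decoder expects.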
2403.04493 Report What makes an image realistic? Lucas Theis The last decade has seen tremendous progress in our ability to generate realistic-looking data, be it images, text, audio, or video. Here, we discuss the closely related problem of quantifying realism, that is, designing functions that can reliably tell realistic data from unrealistic data. This problem turns out to be significantly harder to solve and remains poorly understood, despite its prevalence in machine learning and recent breakthroughs in generative AI. Drawing on insights from algorithmic information theory, we discuss why this problem is challenging, why a good generative model alone is insufficient to solve it, and what a good solution would look like. In particular, we introduce the notion of a universal critic, which unlike adversarial critics does not require adversarial training. While universal critics are not immediately practical, they can serve both as a North Star for guiding practical implementations and as a tool for analyzing existing attempts to capture realism. This paper argues that quantifying the realism of data, such as images, can be understood as determining its randomness deficiency, drawing parallels with algorithmic information theory. Defining and measuring realism is crucial for various machine learning applications, including anomaly detection, deepfake detection, and generative model evaluation, yet it remains a challenging and poorly understood problem. The paper leverages the concept of randomness deficiency from algorithmic information theory, proposing "universal critics" based on Kolmogorov complexity and Solomonoff's probability to quantify realism. Randomness deficiency, defined as the difference between negative log-probability and Kolmogorov complexity, effectively captures realism. Batched universal critics, processing multiple independent samples, provide tighter bounds for realism evaluation and generalize both no-reference metrics and divergences. The concept of universal critics sheds light on the success of existing techniques like score distillation sampling, suggesting new avenues for improvement. Kolmogorov complexity and Solomonoff's probability are uncomputable, necessitating practical approximations for real-world applications. Further research is needed to explore efficient and robust approximations to universal critics for optimization tasks. perceptual quality, realism, neural compression, generative adversarial networks, algorithmic information theory
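The central quantity in the entry above can be written down compactly; the LaTeX below states randomness deficiency as the gap between a model's negative log-probability and Kolmogorov complexity. The notation is illustrative, and the conditional form used in the paper may differ slightly.

    % Randomness deficiency of a sample x under a distribution P:
    % realistic samples have small deficiency, unrealistic ones have large deficiency.
    \[
      U(x) \;=\; -\log P(x) \;-\; K(x),
    \]
    % where K(x) denotes (prefix) Kolmogorov complexity. Since K is uncomputable,
    % any practical "universal critic" can only approximate U.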
2403.04437 Report StableDrag: Stable Dragging for Point-based Image Editing Yutao Cui, Xiaotong Zhao, Guozhen Zhang, Shengming Cao, Kai Ma, Limin Wang Point-based image editing has attracted remarkable attention since the emergence of DragGAN. Recently, DragDiffusion further pushes forward the generative quality via adapting this dragging technique to diffusion models. Despite these great successes, this dragging scheme exhibits two major drawbacks, namely inaccurate point tracking and incomplete motion supervision, which may result in unsatisfactory dragging outcomes. To tackle these issues, we build a stable and precise drag-based editing framework, coined as StableDrag, by designing a discriminative point tracking method and a confidence-based latent enhancement strategy for motion supervision. The former allows us to precisely locate the updated handle points, thereby boosting the stability of long-range manipulation, while the latter is responsible for guaranteeing that the optimized latent is as high-quality as possible across all the manipulation steps. Thanks to these unique designs, we instantiate two types of image editing models, StableDrag-GAN and StableDrag-Diff, which attain more stable dragging performance, as demonstrated through extensive qualitative experiments and quantitative assessment on DragBench. This paper presents StableDrag, a stable dragging framework for point-based image editing, improving upon previous methods like DragGAN and DragDiffusion. Existing dragging techniques suffer from inaccurate point tracking and incomplete motion supervision, leading to unsatisfactory editing outcomes. StableDrag introduces a discriminative point tracking method using a learned convolutional filter to better locate updated handle points. It also employs a confidence-based latent enhancement strategy during motion supervision to ensure high-quality optimization at each step. StableDrag accurately moves handle points to target points, even for long-range manipulations. It generates higher-quality editing results, preserving image fidelity and content consistency. Quantitative evaluation on DragBench shows StableDrag outperforms DragDiffusion in both accuracy and image quality. The current implementation relies on a local search strategy during point tracking, limiting its ability to differentiate between very similar objects. Future work includes exploring global tracking strategies and adapting StableDrag to other generative models. image editing, generative models, stable diffusion, draggan, point tracking
2403.04321 Report Discriminative Probing and Tuning for Text-to-Image Generation Leigang Qu, Wenjie Wang, Yongqi Li, Hanwang Zhang, Liqiang Nie, Tat-Seng Chua Despite advancements in text-to-image generation (T2I), prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However, the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling, we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets, including both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models. This paper proposes DPT, a novel paradigm to enhance text-image alignment in text-to-image generation models by improving their discriminative abilities. Existing text-to-image generation models often struggle with accurately aligning generated images with text prompts, especially in complex scenes. This misalignment issue hinders the generation of high-quality images that faithfully reflect the input text. DPT is a two-stage process. Stage 1 (Discriminative Probing) assesses the model's discriminative abilities on Image-Text Matching (ITM) and Referring Expression Comprehension (REC) tasks using a lightweight Discriminative Adapter. Stage 2 (Discriminative Tuning) improves these abilities through parameter-efficient fine-tuning using LoRA, focusing on enhancing both generative and discriminative performance. Additionally, a self-correction mechanism guides image generation towards better alignment during inference. DPT significantly improves text-image alignment across five diverse T2I benchmarks, outperforming existing state-of-the-art methods in terms of alignment accuracy. The study reveals that text-to-image generation models possess inherent discriminative abilities (global matching and local grounding), which can be effectively enhanced through discriminative tuning. The proposed self-correction mechanism effectively guides the generation process towards images better aligned with the text prompts. The study primarily focuses on two specific discriminative tasks (ITM and REC). Exploring the impact of other discriminative tasks on text-to-image generation could be beneficial. Balancing multi-task learning objectives, especially those related to generation and discrimination, requires further investigation to prevent potential conflicts during optimization. text-to-image generation, text-image alignment, discriminative probing, discriminative tuning, self-correction
2403.04306 Report Effectiveness Assessment of Recent Large Vision-Language Models Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan The advent of large vision-language models (LVLMs) represents a noteworthy advancement towards the pursuit of artificial general intelligence. However, the model efficacy across both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their efficacy in specialized tasks, we employ six challenging tasks across three distinct application scenarios, namely natural, healthcare, and industrial ones. Such six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization under these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope this study could provide useful insights for the future development of LVLMs, helping researchers improve LVLMs to cope with both general and specialized applications. This paper presents a comprehensive evaluation of popular large vision-language models (LVLMs) on both specialized and general vision-language tasks. Understanding the strengths and limitations of LVLMs in handling specialized and general tasks is crucial for guiding future research and development towards artificial general intelligence. The authors evaluate three open-source LVLMs (MiniGPT-v2, LLaVA-1.5, and Shikra) on six specialized tasks and five general tasks. They quantitatively analyze model performance using established metrics and qualitatively examine failure cases to identify potential reasons for inadequacy. LVLMs show promising but insufficient performance on specialized tasks due to limited domain knowledge and common weaknesses like object hallucination. Shikra and MiniGPT-v2 exhibit better localization capabilities than LLaVA-1.5, particularly in natural scenarios. In general tasks, all evaluated LVLMs exhibit significant room for improvement, particularly in object counting, spatial reasoning, and absurd question answering. The evaluation primarily focuses on a limited number of open-source LVLMs. Future work should explore effective strategies like prompt engineering and model optimization to improve LVLMs' performance on specialized tasks. large vision-language models, multi-modal understanding, specialized vision tasks, object hallucination, prompt engineering
2403.04303 Report LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking Jialin Li, Qiang Nie, Weifu Fu, Yuhuan Lin, Guangpin Tao, Yong Liu, Chengjie Wang Deep learning models, particularly those based on transformers, often employ numerous stacked structures, which possess identical architectures and perform similar functions. While effective, this stacking paradigm leads to a substantial increase in the number of parameters, posing challenges for practical applications. In today's landscape of increasingly large models, stacking depth can even reach dozens, further exacerbating this issue. To mitigate this problem, we introduce LORS (LOw-rank Residual Structure). LORS allows stacked modules to share the majority of parameters, requiring a much smaller number of unique ones per module to match or even surpass the performance of using entirely distinct ones, thereby significantly reducing parameter usage. We validate our method by applying it to the stacked decoders of a query-based object detector, and conduct extensive experiments on the widely used MS COCO dataset. Experimental results demonstrate the effectiveness of our method, as even with a 70% reduction in the parameters of the decoder, our method still enables the model to achieve comparable or even better performance. This paper proposes a novel Low-rank Residual Structure (LORS) for parameter reduction in deep learning models with stacked structures. LORS decomposes parameters into shared and private components, significantly reducing overall parameter usage without compromising performance. Stacking structures, while effective, significantly increase parameter count, posing challenges for training, inference, and deployment. LORS addresses this issue by promoting parameter sharing. LORS is formulated mathematically for both adaptive and static parameters, utilizing low-rank decomposition and residual connections. The approach is validated by applying it to the stacked decoders of AdaMixer, a query-based object detector. LORS reduced AdaMixer's decoder parameters by up to 70% while maintaining or even improving performance on the MS COCO dataset. Both adaptive and static LORS effectively reduced parameters without compromising performance. Experiments showed that shared and private weights are crucial, and the optimal configuration for LORS depends on the specific task and model. While effective, LORS requires a relatively long training process to fully realize its potential. The current LORS implementation slightly increases inference time due to serial and redundant computations, necessitating further optimization. parameter reduction, deep learning, stacked structures, object detection, low-rank decomposition
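A minimal PyTorch sketch of static-parameter LORS in its simplest form, matching the shared-plus-private decomposition described in the entry above: every stacked layer reuses one shared full-rank weight and adds its own tiny low-rank residual. The rank, initialization, and single linear layer are illustrative assumptions; the paper applies the idea to the stacked decoder of a query-based detector and also defines an adaptive-parameter variant.

    import torch
    import torch.nn as nn

    class LORSLinear(nn.Module):
        def __init__(self, dim, num_layers, rank=4):
            super().__init__()
            self.shared = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)   # shared across all layers
            self.A = nn.Parameter(torch.randn(num_layers, dim, rank) * 0.02)  # private low-rank factors
            self.B = nn.Parameter(torch.zeros(num_layers, rank, dim))

        def weight(self, layer_idx):
            # Effective weight of layer i: shared weight plus its low-rank residual A_i @ B_i.
            return self.shared + self.A[layer_idx] @ self.B[layer_idx]

        def forward(self, x, layer_idx):
            return x @ self.weight(layer_idx)

    # Six stacked layers sharing one 256x256 weight plus rank-4 private residuals.
    lors = LORSLinear(dim=256, num_layers=6, rank=4)
    x = torch.randn(2, 10, 256)
    for i in range(6):
        x = lors(x, i)
    total = sum(p.numel() for p in lors.parameters())
    print(total, "parameters vs", 6 * 256 * 256, "for six independent weights")

With these toy numbers the stack needs about 78K parameters instead of roughly 393K, which is the kind of saving that scales to the reported 70% decoder reduction.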
2403.04279 Report Controllable Generation with Text-to-Image Diffusion Models: A Survey Pu Cao, Feng Zhou, Qing Song, Lu Yang In the rapidly advancing realm of visual generation, diffusion models have revolutionized the landscape, marking a significant shift in capabilities with their impressive text-guided generative functions. However, relying solely on text for conditioning these models does not fully cater to the varied and complex requirements of different applications and scenarios. Acknowledging this shortfall, a variety of studies aim to control pre-trained text-to-image (T2I) models to support novel conditions. In this survey, we undertake a thorough review of the literature on controllable generation with T2I diffusion models, covering both the theoretical foundations and practical advancements in this domain. Our review begins with a brief introduction to the basics of denoising diffusion probabilistic models (DDPMs) and widely used T2I diffusion models. We then reveal the controlling mechanisms of diffusion models, theoretically analyzing how novel conditions are introduced into the denoising process for conditional generation. Additionally, we offer a detailed overview of research in this area, organizing it into distinct categories from the condition perspective: generation with specific conditions, generation with multiple conditions, and universal controllable generation. For an exhaustive list of the controllable generation literature surveyed, please refer to our curated repository at https://github.com/PRIV-Creation/Awesome-Controllable-T2I-Diffusion-Models. This paper presents a survey of controllable generation techniques using text-to-image diffusion models, focusing on how novel conditions beyond text prompts can steer the image generation process. Achieving fine-grained control over image generation is crucial for various applications. This survey provides a comprehensive overview of the rapidly developing field of controllable generation with T2I diffusion models. The paper categorizes existing methods based on condition types and analyzes two core controlling mechanisms: conditional score prediction and condition-guided score estimation. The survey introduces a structured taxonomy for classifying controllable generation methods based on condition types. It provides an in-depth analysis of how conditional score prediction and condition-guided score estimation methods incorporate novel conditions into T2I models. The paper highlights the diverse applications of conditional generation, demonstrating its impact on various tasks such as image manipulation, completion, and 3D generation. The paper primarily focuses on analyzing existing methods and their applications, leaving the exploration of potential future directions for controllable generation with T2I diffusion models as an open question. The survey primarily focuses on image generation, leaving the exploration of conditional generation in other domains like video and 3D as a potential area for future investigation. text-to-image generation, diffusion models, controllable generation, conditional image synthesis, generative ai
2403.04200 Report ACC-ViT : Atrous Convolution's Comeback in Vision Transformers Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara Transformers have elevated to the state-of-the-art vision architectures through innovations in attention mechanism inspired from visual perception. At present two classes of attentions prevail in vision transformers, regional and sparse attention. The former bounds the pixel interactions within a region; the latter spreads them across sparse grids. The opposing natures of them have resulted in a dilemma between either preserving hierarchical relation or attaining a global context. In this work, taking inspiration from atrous convolution, we introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information, while maintaining hierarchical relations. As a further tribute to atrous convolution, we redesign the ubiquitous inverted residual convolution blocks with atrous convolution. Finally, we propose a generalized, hybrid vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks. Our tiny version model achieves ~84% accuracy on ImageNet-1K, with fewer than 28.5 million parameters, which is a 0.42% improvement over the state-of-the-art MaxViT while having 8.4% fewer parameters. In addition, we have investigated the efficacy of ACC-ViT backbone under different evaluation settings, such as finetuning, linear probing, and zero-shot learning on tasks involving medical image analysis, object detection, and language-image contrastive learning. ACC-ViT is therefore a strong vision backbone, which is also competitive in mobile-scale versions, ideal for niche applications with small datasets. This paper introduces Atrous Attention, a novel attention mechanism for vision transformers, and proposes ACC-ViT, a hybrid vision transformer architecture based on this mechanism, inspired by atrous convolution. The proposed ACC-ViT aims to address the limitations of regional and sparse attention mechanisms in vision transformers by consolidating both local and global information while preserving hierarchical relations, thereby enhancing visual representation. The methodology involves designing Atrous Attention by emulating atrous convolution for sparse regional attention, implementing a gating operation for adaptive fusion of hierarchical features, using a shared MLP layer across parallel attentions for efficiency, and proposing Parallel Atrous Inverted Residual Convolution blocks. ACC-ViT achieves state-of-the-art performance, outperforming MaxViT and MOAT on ImageNet-1K with a tiny version achieving 83.97% accuracy. The model exhibits strong transfer learning capabilities, surpassing baselines on medical image datasets (HAM10000, EyePACS, BUSI). ACC-ViT demonstrates competence as a frozen backbone for object detection and shows promising zero-shot performance on the ELEVATER benchmark. Limitations include computational constraints preventing pretraining on larger datasets (ImageNet-21K) and developing larger models. Future work involves exploring the full potential of ACC-ViT by scaling up the model and evaluating it on higher-resolution images. vision transformer, atrous attention, atrous convolution, hybrid architecture, transfer learning
2403.03485 Report NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging Takahiro Shirakawa, Seiichi Uchida Layout-aware text-to-image generation is a task to generate multi-object images that reflect layout conditions in addition to text conditions. The current layout-aware text-to-image diffusion models still have several issues, including mismatches between the text and layout conditions and quality degradation of generated images. This paper proposes a novel layout-aware text-to-image diffusion model called NoiseCollage to tackle these issues. During the denoising process, NoiseCollage independently estimates noises for individual objects and then crops and merges them into a single noise. This operation helps avoid condition mismatches; in other words, it can put the right objects in the right places. Qualitative and quantitative evaluations show that NoiseCollage outperforms several state-of-the-art models. These successful results indicate that the crop-and-merge operation of noises is a reasonable strategy to control image generation. We also show that NoiseCollage can be integrated with ControlNet to use edges, sketches, and pose skeletons as additional conditions. Experimental results show that this integration boosts the layout accuracy of ControlNet. The code is available at https://github.com/univ-esuty/noisecollage. This paper proposes NoiseCollage, a novel training-free layout-aware text-to-image diffusion model for generating multi-object images by independently estimating and then cropping and merging noises for individual objects. Current layout-aware text-to-image diffusion models suffer from mismatches between text and layout conditions and image quality degradation. NoiseCollage tackles these issues by its unique noise manipulation strategy. NoiseCollage leverages a pre-trained diffusion model like Stable Diffusion. It estimates noises for individual objects independently, then crops and merges them into a single noise for final image generation. It uses masked cross-attention to localize visual information of each object and allows overlapping layout conditions with a weighted merging operation. NoiseCollage generates high-quality multi-object images with accurate alignment between objects and their text/layout conditions. Qualitative and quantitative evaluations demonstrate NoiseCollage's superior performance over state-of-the-art methods, showing less condition mismatches and better image quality. Integrating ControlNet into NoiseCollage enables finer control over object appearances through edge maps, sketches, and pose skeletons while maintaining layout accuracy. NoiseCollage occasionally struggles to accurately generate small objects. Future work includes enabling automatic layout inference from text conditions, support for point annotations, and exploring further noise manipulation techniques for tasks like video generation. text-to-image generation, diffusion models, layout-aware generation, noise manipulation, controlnet
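A small PyTorch sketch of the crop-and-merge operation described in the NoiseCollage entry above: noises estimated independently per object are cropped by their layout masks and merged, with overlapping regions handled by weighted averaging and uncovered regions falling back to a background estimate. Shapes, the normalization, and the background fallback are assumptions for illustration, not the paper's exact formulation.

    import torch

    def collage_noise(noise_per_object, masks, background_noise):
        # noise_per_object: (num_objects, c, h, w); masks: (num_objects, 1, h, w); background_noise: (1, c, h, w).
        weight_sum = masks.sum(dim=0, keepdim=True)                         # overlapping masks add up
        merged = (masks * noise_per_object).sum(dim=0, keepdim=True)
        return torch.where(weight_sum > 0,
                           merged / weight_sum.clamp(min=1e-6),             # weighted average inside layouts
                           background_noise)                                # fallback outside all layouts

    # Three objects, 4-channel 64x64 latent noise estimates.
    eps_objs = torch.randn(3, 4, 64, 64)
    masks = torch.zeros(3, 1, 64, 64)
    masks[0, :, :32, :32] = 1
    masks[1, :, 20:60, 30:60] = 1
    masks[2, :, 40:, :20] = 1
    eps_bg = torch.randn(1, 4, 64, 64)
    eps_merged = collage_noise(eps_objs, masks, eps_bg)

Running this merge at every denoising step is what lets each object follow its own text condition while the final image stays a single coherent sample.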
2403.03431 Report Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing Bingyan Liu, Chengyu Wang, Tingfeng Cao, Kui Jia, Jun Huang Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative Text-to-image generation. Yet, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers, which modify objects or object properties in images by manipulating feature components in attention layers during the generation process. However, little is known about what semantic meanings these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information that can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention maps in diffusion models. Moreover, based on our findings, we simplify popular image editing methods and propose a more straightforward yet more stable and efficient tuning-free procedure that only modifies self-attention maps of the specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets. This paper presents Free-Prompt-Editing (FPE), a simplified and efficient method for tuning-free text-guided image editing in diffusion models by modifying self-attention maps during the denoising process. Domain-specific image editing often requires modifying objects or properties in images, making tuning-free methods crucial for developers. However, existing approaches have limitations, such as unstable results due to cross-attention manipulation and high computational costs. The authors conduct a probing analysis of cross- and self-attention maps in Stable Diffusion, revealing that cross-attention maps contain object attribution information leading to editing failures, while self-attention maps preserve geometric and shape details. Based on these findings, FPE replaces specific self-attention maps during denoising, leveraging cross-attention for image-prompt alignment and self-attention for preserving source image structure. Cross-attention maps in Stable Diffusion contain object attribution information, making their manipulation prone to editing failures. Self-attention maps are crucial for preserving the original image's structure during editing. FPE consistently outperforms existing tuning-free methods on multiple datasets while being more efficient. FPE is limited by the generative capabilities of the underlying TIS model. Reconstruction of real images can result in loss of detail due to limitations of the VQ autoencoder. image editing, text-guided image editing, diffusion models, stable diffusion, attention mechanisms
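A minimal sketch of the self-attention replacement at the heart of FPE, assuming per-head (B, N, d) tensors: attention probabilities recorded from the source-image denoising pass are reused with the target pass's values, so layout and shape are preserved while the target prompt drives appearance; layer and timestep selection are omitted.

```python
import torch

def self_attention_with_injection(q_tgt, k_tgt, v_tgt, attn_src=None):
    """Sketch of FPE-style self-attention control.

    If attn_src (attention probabilities recorded from the source-image
    pass) is given, it replaces the target attention map, so the edited
    image keeps the source structure while v_tgt carries the new content.
    q_tgt, k_tgt, v_tgt: (B, N, d) per-head tensors.
    """
    d = q_tgt.shape[-1]
    attn_tgt = torch.softmax(q_tgt @ k_tgt.transpose(1, 2) / d ** 0.5, dim=-1)
    attn = attn_src if attn_src is not None else attn_tgt
    return attn @ v_tgt

B, N, d = 1, 256, 64
src_attn = torch.softmax(torch.randn(B, N, N), dim=-1)  # recorded earlier
out = self_attention_with_injection(
    torch.randn(B, N, d), torch.randn(B, N, d), torch.randn(B, N, d), src_attn)
print(out.shape)  # torch.Size([1, 256, 64])
```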
2403.02981 Report Doubly Abductive Counterfactual Inference for Text-based Image Editing Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, Yu-Gang Jiang We study text-based image editing (TBIE) of a single image by counterfactual inference because it is an elegant formulation to precisely address the requirement: the edited image should retain the fidelity of the original one. Through the lens of the formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to the overfitting of the single-image fine-tuning. To this end, we propose a Doubly Abductive Counterfactual inference framework (DAC). We first parameterize an exogenous variable as a UNet LoRA, whose abduction can encode all the image details. Second, we abduct another exogenous variable parameterized by a text encoder LoRA, which recovers the lost editability caused by the overfitted first abduction. Thanks to the second abduction, which exclusively encodes the visual transition from post-edit to pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit back to post-edit, thereby accomplishing the edit. Through extensive experiments, our DAC achieves a good trade-off between editability and fidelity. Thus, we can support a wide spectrum of user editing intents, including addition, removal, manipulation, replacement, style transfer, and facial change, which are extensively validated in both qualitative and quantitative evaluations. Codes are in https://github.com/xuesong39/DAC. This paper introduces Doubly Abductive Counterfactual (DAC), a novel framework for text-based image editing that leverages counterfactual inference to achieve a better trade-off between editability and fidelity compared to existing methods. Text-based image editing (TBIE) is challenging because existing techniques struggle to balance preserving the original image's fidelity while effectively incorporating textual edits. This paper provides a theoretical framework, counterfactual inference, to formally define TBIE and address this challenge. DAC uses a two-step abduction process. First, it parameterizes an exogenous variable as a UNet LoRA to encode image details (fidelity). Second, it introduces another exogenous variable, a text encoder LoRA, to recover editing capabilities lost due to overfitting in the first abduction. The method then inverts the second abduction to apply the semantic change, achieving the desired edit. DAC achieves a good balance between editability and fidelity, outperforming existing methods in qualitative and quantitative evaluations. The method supports a wide range of editing intents, including addition, removal, manipulation, replacement, style transfer, and facial changes. Ablation studies confirm the importance of the two-step abduction process, annealing strategy, and specific LoRA parameterization for optimal performance. The method's reliance on stable diffusion as the generative model introduces limitations related to random seed sensitivity, comprehension of referring expressions, and lack of common sense. Multi-turn editing leads to gradual degradation in image quality due to information loss during abduction. text-based image editing, counterfactual inference, stable diffusion, lora, image manipulation
2403.02827 Report Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation Weijie Li, Litong Gong, Yiran Zhu, Fanda Fan, Biao Wang, Tiezheng Ge, Bo Zheng Image-to-video (I2V) generation tasks always suffer from keeping high fidelity in the open domains. Traditional image animation techniques primarily focus on specific domains such as faces or human poses, making them difficult to generalize to open domains. Several recent I2V frameworks based on diffusion models can generate dynamic content for open domain images but fail to maintain fidelity. We found that two main factors of low fidelity are the loss of image details and the noise prediction biases during the denoising process. To this end, we propose an effective method that can be applied to mainstream video diffusion models. This method achieves high fidelity based on supplementing more precise image information and noise rectification. Specifically, given a specified image, our method first adds noise to the input image latent to keep more details, then denoises the noisy latent with proper rectification to alleviate the noise prediction biases. Our method is tuning-free and plug-and-play. The experimental results demonstrate the effectiveness of our approach in improving the fidelity of generated videos. For more image-to-video generated results, please refer to the project website: https://noise-rectification.github.io. This paper proposes a noise rectification method for high-fidelity image-to-video generation, addressing the limitations of existing approaches in maintaining detail and mitigating noise prediction biases. Generating high-fidelity videos from still images is challenging, with existing methods struggling to maintain detail and suffering from noise accumulation during the denoising process. The method utilizes a "noising and rectified denoising" approach. It first adds noise to the input image latent. Then, it rectifies the predicted noise during denoising by leveraging the known initial noise, striking a balance between fidelity and motion. The method outperforms existing image-to-video generation techniques in preserving fine-grained details and achieving higher fidelity. Ablation studies demonstrate the impact of rectification weight and timestep on fidelity and motion. The method is shown to be plug-and-play, effectively extending various text-to-video frameworks for high-fidelity image-to-video generation. The method, while excelling in fidelity, may lead to a slight reduction in motion intensity. Future work will focus on enhancing motion intensity while preserving the achieved high fidelity. image-to-video generation, diffusion models, noise rectification, fidelity enhancement, open-domain video synthesis
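A minimal sketch of the rectification idea, assuming the noise injected into the image latent is known exactly; the linear blend and fixed weight below are illustrative stand-ins for the paper's rectification rule and timestep schedule.

```python
import torch

def rectify_noise(eps_pred, eps_known, weight=0.5):
    """Blend the model's noise prediction toward the known injected noise.

    eps_pred:  noise predicted by the video diffusion model at this step.
    eps_known: noise that was actually added to the input-image latent,
               known exactly and usable to correct prediction bias.
    weight:    rectification strength; higher favors fidelity to the image,
               lower favors motion (an assumed linear blend for illustration).
    """
    return weight * eps_known + (1.0 - weight) * eps_pred

eps_pred = torch.randn(4, 16, 32, 32)   # (C, frames, H, W); shapes illustrative
eps_known = torch.randn(4, 16, 32, 32)
print(rectify_noise(eps_pred, eps_known, weight=0.7).std())
```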
2403.02799 Report DPPA: Pruning Method for Large Language Model to Model Merging Yaochen Zhu, Rui Xia, Jiajun Zhang Model merging combines fine-tuned models derived from multiple domains, with the intent of enhancing the model's proficiency across various domains. The principal concern is the resolution of parameter conflicts. A substantial amount of existing research remedies this issue during the merging stage, with the latest studies focusing on resolving it during the pruning stage. The DARE approach has exhibited promising outcomes when applied to a simplistic fine-tuned model. However, the efficacy of this method tends to wane when employed on complex fine-tuned models that show a significant parameter bias relative to the baseline model. In this paper, we introduce a dual-stage method termed Dynamic Pruning Partition Amplification (DPPA), devised to tackle the challenge of merging complex fine-tuned models. First, we introduce Dynamic Pruning (DP), an improved approach based on magnitude pruning that aims to enhance performance at higher pruning rates. Second, we propose Dynamic Partition Amplification (DPA), a rescaling strategy designed to dynamically amplify parameter partitions in relation to their significance levels. The experimental results show that our method maintains a mere 20% of domain-specific parameters and yet delivers a performance comparable to other methodologies that preserve up to 90% of parameters. Furthermore, our method displays outstanding performance post-pruning, leading to a significant improvement of nearly 20% performance in model merging. We make our code available on GitHub. This paper presents DPPA, a dual-stage method for merging large language models fine-tuned on different domains by addressing parameter conflicts through a novel pruning and rescaling strategy. Model merging aims to combine domain-specific models into a single model with multi-domain capabilities. However, parameter conflicts between models often lead to performance degradation, which DPPA aims to mitigate. DPPA first employs Dynamic Pruning (DP) to prune less significant parameters based on their magnitudes at layer and linear layer levels. Then, it uses Dynamic Partition Amplification (DPA) to dynamically rescale the remaining parameters based on their importance derived from pruning rates. DPPA retains only 20% of domain-specific parameters while achieving comparable performance to other methods retaining 90% of parameters. DPPA outperforms the state-of-the-art merging method DARE by nearly 20% in performance. Analysis suggests DPPA implicitly partitions parameters by dimensions, allowing it to restore domain-specific capabilities by amplifying important dimensions. DPPA's performance is suboptimal for models with minor differences compared to the base model. DPA requires significant time to find the optimal rescaling ratio. model merging, large language models, pruning, rescaling, parameter conflicts
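A minimal per-tensor sketch of the two DPPA stages, assuming a single weight matrix: magnitude pruning of the fine-tuned delta followed by amplification of the survivors. The real method sets pruning rates per layer/linear layer and searches the rescaling factor; 1/keep_ratio below is only a stand-in for that searched factor.

```python
import torch

def prune_and_amplify_delta(w_finetuned, w_base, keep_ratio=0.2, amplify=None):
    """Sketch of DPPA applied to one weight tensor.

    1) Dynamic pruning (simplified): keep only the largest-magnitude entries
       of the domain delta (w_finetuned - w_base).
    2) Dynamic partition amplification (simplified): rescale the surviving
       entries; 1/keep_ratio stands in for the searched amplification factor.
    """
    delta = w_finetuned - w_base
    k = max(1, int(keep_ratio * delta.numel()))
    threshold = delta.abs().flatten().topk(k).values.min()
    mask = (delta.abs() >= threshold).float()
    amplify = amplify if amplify is not None else 1.0 / keep_ratio
    return w_base + amplify * mask * delta

w_base = torch.randn(256, 256)
w_ft = w_base + 0.05 * torch.randn(256, 256)
w_pruned = prune_and_amplify_delta(w_ft, w_base, keep_ratio=0.2)
print((w_pruned - w_base).ne(0).float().mean())  # ~0.2 of entries kept
```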
2403.02775 Report EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang Large language models (LLMs) have proven to be very superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks. Hence in this work, we explore an important question: Can we design a data-independent quantization method for LLMs to guarantee its generalization performance? In this work, we propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observation indicates that two factors: outliers in the weight and quantization ranges, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers (less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves comparable performance to the original model. Since EasyQuant does not depend on any training data, the generalization performance of quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel so that the quantized model could be attained in a few minutes even for LLMs over 100B. To our best knowledge, we are the first work that achieves almost lossless quantization performance for LLMs under a data-independent setting and our algorithm runs over 10 times faster than the data-dependent methods. This paper proposes "EasyQuant", a training-free and data-free weight quantization algorithm for Large Language Models (LLMs) that isolates outliers in weight from quantization and optimizes quantization ranges to improve performance. LLMs are computationally and memory intensive. Quantization reduces these overheads, but existing methods suffer from generalization issues due to data-dependent calibration. This work aims for a data-free approach to guarantee generalization performance. EasyQuant identifies outliers in the weight matrices using a sigma-based criterion and keeps them unquantized. It then optimizes the quantization ranges for the remaining weights by minimizing reconstruction error using gradient descent. EasyQuant achieves comparable performance to the original full-precision LLMs after quantization. It significantly outperforms naive Round-to-Nearest (RTN) quantization in a data-free setting. EasyQuant shows better performance than data-dependent algorithms like GPTQ on several benchmarks. The outlier recovery in EasyQuant requires additional CUDA kernels. It focuses on weight-only quantization and doesn't address the computational cost reduction, leaving latency minimization for future work. model quantization, large language models, data-free quantization, outlier isolation, quantization range optimization
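A minimal sketch of the two ingredients described above, applied to one weight matrix: outliers beyond a sigma threshold stay in full precision while per-row quantization scales are optimized by gradient descent on reconstruction error. The bit width, sigma threshold, optimizer settings, and per-row grouping are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def easyquant_sketch(w, bits=4, sigma=3.0, steps=100, lr=1e-3):
    """Data-free weight quantization sketch: isolate outliers, optimize ranges.

    w: (out_features, in_features) weight matrix.
    Returns a dequantized approximation of w with outliers kept exactly.
    """
    qmax = 2 ** (bits - 1) - 1
    mean = w.mean(dim=1, keepdim=True)
    std = w.std(dim=1, keepdim=True)
    outlier = (w - mean).abs() > sigma * std          # typically <1% of weights
    inlier = (~outlier).float()

    # Per-row scale, initialized from the max-abs inlier weight, then tuned by
    # gradient descent to minimize reconstruction error on the inliers only.
    init = w.masked_fill(outlier, 0).abs().amax(dim=1, keepdim=True) / qmax
    scale = init.clamp(min=1e-8).clone().requires_grad_(True)
    opt = torch.optim.Adam([scale], lr=lr)
    for _ in range(steps):
        q = (w / scale).clamp(-qmax, qmax)
        q = q + (q.round() - q).detach()              # straight-through rounding
        loss = ((q * scale - w) * inlier).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            scale.clamp_(min=1e-6)

    with torch.no_grad():
        w_hat = (w / scale).round().clamp(-qmax, qmax) * scale
        w_hat[outlier] = w[outlier]                   # outliers stay full precision
    return w_hat

w = torch.randn(128, 512)
print((easyquant_sketch(w) - w).abs().mean())
```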
2403.02677 Report Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters Weizhi Wang, Khalil Mrini, Linjie Yang, Sateesh Kumar, Yu Tian, Xifeng Yan, Heng Wang We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). Our approach outperforms predominant filtering methods (e.g., CLIPScore) via integrating the recent advances in MLMs. We design four distinct yet complementary metrics to holistically measure the quality of image-text data. A new pipeline is established to construct high-quality instruction data for fine-tuning MLMs as data filters. Comparing with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and various downstream tasks. Our MLM filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore. An additional ablation study is provided to verify our design choices for the MLM filter. The paper proposes using fine-tuned Multimodal Language Models (MLMs) as data filters to improve the quality of image-text datasets for training Vision-Language Models (VLMs). Existing methods like CLIPScore rely on holistic image-text alignment and struggle to capture fine-grained details, limiting the quality of filtered data and downstream VLM performance. The authors fine-tune open-source MLMs on a dataset constructed using proprietary LLMs (GPT-4, GPT-4V) to score image-text pairs across four metrics: Image-Text Matching, Object Detail Fulfillment, Caption Text Quality, and Semantic Understanding. Different design choices for data construction and filtering metrics are evaluated on the DataComp benchmark. MLM filters significantly outperform CLIPScore on DataComp, achieving 1.7% higher average accuracy over 38 datasets. Combining multiple MLM-based metrics (ITM and ODF) further improves filtering performance. MLM filter scores demonstrate stronger correlation with human judgment compared to CLIPScore. Limited effectiveness of certain metrics (CTQ, SU) on classification-focused benchmarks. Computational cost of MLM filtering despite acceleration efforts. data filtering, multimodal language models, vision-language models, image-text alignment, data quality
2403.02580 Report What do we learn from inverting CLIP models? Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, Tom Goldstein We employ an inversion-based approach to examine CLIP models. Our examination reveals that inverting CLIP models results in the generation of images that exhibit semantic alignment with the specified target prompts. We leverage these inverted images to gain insights into various aspects of CLIP models, such as their ability to blend concepts and inclusion of gender biases. We notably observe instances of NSFW (Not Safe For Work) images during model inversion. This phenomenon occurs even for semantically innocuous prompts, like "a beautiful landscape," as well as for prompts involving the names of celebrities. This paper investigates the capabilities and biases of CLIP models through model inversion, revealing insights into their ability to blend concepts, presence of NSFW content, and gender biases. Understanding the capabilities and biases of CLIP models is crucial due to their widespread use in various AI applications, including text-to-image generation. The study employs an inversion-based approach, optimizing input images to align with given textual prompts. It utilizes techniques like augmentations, ensembling, and regularization terms to generate meaningful inversions. CLIP models demonstrate a capacity to blend concepts, generating images that accurately combine multiple ideas from a given prompt. Model inversion reveals the presence of NSFW content within CLIP models, even for seemingly innocuous prompts, suggesting limitations in training data curation. CLIP models exhibit gender bias, particularly in associating professions and social statuses with specific genders. The study acknowledges limitations in using generative strategies to analyze a model not primarily intended for generative tasks. Future work could explore addressing NSFW content generation stemming from CLIP embeddings in text-to-image generation models. clip, model inversion, nsfw content, gender bias, text-to-image generation
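A minimal sketch of CLIP inversion, assuming the Hugging Face openai/clip-vit-base-patch32 checkpoint: pixels are optimized to maximize similarity with the prompt embedding under simple flip augmentations; the paper's ensembling and regularization terms are omitted, so results will be cruder than those reported.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

def invert_clip(prompt, steps=200, lr=0.05, size=224,
                model_name="openai/clip-vit-base-patch32"):
    """Sketch of CLIP inversion: optimize pixels to match a text prompt."""
    model = CLIPModel.from_pretrained(model_name).eval()
    model.requires_grad_(False)
    tok = CLIPTokenizer.from_pretrained(model_name)
    with torch.no_grad():
        txt = model.get_text_features(**tok(prompt, return_tensors="pt"))
        txt = F.normalize(txt, dim=-1)

    # CLIP preprocessing statistics.
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)
    image = torch.rand(1, 3, size, size, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        # Random horizontal flips stand in for the paper's augmentations,
        # ensembling, and regularizers; they discourage adversarial textures.
        views = [image[0] if torch.rand(()) < 0.5 else torch.flip(image[0], dims=[-1])
                 for _ in range(4)]
        pixels = (torch.stack(views).clamp(0, 1) - mean) / std
        img_emb = F.normalize(model.get_image_features(pixel_values=pixels), dim=-1)
        loss = -(img_emb @ txt.T).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return image.detach().clamp(0, 1)

# img = invert_clip("a beautiful landscape")  # (1, 3, 224, 224) in [0, 1]
```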
2403.02473 Report When do Convolutional Neural Networks Stop Learning? Sahan Ahmad, Gabriel Trahan, Aminul Islam Convolutional Neural Networks (CNNs) have demonstrated outstanding performance in computer vision tasks such as image classification, detection, segmentation, and medical image analysis. In general, an arbitrary number of epochs is used to train such neural networks. In a single epoch, the entire training data -- divided by batch size -- are fed to the network. In practice, validation error with training loss is used to estimate the neural network's generalization, which indicates the optimal learning capacity of the network. Current practice is to stop training when the training loss decreases and the gap between training and validation error increases (i.e., the generalization gap) to avoid overfitting. However, this is a trial-and-error-based approach which raises a critical question: Is it possible to estimate when neural networks stop learning based on training data? This research work introduces a hypothesis that analyzes the data variation across all the layers of a CNN variant to anticipate its near-optimal learning capacity. In the training phase, we use our hypothesis to anticipate the near-optimal learning capacity of a CNN variant without using any validation data. Our hypothesis can be deployed as a plug-and-play to any existing CNN variant without introducing additional trainable parameters to the network. We test our hypothesis on six different CNN variants and three different general image datasets (CIFAR10, CIFAR100, and SVHN). The result based on these CNN variants and datasets shows that our hypothesis saves 58.49% of computational time (on average) in training. We further evaluate our hypothesis on ten medical image datasets and compare with the MedMNIST-V2 benchmark. Based on our experimental results, we save approximately 44.1% of computational time without losing accuracy against the MedMNIST-V2 benchmark. This paper introduces a hypothesis and method to anticipate the near-optimal learning capacity of a Convolutional Neural Network (CNN) during training, potentially saving computational time by stopping training earlier. Selecting the number of training epochs for CNNs is currently a trial-and-error process that relies on monitoring validation error, which may not be reliable and incurs extra computational cost. This method aims to address this by predicting when the model stops learning significantly from the training data. The method analyzes data variation after the convolution operation in each layer of the CNN across epochs. It introduces the concept of a "stability vector" for each layer, which tracks the standard deviation of data after convolution for each iteration in an epoch. By comparing the mean stability vectors of consecutive epochs, the method determines when the data variation stabilizes, implying the model has reached its near-optimal learning capacity. The proposed hypothesis, when applied to six different CNN architectures and three image datasets, saves 32% to 79% of the computational time compared to using a fixed 200 epochs. The method achieves comparable testing accuracy to traditional training with validation data. Analysis of data variation patterns across layers provides insights into the learning dynamics of CNNs and supports the hypothesis that stability indicates near-optimal learning capacity. The method relies on a heuristic choice of rounding decimal places when comparing mean stability vectors, potentially limiting its generalizability. Further investigation is needed to apply the method to other deep neural networks beyond CNNs. optimization, cnn, deep learning, image classification, early stopping
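A minimal sketch of the stability-vector idea, assuming a single monitored convolution layer and a toy training loop: the per-iteration standard deviation of post-convolution activations is recorded with a forward hook, and training stops once the rounded epoch mean stops changing. The rounding precision and the choice of monitoring one layer (rather than all layers, as in the paper) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StabilityMonitor:
    """Sketch: track the std of a conv layer's output per iteration and decide
    whether the network has (nearly) stopped learning from the training data."""

    def __init__(self, conv_layer, decimals=3):
        self.decimals = decimals
        self.iter_stds = []
        self.prev_mean = None
        conv_layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        self.iter_stds.append(output.detach().std().item())

    def end_epoch(self):
        """Return True if the rounded mean stability value stopped changing."""
        mean_std = round(sum(self.iter_stds) / len(self.iter_stds), self.decimals)
        self.iter_stds = []
        stop = self.prev_mean is not None and mean_std == self.prev_mean
        self.prev_mean = mean_std
        return stop

# Toy usage with random data standing in for a real training loop.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
monitor = StabilityMonitor(net[0])
opt = torch.optim.SGD(net.parameters(), lr=0.1)
for epoch in range(5):
    for _ in range(10):
        x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
        loss = F.cross_entropy(net(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    if monitor.end_epoch():
        print(f"training can stop after epoch {epoch}")
        break
```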
2403.02460 Report MagicClay: Sculpting Meshes With Generative Neural Fields Amir Barda, Vladimir G. Kim, Noam Aigerman, Amit H. Bermano, Thibault Groueix The recent developments in neural fields have brought phenomenal capabilities to the field of shape generation, but they lack crucial properties, such as incremental control - a fundamental requirement for artistic work. Triangular meshes, on the other hand, are the representation of choice for most geometry related tasks, offering efficiency and intuitive control, but do not lend themselves to neural optimization. To support downstream tasks, previous art typically proposes a two-step approach, where first a shape is generated using neural fields, and then a mesh is extracted for further processing. Instead, in this paper we introduce a hybrid approach that maintains both a mesh and a Signed Distance Field (SDF) representations consistently. Using this representation, we introduce MagicClay - an artist friendly tool for sculpting regions of a mesh according to textual prompts while keeping other regions untouched. Our framework carefully and efficiently balances consistency between the representations and regularizations in every step of the shape optimization; Relying on the mesh representation, we show how to render the SDF at higher resolutions and faster. In addition, we employ recent work in differentiable mesh reconstruction to adaptively allocate triangles in the mesh where required, as indicated by the SDF. Using an implemented prototype, we demonstrate superior generated geometry compared to the state-of-the-art, and novel consistent control, allowing sequential prompt-based edits to the same mesh for the first time. Introduces MagicClay, an artist-friendly tool for sculpting regions of a mesh based on text prompts, using a hybrid mesh-SDF representation. Combines the advantages of neural fields (robust generation) and meshes (efficiency, control), enabling localized and sequential prompt-based mesh editing. Jointly optimizes a mesh and SDF, using score distillation sampling from text prompts. Employs differentiable rendering, consistency losses, and dynamic topology updates via ROAR. Generates smoother and higher-quality geometry than existing text-to-3D methods. Enables localized mesh edits according to textual prompts, preserving unedited regions. Outperforms text-driven mesh deformation baselines in terms of expressiveness and control. Limited by the quality and noise of SDS gradients. Computationally expensive, taking around 1 hour per prompt on an A100 GPU. 3d shape generation, text-guided editing, hybrid representations, mesh sculpting, score distillation sampling
2403.02332 Report UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control Xuweiyi Chen, Tian Xia, Sihan Xu Video Diffusion Models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite the progress, ensuring consistency across frames remains a challenge, particularly when using text prompts as control conditions. To address this problem, we introduce UniCtrl, a novel, plug-and-play method that is universally applicable to improve the spatiotemporal consistency and motion diversity of videos generated by text-to-video models without additional training. UniCtrl ensures semantic consistency across different frames through cross-frame self-attention control, and meanwhile, enhances the motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results demonstrate UniCtrl's efficacy in enhancing various text-to-video models, confirming its effectiveness and universality. The paper introduces UniCtrl, a training-free, plug-and-play method to enhance the spatiotemporal consistency and motion diversity of videos generated by text-to-video diffusion models. Existing text-to-video diffusion models struggle to maintain consistency across frames, especially when guided by text prompts, leading to discrepancies in generated content over time. UniCtrl leverages a three-pronged approach: 1) Cross-Frame Self-Attention Control ensures semantic consistency by applying keys and values from the first frame to subsequent frames, 2) Motion Injection preserves motion dynamics by using original queries for spatial information, and 3) Spatiotemporal Synchronization enhances coherence by synchronizing latent representations between frames. UniCtrl significantly improves spatiotemporal consistency across different text-to-video models, as evidenced by quantitative metrics like DINO. The method effectively preserves motion diversity within generated videos, surpassing baseline models and alternative approaches in metrics like RAFT. UniCtrl demonstrates strong compatibility with existing enhancement techniques, as shown by its successful integration with FreeInit for further improvements. UniCtrl's reliance on the attention mechanism limits its applicability to non-attention-based models. The method's constraint of using the same values for each frame restricts its ability to generate videos with varying colors across frames. video diffusion, spatiotemporal consistency, attention control, text-to-video generation, motion preservation
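A minimal sketch of cross-frame self-attention control, assuming per-head (frames, tokens, dim) tensors: each frame keeps its own queries (motion injection retains the original queries) but attends to the first frame's keys and values; the spatiotemporal synchronization step is omitted.

```python
import torch

def cross_frame_self_attention(q, k, v, anchor_frame=0):
    """Sketch of UniCtrl-style cross-frame self-attention control.

    q, k, v: (frames, tokens, dim) per-head tensors for one video sample.
    Every frame keeps its own queries (preserving motion/spatial cues) but
    attends to the anchor frame's keys and values, tying the appearance of
    all frames together.
    """
    n_frames, N, d = q.shape
    k_anchor = k[anchor_frame:anchor_frame + 1].expand(n_frames, N, d)
    v_anchor = v[anchor_frame:anchor_frame + 1].expand(n_frames, N, d)
    attn = torch.softmax(q @ k_anchor.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_anchor

q = torch.randn(16, 256, 64)
out = cross_frame_self_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)  # torch.Size([16, 256, 64])
```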
2403.02325 Report Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data that includes visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer (i.e., the model's prior). CRG achieves substantial improvements in a wide variety of VL tasks: When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse region-based tasks such as recognition, math, and object relationship reasoning. We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp, as well as to compositional generalization -- improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe -- and to image-text alignment for generated images, where we improve by up to 8.4 AUROC and 6.8 F1 points on SeeTRUE. When reference regions are absent, CRG allows us to re-rank proposed regions in referring expression comprehension and phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an average gain of 3.2% in accuracy. Our analysis explores alternative masking strategies for CRG, quantifies CRG's probability shift, and evaluates the role of region guidance strength, empirically validating CRG's design choices. This paper introduces Contrastive Region Guidance (CRG), a training-free method to improve visual grounding in vision-language models (VLMs) by leveraging classifier-free guidance (CFG) to focus on specific image regions. Current methods for incorporating visual prompts into VLMs either rely on proprietary, expensive models like GPT-4V or require costly finetuning on datasets with visual prompts. CRG addresses these limitations by offering a training-free approach compatible with various existing models. CRG contrasts the VLM's output distribution on the original image with its output on a masked version where specific regions are blacked out. This contrast highlights the importance of the masked region for the model's prediction. CRG significantly improves visual prompt following, matching the performance of fine-tuned models on ViP-Bench and even outperforming them in some categories. CRG enhances spatial reasoning on the challenging 'Set of 4' setting of the WhatsUp benchmark, leading to substantial accuracy gains over baseline models. CRG leads to improvements in compositional generalization, boosting performance on the challenging SugarCrepe benchmark and demonstrating the method's ability to enhance models' understanding of language compositionality. CRG requires running the VLM twice (on original and masked images), leading to increased computational cost compared to inference without CRG. The current implementation of CRG relies on object detection models to propose bounding boxes when visual markers are absent. Integrating better visual encoders could further improve efficiency and accuracy. visual grounding, vision-language models, visual prompting, classifier-free guidance, compositional generalization
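A minimal sketch of the contrast step, assuming next-token logits from two forward passes (one on the original image, one on the image with the region of interest blacked out); the logit-space extrapolation below mirrors classifier-free guidance and is an assumed simplification of the paper's exact combination rule.

```python
import torch

def contrastive_region_guidance(logits_full, logits_masked, alpha=1.0):
    """Sketch of a CRG-style guided next-token distribution.

    logits_full:   logits from the VLM given the original image.
    logits_masked: logits given the image with the region of interest blacked
                   out (the model's "prior" without the needed evidence).
    alpha:         guidance strength; alpha=0 recovers the unguided model.
    """
    guided = logits_full + alpha * (logits_full - logits_masked)
    return torch.log_softmax(guided, dim=-1)

vocab = 32000
lf, lm = torch.randn(vocab), torch.randn(vocab)
print(contrastive_region_guidance(lf, lm).exp().sum())  # ~1.0
```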
2403.02234 Report 3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors Fangzhou Hong, Jiaxiang Tang, Ziang Cao, Min Shi, Tong Wu, Zhaoxi Chen, Shuai Yang, Tengfei Wang, Liang Pan, Dahua Lin, Ziwei Liu We present a two-stage text-to-3D generation system, namely 3DTopia, which generates high-quality general 3D assets within 5 minutes using hybrid diffusion priors. The first stage samples from a 3D diffusion prior directly learned from 3D data. Specifically, it is powered by a text-conditioned tri-plane latent diffusion model, which quickly generates coarse 3D samples for fast prototyping. The second stage utilizes 2D diffusion priors to further refine the texture of coarse 3D models from the first stage. The refinement consists of both latent and pixel space optimization for high-quality texture generation. To facilitate the training of the proposed system, we clean and caption the largest open-source 3D dataset, Objaverse, by combining the power of vision language models and large language models. Experiment results are reported qualitatively and quantitatively to show the performance of the proposed system. Our codes and models are available at https://github.com/3DTopia/3DTopia 3DTopia, a two-stage text-to-3D generation system using hybrid diffusion priors, enabling fast prototyping and high-quality 3D generation. Generating 3D assets from text is crucial for various applications but challenging due to limited data and computational demands. Existing methods compromise either speed or quality. The first stage employs a tri-plane latent diffusion model trained on a captioned and cleaned Objaverse dataset for fast coarse 3D generation. The second stage refines texture using Score Distillation Sampling with latent-space and pixel-space 2D diffusion priors. 3DTopia outperforms Point-E and Shap-E in text-to-3D generation quality, even with less training data. The proposed 3D captioning pipeline, leveraging LLaVA and GPT-3.5, produces more detailed and accurate captions compared to existing methods. Hybrid refinement using both latent-space and pixel-space diffusion priors achieves a balance between texture diversity and quality. Limited ability to handle complex, concept-mixing text prompts due to the lack of strong 2D priors in the first stage. Dependence on the quality of the first stage mesh for refinement. text-to-3d generation, diffusion models, 3d captioning, score distillation sampling, tri-plane representation
2403.02151 Report TripoSR: Fast 3D Object Reconstruction from a Single Image Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, Yan-Pei Cao This technical report introduces TripoSR, a 3D reconstruction model leveraging transformer architecture for fast feed-forward 3D generation, producing 3D mesh from a single image in under 0.5 seconds. Building upon the LRM network architecture, TripoSR integrates substantial improvements in data processing, model design, and training techniques. Evaluations on public datasets show that TripoSR exhibits superior performance, both quantitatively and qualitatively, compared to other open-source alternatives. Released under the MIT license, TripoSR is intended to empower researchers, developers, and creatives with the latest advancements in 3D generative AI. TripoSR, a fast feed-forward 3D reconstruction model leveraging transformer architecture for high-quality 3D mesh generation from single images in under 0.5 seconds. Addresses limitations of slow generation speed and control challenges in existing 3D generation methods, enabling efficient and scalable 3D model creation. Builds upon the LRM architecture with improvements in data curation, rendering, model design (triplane channel optimization), and training techniques (mask loss, local rendering supervision). Outperforms state-of-the-art methods on GSO and OmniObject3D datasets in terms of CD and F-score metrics. Achieves superior reconstruction quality for both shape and texture details compared to baselines. Maintains fast inference speed, producing a 3D mesh in approximately 0.5 seconds on an NVIDIA A100 GPU. Reliance on high-resolution rendering for supervision may pose computational challenges. Future work could explore extending the model for multi-view 3D reconstruction or text-to-3D generation. 3d reconstruction, transformer, nerf, single image, generative ai
2403.02118 Report Position: Towards Implicit Prompt For Text-To-Image Models Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (hint at a target without explicitly mentioning it). These prompts may get rid of safety constraints and pose potential threats to the applications of these models. This position paper highlights the current state of T2I models toward implicit prompts. We present a benchmark named ImplicitBench and conduct an investigation on the performance and impacts of implicit prompts with popular T2I models. Specifically, we design and collect more than 2,000 implicit prompts of three aspects: General Symbols, Celebrity Privacy, and Not-Safe-For-Work (NSFW) Issues, and evaluate six well-known T2I models' capabilities under these implicit prompts. Experiment results show that (1) T2I models are able to accurately create various target symbols indicated by implicit prompts; (2) Implicit prompts bring potential risks of privacy leakage for T2I models. (3) Constraints of NSFW in most of the evaluated T2I models can be bypassed with implicit prompts. We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks. This paper introduces the concept of "implicit prompts" in text-to-image generation, which describe targets without directly naming them. It presents ImplicitBench, a benchmark to evaluate the capabilities and risks of T2I models in handling such prompts. Existing T2I benchmarks primarily focus on explicit prompts, neglecting the potential of implicit prompts to enhance creativity and the risks they pose to safety constraints. This work aims to bridge this gap and advocate for responsible development in the field. The authors curated ImplicitBench spanning three aspects: General Symbols, Celebrity Privacy, and NSFW Issues. They evaluated six popular T2I models on this benchmark using tailored evaluation methods, combining MLLMs, face recognition, and safety checkers. T2I models demonstrate a promising ability to interpret and generate images from implicit prompts, particularly for general symbols. Implicit prompts can bypass safety filters, enabling the generation of content that infringes on celebrity privacy or falls under NSFW categories. The risk of generating unsafe content through implicit prompts is amplified by the use of specific terminologies, detailed descriptions, and ambiguous language. The definition and scope of "implicit prompts" are still under exploration and require further refinement. Developing robust safety mechanisms and policy constraints tailored for implicit prompts is crucial to mitigate potential risks. text-to-image generation, implicit prompts, benchmarking, safety constraints, ethical considerations
2403.02084 Report ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Min Zheng, Lean Fu Recent advancement in text-to-image models (e.g., Stable Diffusion) and corresponding personalized technologies (e.g., DreamBooth and LoRA) enables individuals to generate high-quality and imaginative images. However, they often suffer from limitations when generating images with resolutions outside of their trained domain. To overcome this limitation, we present the Resolution Adapter (ResAdapter), a domain-consistent adapter designed for diffusion models to generate images with unrestricted resolutions and aspect ratios. Unlike other multi-resolution generation methods that process images of static resolution with complex post-process operations, ResAdapter directly generates images with the dynamical resolution. Especially, after learning a deep understanding of pure resolution priors, ResAdapter trained on the general dataset, generates resolution-free images with personalized diffusion models while preserving their original style domain. Comprehensive experiments demonstrate that ResAdapter with only 0.5M can process images with flexible resolutions for arbitrary diffusion models. More extended experiments demonstrate that ResAdapter is compatible with other modules (e.g., ControlNet, IP-Adapter and LCM-LoRA) for image generation across a broad range of resolutions, and can be integrated into other multi-resolution model (e.g., ElasticDiffusion) for efficiently generating higher-resolution images. Project link is https://res-adapter.github.io This paper proposes ResAdapter, a plug-and-play adapter for diffusion models that enables generation of images with unrestricted resolutions and aspect ratios while preserving the original style domain. Current text-to-image models struggle to generate consistent images outside their trained resolution, impacting fidelity and composition. Existing methods are either computationally expensive or disrupt the original style domain of personalized models. ResAdapter utilizes ResCLoRA for resolution interpolation, dynamically matching the receptive field of convolution layers to feature map size. ResENorm addresses resolution extrapolation by adapting normalization layers to handle statistical distribution in higher-resolution images. It is trained on a mixed-resolution dataset with a sampling strategy favoring lower and higher resolutions. ResAdapter generates higher quality multi-resolution images compared to MultiDiffusion and ElasticDiffusion. It significantly improves fidelity and composition in lower and higher resolution images compared to personalized models, without style domain transfer. ResAdapter is compatible with other modules like ControlNet, IP-Adapter, and LCM-LoRA, and can optimize the generation efficiency of multi-resolution models like ElasticDiffusion. Failure cases are prominent with generic prompts on personalized models, potentially needing prompt correction using a large language model. Future work could explore integrating super-resolution models for faster high-resolution image generation. diffusion models, resolution extrapolation, resolution interpolation, style domain consistency, text-to-image generation
2403.01852 Report PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis Zhengyao Lv, Yuxiang Wei, Wangmeng Zuo, Kwan-Yee K. Wong Recent advancements in large-scale pre-trained text-to-image models have led to remarkable progress in semantic image synthesis. Nevertheless, synthesizing high-quality images with consistent semantics and layout remains a challenge. In this paper, we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. Specifically, we first employ the layout control map to faithfully represent layouts in the feature space. Subsequently, we combine the layout and semantic features in a timestep-adaptive manner to synthesize images with realistic details. During fine-tuning, we propose the Semantic Alignment (SA) loss to further enhance layout alignment. Additionally, we introduce the Layout-Free Prior Preservation (LFP) loss, which leverages unlabeled data to maintain the priors of pre-trained models, thereby improving the visual quality and semantic consistency of synthesized images. Extensive experiments demonstrate that our approach performs favorably in terms of visual quality, semantic consistency, and layout alignment. The source code and model are available at https://github.com/cszy98/PLACE/tree/main. This paper proposes PLACE, an adaptive layout-semantic fusion module, to enhance the quality and layout consistency of images synthesized from semantic maps using pre-trained text-to-image diffusion models. Synthesizing high-quality images with consistent semantics and layout from semantic maps remains challenging for existing text-to-image synthesis models. PLACE leverages a layout control map for accurate layout representation and employs an adaptive fusion module to integrate layout and semantic features during image synthesis. It also introduces a semantic alignment loss and a layout-free prior preservation loss during fine-tuning. PLACE achieves state-of-the-art visual quality and semantic consistency scores on ADE20K and COCO-Stuff datasets. It demonstrates superior performance in synthesizing out-of-distribution images with new objects, styles, and attributes. The proposed layout control map, adaptive fusion module, and loss functions are shown to contribute to the performance improvements through ablation studies. The inference speed of PLACE is still slower than GAN-based methods, limited by the diffusion process. Synthesizing images from long or uncommon prompts might result in inconsistency due to limitations of the pre-trained Stable Diffusion model. semantic image synthesis, layout control, text-to-image synthesis, diffusion models, adaptive fusion
2403.01807 Report ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner 3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior, and learn to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to the existing methods, the results generated by our method are consistent, and have favorable visual quality (-30% FID, -37% KID). The paper proposes a method to generate high-quality, multi-view consistent images of 3D objects in authentic surroundings using pretrained text-to-image diffusion models fine-tuned on real-world multi-view data. This approach bridges the gap between the diversity of text-to-3D methods and the photorealism of diffusion models trained on smaller 3D datasets, enabling the generation of realistic and diverse 3D assets. The method augments the U-Net architecture of pretrained text-to-image models with cross-frame-attention layers and projection layers to encode 3D knowledge and ensure consistency. It employs an autoregressive generation scheme to render images from any viewpoint, enabling novel view synthesis. The method significantly improves FID and KID scores compared to existing multi-view diffusion models, demonstrating higher visual quality and similarity to real images. The generated images are 3D-consistent, allowing for smooth novel view synthesis and enabling the optimization of NeRF or NeuS representations. The method retains the diversity of pretrained text-to-image models, allowing for controllable generation based on text descriptions and combining attributes in novel ways. Slight inconsistencies, like view-dependent lighting and sharpness variations, may occur due to the nature of the real-world training data. The current work focuses on object-level generation; extending it to scene-scale generation is a potential future direction. text-to-3d, diffusion models, multi-view consistency, novel view synthesis, 3d asset generation
2403.01800 Report AtomoVideo: High Fidelity Image-to-Video Generation Litong Gong, Yiran Zhu, Weijie Li, Xiaoyang Kang, Biao Wang, Tiezheng Ge, Bo Zheng Recently, video generation has achieved rapid development based on superior text-to-image generation techniques. In this work, we propose a high fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long sequence prediction through iterative generation. Furthermore, due to the design of adapter training, our approach can be well combined with existing personalized models and controllable modules. By quantitative and qualitative evaluation, AtomoVideo achieves superior results compared to popular methods; more examples can be found on our project website: https://atomo-video.github.io/. Presents AtomoVideo, a high-fidelity image-to-video generation framework that prioritizes fidelity to the input image and generates videos with greater motion intensity while maintaining temporal consistency. Addresses the limitations of existing image-to-video generation methods that struggle to balance fidelity with the input image and generating coherent motion in the video. Leverages a pre-trained text-to-image model, injecting image information at multiple levels: low-level details are concatenated with input noise, while high-level semantics are introduced through cross-attention. Employs zero terminal SNR and v-prediction during training to enhance stability. Achieves state-of-the-art performance on several image-to-video generation benchmarks, demonstrating high fidelity to the input image and superior motion intensity. Demonstrates the flexibility to be combined with personalized text-to-image models, enabling diverse video styles. Extends to long video generation through iterative frame prediction. Slight underperformance in image consistency and video quality compared to commercial methods, potentially due to the use of a fixed base model and resolution limitations. Limited exploration of stylistic variations, focusing primarily on realistic videos. image-to-video generation, diffusion models, video synthesis, high-fidelity generation, temporal consistency
2403.01779 Report OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on Yuhao Xu, Tao Gu, Weifeng Chen, Chengcai Chen We present OOTDiffusion, a novel network architecture for realistic and controllable image-based virtual try-on (VTON). We leverage the power of pretrained latent diffusion models, designing an outfitting UNet to learn the garment detail features. Without a redundant warping process, the garment features are precisely aligned with the target human body via the proposed outfitting fusion in the self-attention layers of the denoising UNet. In order to further enhance the controllability, we introduce outfitting dropout to the training process, which enables us to adjust the strength of the garment features through classifier-free guidance. Our comprehensive experiments on the VITON-HD and Dress Code datasets demonstrate that OOTDiffusion efficiently generates high-quality try-on results for arbitrary human and garment images, which outperforms other VTON methods in both realism and controllability, indicating an impressive breakthrough in virtual try-on. Our source code is available at https://github.com/levihsu/OOTDiffusion. Proposed OOTDiffusion, an LDM-based network architecture with a novel outfitting UNet for realistic and controllable virtual try-on. Image-based virtual try-on (VTON) is vital for e-commerce, but existing methods struggle to balance realism with preserving garment details. Leverages pretrained LDMs for realism, employs an outfitting UNet to learn garment features, uses outfitting fusion to align features, and introduces outfitting dropout for controllable generation. Achieves state-of-the-art performance on VITON-HD and Dress Code datasets, surpassing GAN-based and other LDM-based methods in realism and detail preservation. Demonstrates superior generalization ability in cross-dataset evaluations. Outfitting dropout with classifier-free guidance effectively controls garment feature strength. May not perform well for cross-category virtual try-on due to training on paired data. Minor details in the original human image might be altered after the try-on process. virtual try-on, latent diffusion models, outfitting fusion, classifier-free guidance, image generation
2403.01693 Report HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances Supreeth Narasimhaswamy, Uttaran Bhattacharya, Xiang Chen, Ishita Dasgupta, Saayan Mitra, Minh Hoai Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands. This paper introduces a novel text-to-image generation model that produces images with realistic hand appearances by incorporating SMPL-H parameters. Existing text-to-image generation models often struggle to depict hands accurately. This method aims to address this limitation and enhance the realism of generated images. The model utilizes a two-component system: (1) a diffusion model generating SMPL-H parameters from text prompts, and (2) a text-to-image generation model conditioned on both the text and generated SMPL-H parameters. The model demonstrates superior performance in generating realistic hand appearances compared to baseline models. User studies confirm the plausibility and relevance of the generated hand poses. The method allows for creative control over the generated image by modifying the SMPL-H parameters. The model may face challenges generating complex hand-object interactions due to the lack of object information in the first component. Further investigation is needed to quantitatively evaluate the diversity of generated images. text-to-image generation, hand pose estimation, smpl-h, diffusion models, generative models
2403.01643 Report You Need to Pay Better Attention Mehran Hosseini, Peyman Hosseini We introduce three new attention mechanisms that outperform standard multi-head attention in terms of efficiency and learning capabilities, thereby improving the performance and broader deployability of Transformer models. Our first contribution is Optimised Attention, which performs similarly to standard attention, but has 3/4 as many parameters and one matrix multiplication fewer per head. Next, we introduce Efficient Attention, which performs on par with standard attention with only 1/2 as many parameters and two matrix multiplications fewer per head and is up to twice as fast as standard attention. Lastly, we introduce Super Attention, which surpasses standard attention by a significant margin in both vision and natural language processing tasks while having fewer parameters and matrix multiplications. In addition to providing rigorous mathematical comparisons, we evaluate the presented attention mechanisms on MNIST, CIFAR100, IMDB Movie Reviews, and Amazon Reviews datasets. The paper introduces three novel attention mechanisms: Optimised Attention, Efficient Attention, and Super Attention, designed to improve the efficiency and performance of Transformer models. Large language models, while powerful, pose challenges in terms of computational cost, memory footprint, and deployability on resource-constrained devices. This paper addresses these limitations by optimizing the core attention mechanism. The authors mathematically analyze the standard attention mechanism, identifying redundancies and proposing optimizations based on three key principles: combining consecutive linear transformations, leveraging single-head attention, and introducing kernels between inputs. The proposed mechanisms are evaluated on image classification (MNIST, CIFAR100) and text sentiment analysis (IMDB, Amazon Reviews) tasks. Optimised Attention reduces the attention layer size by 25% and computational cost, performing similarly to standard attention. Efficient Attention, with half the parameters of standard attention, achieves comparable performance while being up to twice as fast. Super Attention outperforms standard attention in both vision and language tasks, while being 25% smaller and up to 45% faster for specific context sizes. The paper primarily focuses on classification tasks due to computational constraints. Future work could explore the application and optimization of these attention mechanisms in more complex tasks beyond classification, such as language generation or object detection. attention mechanism, transformer, efficiency, deep learning, natural language processing
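A minimal sketch of one of the stated principles, combining consecutive linear transformations, in single-head attention: because the attention matrix mixes tokens while the value and output projections mix features, W_V and W_O can be fused into one matrix with identical output, dropping a projection and a matrix multiplication. This illustrates the principle only, not the paper's exact Optimised/Efficient/Super Attention layers.

```python
import torch

def single_head_attention_fused(x, w_q, w_k, w_vo):
    """Single-head attention with the value and output projections fused.

    Standard attention computes (softmax(QK^T / sqrt(d)) (X W_V)) W_O.
    The attention matrix acts on the token dimension and W_V, W_O on the
    feature dimension, so W_V @ W_O can be precomputed into one matrix w_vo,
    saving a projection and a matrix multiplication with identical output.
    x: (B, N, d); w_q, w_k, w_vo: (d, d).
    """
    d = x.shape[-1]
    q, k = x @ w_q, x @ w_k
    attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ (x @ w_vo)

d = 64
x = torch.randn(2, 128, d)
w_q, w_k, w_v, w_o = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
attn = torch.softmax((x @ w_q) @ (x @ w_k).transpose(1, 2) / d ** 0.5, dim=-1)
ref = (attn @ (x @ w_v)) @ w_o
fused = single_head_attention_fused(x, w_q, w_k, w_v @ w_o)
print(torch.allclose(ref, fused, atol=1e-4))  # True
```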
2403.01560 Report Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng Contrastive Language-Image Pretraining (CLIP) has shown remarkable open-vocabulary abilities across various image understanding tasks. Building upon this impressive success, recent pioneer works have proposed to adapt the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Inspired by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: Can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. Our evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. To address this task, our work focuses on a critical challenge, namely scene bias, and we accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to distinguish video representations apart from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experimental results demonstrate the effectiveness of our method. The benchmark and code will be available at https://github.com/KunyuLin/XOV-Action/. This work introduces XOV-Action, a benchmark for cross-domain open-vocabulary action recognition, and proposes SATA, a novel Scene-Aware video-Text Alignment method to improve performance on this task. Generalizing to unseen video domains is crucial for real-world action recognition applications, but existing CLIP-based video learners struggle with domain shifts. XOV-Action benchmark comprises two source datasets for training and four target datasets with various domain gaps for testing. SATA mitigates scene bias by distinguishing video representations from scene-encoded text representations during training, encouraging the model to focus on action information rather than scene details. Existing CLIP-based video learners show limited performance in cross-domain open-vocabulary action recognition. SATA outperforms state-of-the-art methods on XOV-Action by mitigating scene bias effectively. Analysis of SATA components demonstrates the importance of scene-aware losses and text-adaptive aggregation. The current SATA method primarily addresses scene bias, but other factors contributing to domain gaps remain unexplored. Future work will focus on tackling cross-category generalization in cross-domain settings. action recognition, open vocabulary, domain generalization, clip, benchmark
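For the scene-bias idea in the entry above, a hypothetical sketch of what a scene-aware alignment objective could look like: pull video embeddings toward their action-class text embeddings while pushing them away from scene-encoded text embeddings, so the learned features become scene-agnostic. The margin formulation, function name, and embedding shapes are our assumptions; the actual SATA losses and its text-adaptive aggregation are more involved.

```python
import torch
import torch.nn.functional as F

def scene_aware_alignment_loss(video_emb, action_text_emb, scene_text_emb, margin=0.2):
    """Hinge-style objective: the video should be closer to its action text
    than to a text description of the scene it was filmed in."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(action_text_emb, dim=-1)
    s = F.normalize(scene_text_emb, dim=-1)
    pos = (v * a).sum(-1)   # similarity to the correct action-class text
    neg = (v * s).sum(-1)   # similarity to the scene-encoded text
    return F.relu(margin + neg - pos).mean()

# toy usage with random stand-ins for CLIP-style embeddings
v, a, s = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
print(scene_aware_alignment_loss(v, a, s).item())
```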
2403.01444 Report 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos Jiakai Sun, Han Jiao, Guangyuan Li, Zhanjie Zhang, Lei Zhao, Wei Xing Constructing photo-realistic Free-Viewpoint Videos (FVVs) of dynamic scenes from multi-view videos remains a challenging endeavor. Despite the remarkable advancements achieved by current neural rendering techniques, these methods generally require complete video sequences for offline training and are not capable of real-time rendering. To address these constraints, we introduce 3DGStream, a method designed for efficient FVV streaming of real-world dynamic scenes. Our method achieves fast on-the-fly per-frame reconstruction within 12 seconds and real-time rendering at 200 FPS. Specifically, we utilize 3D Gaussians (3DGs) to represent the scene. Instead of the naïve approach of directly optimizing 3DGs per-frame, we employ a compact Neural Transformation Cache (NTC) to model the translations and rotations of 3DGs, markedly reducing the training time and storage required for each FVV frame. Furthermore, we propose an adaptive 3DG addition strategy to handle emerging objects in dynamic scenes. Experiments demonstrate that 3DGStream achieves competitive performance in terms of rendering speed, image quality, training time, and model storage when compared with state-of-the-art methods. This paper proposes 3DGStream, a novel method for efficient free-viewpoint video streaming of dynamic scenes. Constructing photo-realistic free-viewpoint videos of dynamic scenes is crucial for VR/AR/XR applications but remains challenging due to limitations in existing methods that require complete video sequences for training and lack real-time rendering capabilities. The method leverages 3D Gaussians (3DG) for scene representation and employs a two-stage per-frame training pipeline. Stage 1 uses a Neural Transformation Cache (NTC) to efficiently model 3DG transformations, while Stage 2 introduces an adaptive 3DG addition strategy to handle emerging objects. 3DGStream achieves competitive performance in terms of image quality and model storage compared to state-of-the-art methods. The method achieves fast on-the-fly per-frame reconstruction within 12 seconds. 3DGStream enables real-time rendering of free-viewpoint videos at 200 FPS. The quality of the initial frame reconstruction using 3DGs heavily influences the overall performance. The limited number of training iterations for efficiency may restrict the modeling of drastic motions or complex emerging objects. free-viewpoint video, neural rendering, 3d gaussian splatting, dynamic scene reconstruction, real-time rendering
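For the Neural Transformation Cache idea above, a hypothetical sketch: a small network maps (encoded) Gaussian centres to a per-Gaussian translation and rotation for the current frame, so only the compact cache is optimized instead of every Gaussian's parameters. The sinusoidal encoding, layer sizes, and names here are our assumptions; the paper's actual NTC architecture may differ.

```python
import torch
import torch.nn as nn

class NeuralTransformationCache(nn.Module):
    """Hypothetical compact NTC: an MLP over encoded 3D Gaussian centres predicting a
    per-Gaussian translation (3) and rotation as a unit quaternion (4)."""
    def __init__(self, hidden: int = 64, n_freq: int = 4):
        super().__init__()
        self.n_freq = n_freq
        in_dim = 3 + 3 * 2 * n_freq              # xyz plus sinusoidal encoding
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 7))

    def encode(self, x):
        freqs = 2.0 ** torch.arange(self.n_freq, device=x.device) * torch.pi
        ang = x[..., None] * freqs               # [N, 3, n_freq]
        return torch.cat([x, ang.sin().flatten(1), ang.cos().flatten(1)], dim=-1)

    def forward(self, centers: torch.Tensor):
        out = self.mlp(self.encode(centers))
        d_xyz = out[:, :3]
        d_rot = torch.nn.functional.normalize(out[:, 3:], dim=-1)
        return d_xyz, d_rot

ntc = NeuralTransformationCache()
centers = torch.rand(10000, 3)                   # Gaussian centres from the previous frame
d_xyz, d_rot = ntc(centers)
print(d_xyz.shape, d_rot.shape)                  # [10000, 3], [10000, 4]
```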
2403.01427 Report Logit Standardization in Knowledge Distillation Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, Xiaochun Cao Knowledge distillation involves transferring soft labels from a teacher to a student using a shared temperature-based softmax function. However, the assumption of a shared temperature between teacher and student implies a mandatory exact match between their logits in terms of logit range and variance. This side-effect limits the performance of the student, considering the capacity discrepancy between them and the finding that the innate logit relations of the teacher are sufficient for the student to learn. To address this issue, we propose setting the temperature as the weighted standard deviation of the logits and performing a plug-and-play Z-score pre-process of logit standardization before applying softmax and Kullback-Leibler divergence. Our pre-process enables the student to focus on essential logit relations from the teacher rather than requiring a magnitude match, and can improve the performance of existing logit-based distillation methods. We also show a typical case where the conventional setting of sharing temperature between teacher and student cannot reliably yield the authentic distillation evaluation; nonetheless, this challenge is successfully alleviated by our Z-score. We extensively evaluate our method for various student and teacher models on CIFAR-100 and ImageNet, showing its significant superiority. The vanilla knowledge distillation powered by our pre-process can achieve favorable performance against state-of-the-art methods, and other distillation variants can obtain considerable gain with the assistance of our pre-process. This paper proposes a novel knowledge distillation (KD) method that employs a logit z-score standardization process as a pre-processing step before applying softmax and calculating the KL divergence loss. This approach addresses the limitations of conventional KD methods that enforce a strict match in logit magnitude between the teacher and student models. Conventional KD methods, relying on a shared temperature for teacher and student softmax functions, implicitly force an exact match between their logits, neglecting the capacity gap between them and the finding that preserving inter-class relations in logits suffices for effective knowledge transfer. This can hinder the student's performance. The authors first derive the softmax function in KD from the principle of entropy maximization, demonstrating that temperature values can be different for the teacher and student. They then propose a z-score standardization process on the logits before applying softmax, using weighted logit standard deviation as an adaptive temperature. This allows the student to learn the essential logit relations from the teacher without being constrained by magnitude matching. The proposed logit standardization method consistently improves the performance of various existing logit-based KD approaches on CIFAR-100 and ImageNet datasets. Vanilla KD equipped with the proposed pre-processing achieves comparable results to state-of-the-art feature-based KD methods. The method effectively addresses the issue of inauthentic evaluation of student performance caused by shared temperatures in conventional KD pipelines. The pre-processing necessitates a larger weight for the KD loss compared to the cross-entropy loss, potentially requiring further investigation and optimization. Future work includes exploring the application of the proposed logit standardization pre-process in other areas like confidence calibration and uncertainty estimation. knowledge distillation, logit standardization, z-score, softmax temperature, deep neural networks
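The z-score pre-processing described above is small enough to sketch directly. A minimal PyTorch version, assuming per-sample standardization with a base temperature acting as the weighting on the standard deviation (function names are ours; the paper additionally rebalances the KD loss weight against cross-entropy):

```python
import torch
import torch.nn.functional as F

def zscore(logits: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Standardize each sample's logit vector to zero mean and unit variance."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (std + eps)

def kd_loss_with_logit_standardization(student_logits, teacher_logits, base_temperature=2.0):
    """KL-divergence KD loss on z-score standardized logits: teacher and student no
    longer need matching logit ranges, only matching inter-class relations."""
    s = zscore(student_logits) / base_temperature
    t = zscore(teacher_logits) / base_temperature
    log_p_s = F.log_softmax(s, dim=-1)
    p_t = F.softmax(t, dim=-1)
    # batchmean KL, scaled by T^2 as in standard KD
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * base_temperature ** 2

student_logits = torch.randn(8, 100)           # e.g. CIFAR-100 student outputs
teacher_logits = torch.randn(8, 100) * 5 + 3   # teacher with a very different logit range
print(kd_loss_with_logit_standardization(student_logits, teacher_logits).item())
```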
2403.01422 Report MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Gang Yu, Jiayuan Fan, Tao Chen The development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips. However, when it comes to longer formats like movies, they often fall short. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. In the face of these challenges, we propose MovieLLM, a novel framework designed to create synthetic, high-quality data for long videos. This framework leverages the power of GPT-4 and text-to-image models to generate detailed scripts and corresponding visuals. Our approach stands out for its flexibility and scalability, making it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the limitations of existing datasets regarding scarcity and bias. This paper presents MovieLLM, a novel framework for generating synthetic data to improve the understanding of long videos in multimodal models. Current multimodal models struggle with long videos due to the lack of high-quality, diverse, and extensive video data, which is difficult and expensive to collect and annotate. The framework uses GPT-4 to generate detailed movie plots and text-to-image models (guided by textual inversion for style consistency) to produce corresponding keyframes. This data is then used to fine-tune existing long-form video understanding models like LLaMA-VID. MovieLLM generates consistent and high-quality keyframes, outperforming existing multi-concept customization methods in terms of frame consistency, text-image alignment, and image quality. Models trained on MovieLLM's synthetic data show significant performance improvements in both short and long video understanding tasks, including zero-shot question answering and comprehension of video overview, plot, and temporal aspects. A new benchmark for long-form video understanding is proposed based on the MovieNet database and human-generated question-answer pairs. The forgetting issue inherent in LLMs might lead to inconsistencies in the generated frame descriptions and discontinuities in video scenes. Future work will focus on refining the text generation component to address this limitation. multimodal learning, video understanding, synthetic data generation, large language models, text-to-image synthesis
2403.01306 Report ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation Moran Yanuka, Morris Alper, Hadar Averbuch-Elor, Raja Giryes Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract or subjective text. These approaches lack the fine-grained ability to isolate the most concrete samples that provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness, that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: It succeeds in selecting the highest quality samples from multimodal web-scale datasets to allow for efficient training in resource-constrained settings. This paper introduces Image Caption Concreteness (ICC), a novel metric designed to assess the visual concreteness of image captions without relying on image references. Web-scale datasets used for training multimodal models often contain noisy and abstract captions that hinder effective learning. Existing filtering methods struggle to identify these problematic captions while maintaining semantically relevant ones. ICC addresses this challenge by focusing on caption concreteness, a crucial aspect for effective multimodal learning. ICC leverages the capabilities of large foundation models through two autoencoding pipelines: a Visual-Bottleneck Autoencoder (VBA) utilizing a text-to-image model and a captioning model, and a Semantic-Bottleneck Autoencoder (SBA) employing CLIP text embeddings and a large language model. These pipelines' reconstruction scores are then distilled into a smaller language model, enabling efficient ICC score generation for new text. ICC demonstrates superior performance in curating high-quality image-caption pairs from large datasets compared to existing filtering methods, leading to improved performance in downstream tasks like image captioning and representation learning. There is a strong correlation observed between ICC scores and human judgments of concreteness for both single-word and sentence-level texts, highlighting its effectiveness in capturing human intuition about visual concreteness. Combining both VBA and SBA pipelines proves crucial for ICC's effectiveness, as each approach compensates for the other's weaknesses in accurately identifying abstract or concrete captions. ICC may not be sensitive enough to grammatical inconsistencies in captions, potentially assigning high scores to poorly structured but semantically concrete sentences. This limitation could be addressed by training the distillation model on a more diverse range of caption styles. The experiments were conducted on a relatively small dataset due to computational limitations. Future work could explore the impact of scaling up the dataset size and evaluating ICC's efficacy on a wider array of downstream tasks like VQA and caption ranking. multimodal learning, dataset curation, text concreteness, image captioning, representation learning
2403.01212 Report TCIG: Two-Stage Controlled Image Generation with Quality Enhancement through Diffusion Salaheldin Mohamed In recent years, significant progress has been made in the development of text-to-image generation models. However, these models still face limitations when it comes to achieving full controllability during the generation process. Often, specific training or the use of limited models is required, and even then, they have certain restrictions. To address these challenges, a two-stage method that effectively combines controllability and high quality in the generation of images is proposed. This approach leverages the expertise of pre-trained models to achieve precise control over the generated images, while also harnessing the power of diffusion models to achieve state-of-the-art quality. By separating controllability from high quality, this method achieves outstanding results. It is compatible with both latent and image space diffusion models, ensuring versatility and flexibility. Moreover, this approach consistently produces comparable outcomes to the current state-of-the-art methods in the field. Overall, this proposed method represents a significant advancement in text-to-image generation, enabling improved controllability without compromising on the quality of the generated images. This paper presents TCIG, a novel two-stage method for controllable text-to-image generation that leverages pre-trained models (segmentation and diffusion) without requiring training or fine-tuning. Existing text-to-image generation models often lack full controllability, struggling to incorporate user preferences beyond textual prompts. Existing solutions often involve costly training, fine-tuning, or are limited by model architectures. TCIG first generates a controlled image based on input segmentation masks and text using a pre-trained VQGAN guided by a CLIP network and segmentation models. The second stage refines the image for quality and detail using a pre-trained diffusion model (Img-to-Img). TCIG allows flexible and controllable image generation with diverse outputs. Quantitative evaluation on the COCO dataset shows TCIG outperforms existing methods in terms of IoU. Qualitative comparison highlights TCIG's superior adherence to input masks compared to other models. The development of this method was limited by the available GPU computational power. Future work can explore separating control from high-quality image generation further. image generation, controllable generation, text-to-image, diffusion models, segmentation
2403.01124 Report Text-guided Explorable Image Super-resolution Kanchana Vaishnavi Gandikota, Paramanand Chandramouli In this paper, we introduce the problem of zero-shot text-guided exploration of the solutions to open-domain image super-resolution. Our goal is to allow users to explore diverse, semantically accurate reconstructions that preserve data consistency with the low-resolution inputs for different large downsampling factors without explicitly training for these specific degradations. We propose two approaches for zero-shot text-guided super-resolution - i) modifying the generative process of text-to-image (T2I) diffusion models to promote consistency with low-resolution inputs, and ii) incorporating language guidance into zero-shot diffusion-based restoration methods. We show that the proposed approaches result in diverse solutions that match the semantic meaning provided by the text prompt while preserving data consistency with the degraded inputs. We evaluate the proposed baselines for the task of extreme super-resolution and demonstrate advantages in terms of restoration quality, diversity, and explorability of solutions. This paper introduces zero-shot text-guided exploration of solutions for open-domain image super-resolution, enabling users to explore diverse, semantically accurate reconstructions that are consistent with low-resolution inputs using text prompts. This is important because it allows for more intuitive and flexible control over the super-resolution process, especially for high upscaling factors where the problem is ill-posed and has many possible solutions. The authors propose two approaches: 1) modifying the generative process of text-to-image (T2I) diffusion models to promote consistency with low-resolution inputs, and 2) incorporating language guidance into zero-shot diffusion-based restoration methods using CLIP. Text-guided super-resolution methods achieve comparable performance to specialized models trained on faces for neutral prompts. Text-guided methods significantly improve image quality and semantic matching on open-domain images compared to unconditional diffusion models. User studies show that Imagen-DDNM and unCLIP-DDNM produce more realistic and semantically consistent results compared to CLIP-guided DDNM. Generating realistic images consistently can be challenging, requiring multiple attempts to achieve the desired output. The performance of all methods depends on the generative capabilities of the pre-trained generative model and inherits its biases. image super-resolution, text-guided image generation, diffusion models, zero-shot learning, clip
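For approach (i) above, a minimal sketch of the kind of data-consistency step used by DDNM-style methods (DDNM is named in the results): replace the range-space content of the diffusion model's current clean-image estimate so that downsampling it exactly reproduces the low-resolution observation, while keeping the generator's null-space content. The average-pooling degradation and 16x factor are assumptions for the example; the paper's actual operators and guidance schedule differ.

```python
import torch
import torch.nn.functional as F

def degrade(hr: torch.Tensor, scale: int) -> torch.Tensor:
    """A: average-pool downsampling used as the known degradation operator."""
    return F.avg_pool2d(hr, kernel_size=scale)

def pseudo_inverse(lr: torch.Tensor, scale: int) -> torch.Tensor:
    """A^+: for average pooling, the pseudo-inverse is nearest-neighbour replication."""
    return F.interpolate(lr, scale_factor=scale, mode="nearest")

def data_consistency(x0_pred: torch.Tensor, lr: torch.Tensor, scale: int) -> torch.Tensor:
    """DDNM-style projection  x <- A^+ y + (I - A^+ A) x0_pred :
    the result downsamples exactly to the observation y = lr."""
    return pseudo_inverse(lr, scale) + x0_pred - pseudo_inverse(degrade(x0_pred, scale), scale)

hr_guess = torch.rand(1, 3, 256, 256)   # e.g. the x0 estimate at some diffusion step
lr_input = torch.rand(1, 3, 16, 16)     # 16x downsampled observation
x0_fixed = data_consistency(hr_guess, lr_input, scale=16)
print(torch.allclose(degrade(x0_fixed, 16), lr_input, atol=1e-5))  # data-consistent
```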
2403.00939 Report G3DR: Generative 3D Reconstruction in ImageNet Pradyumna Reddy, Ismail Elezi, Jiankang Deng We introduce a novel 3D generative method, Generative 3D Reconstruction (G3DR) in ImageNet, capable of generating diverse and high-quality 3D objects from single images, addressing the limitations of existing methods. At the heart of our framework is a novel depth regularization technique that enables the generation of scenes with high geometric fidelity. G3DR also leverages a pretrained language-vision model, such as CLIP, to enable reconstruction in novel views and improve the visual realism of generations. Additionally, G3DR designs a simple but effective sampling procedure to further improve the quality of generations. G3DR offers diverse and efficient 3D asset generation based on class or text conditioning. Despite its simplicity, G3DR is able to beat state-of-the-art methods, improving over them by up to 22% in perceptual metrics and 90% in geometry scores, while needing only half of the training time. Code is available at https://github.com/preddy5/G3DR Introduces G3DR, a novel 3D generative method that generates diverse and high-quality 3D objects from single images in ImageNet, addressing limitations of existing methods. 3D asset generation is crucial for various applications, and G3DR enables this from diverse, unaligned datasets like ImageNet. Combines a latent diffusion model with a conditional triplane generator and a novel depth regularization technique to ensure geometric fidelity and improve visual realism. Achieves state-of-the-art results on ImageNet, improving FID score by 22% and Inception Score by 21.5% over previous methods. Significantly outperforms competing methods in geometry evaluation, almost doubling the Non-Flatness Score and achieving better depth accuracy. Demonstrates strong performance in fine-grained datasets and generalizes well to text-conditioned generation and out-of-domain examples. Reliance on pseudo-ground truth depth maps from an off-the-shelf estimator may limit geometry accuracy. Exploring alternative novel view supervision methods beyond CLIP could further enhance generation quality. 3d generation, imagenet, depth regularization, single-view reconstruction, generative models
2403.00835 Report CLLMs: Consistency Large Language Models Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference as it breaks the sequential nature of the LLM decoding process and transforms it into parallelizable computation. However, in practice, it achieves little speedup compared to traditional autoregressive (AR) decoding, primarily because Jacobi decoding seldom accurately predicts more than one token in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4$\times$ to 3.4$\times$ improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks. This paper proposes Consistency Large Language Models (CLLMs), a new method to adapt large language models (LLMs) for fast parallel decoding with Jacobi iteration. Traditional autoregressive decoding in LLMs leads to high latency, especially for lengthy responses. Existing solutions often require additional models or architectural changes with significant overhead. CLLMs addresses these limitations to achieve significant speedup with minimal performance degradation and without extra models or architectural modifications. CLLMs are trained on Jacobi trajectory datasets collected from a target LLM, employing a consistency loss to encourage the prediction of multiple correct tokens in each Jacobi iteration. This approach is inspired by consistency models used for accelerating diffusion models. CLLMs achieve 2.4x to 3.4x speedup with Jacobi decoding on various tasks, including GSM8K, CodeSearchNet Python, Spider, and MT-bench. The acceleration is attributed to the *fast-forwarding* phenomenon, where CLLMs can predict multiple consecutive tokens correctly in a single iteration, and the emergence of *stationary tokens*, which are predicted correctly early on and remain unchanged despite preceding incorrect tokens. CLLMs show advantages over existing methods like speculative decoding and Medusa with higher adaptability to existing LLMs and lower memory consumption. The efficiency of CLLMs heavily relies on the quality and size of the Jacobi trajectory dataset. Current training procedure of CLLMs introduces extra overhead for collecting Jacobi trajectory dataset. large language models, efficient inference, parallel decoding, jacobi iteration, consistency models
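The Jacobi decoding loop that CLLMs accelerate is simple enough to sketch. The version below, with a toy causal LM standing in for a real model, updates every position of a draft block in parallel from its (possibly stale) left context until a fixed point is reached; a CLLM is fine-tuned so this loop converges in far fewer iterations. The function and class names are ours.

```python
import torch

@torch.no_grad()
def jacobi_decode(model, prefix_ids: torch.Tensor, n_new: int, max_iters: int = 50,
                  pad_id: int = 0):
    """Fixed-point (Jacobi) decoding of an n_new-token block in parallel.
    `model(ids)` is assumed to return logits of shape [1, len, vocab] for a causal LM.
    The fixed point of these greedy parallel updates matches greedy autoregressive decoding."""
    draft = torch.full((1, n_new), pad_id, dtype=torch.long, device=prefix_ids.device)
    for _ in range(max_iters):
        ids = torch.cat([prefix_ids, draft], dim=1)
        logits = model(ids)                                   # one parallel forward pass
        # token at position prefix+i is predicted from logits at position prefix+i-1
        preds = logits[:, prefix_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(preds, draft):                         # fixed point reached
            break
        draft = preds
    return draft

class ToyLM(torch.nn.Module):
    """Toy causal 'LM': the next token is (current token + 1) mod vocab, as one-hot logits."""
    def __init__(self, vocab: int = 16):
        super().__init__()
        self.vocab = vocab
    def forward(self, ids):
        logits = torch.zeros(ids.shape[0], ids.shape[1], self.vocab)
        logits.scatter_(-1, ((ids + 1) % self.vocab).unsqueeze(-1), 1.0)
        return logits

print(jacobi_decode(ToyLM(), torch.tensor([[3]]), n_new=5))   # -> [[4, 5, 6, 7, 8]]
```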
2403.00818 Report DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models Wei He, Kai Han, Yehui Tang, Chengcheng Wang, Yujie Yang, Tianyu Guo, Yunhe Wang Large language models (LLMs) face a daunting challenge due to the excessive computational and memory requirements of the commonly used Transformer architecture. While state space models (SSMs) are a new type of foundational network architecture offering lower computational complexity, their performance has yet to fully rival that of Transformers. This paper introduces DenseSSM, a novel approach to enhance the flow of hidden information between layers in SSMs. By selectively integrating shallow-layer hidden states into deeper layers, DenseSSM retains fine-grained information crucial for the final output. DenseSSM enhanced with dense connections still maintains the training parallelizability and inference efficiency. The proposed method can be widely applicable to various SSM types like RetNet and Mamba. With similar model size, DenseSSM achieves significant improvements, exemplified by DenseRetNet outperforming the original RetNet with up to 5% accuracy improvement on public benchmarks. Code is available at https://github.com/WailordHe/DenseSSM This paper introduces DenseSSM, a novel approach for enhancing state space models (SSMs) by selectively integrating hidden states from shallow layers into deeper layers to improve information flow and retain fine-grained information. Large language models (LLMs) based on Transformers face challenges with computational and memory demands. While SSMs offer lower complexity, their performance hasn't matched Transformers. DenseSSM aims to bridge this performance gap by enhancing information flow in SSMs. DenseSSM addresses hidden state degradation in deeper SSM layers by: 1) Collecting and projecting shallow layer hidden states to the target layer's subspace using a selective transition module. 2) Fusing these projected hidden states with the target layer's hidden state using a hidden fusion module. DenseSSM significantly improves the performance of various SSM architectures like RetNet and Mamba. DenseRetNet, based on DenseSSM, outperforms the original RetNet by up to 5% accuracy on public benchmarks. DenseSSM maintains the training parallelizability and inference efficiency of SSMs while achieving these improvements. The paper primarily focuses on evaluating DenseSSM on language modeling tasks, leaving exploration in other domains for future work. Further investigation into different implementations of the selective transition and hidden fusion modules could yield additional performance gains. state space models, large language models, deep learning, natural language processing, dense connections
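For the selective transition and hidden fusion modules mentioned above, a hypothetical sketch of one possible realization: each shallow layer's hidden state is projected into the current layer's subspace, selectively gated, and added to the current hidden state. Module names, the gating form, and the additive fusion are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DenseHiddenFusion(nn.Module):
    """Hypothetical dense hidden connection: project shallow-layer hidden states into
    the current layer's subspace, gate them, and fuse them with the current state."""
    def __init__(self, d_model: int, n_prev_layers: int):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d_model, d_model, bias=False)
                                  for _ in range(n_prev_layers))
        self.gate = nn.ModuleList(nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU())
                                  for _ in range(n_prev_layers))

    def forward(self, h_current, h_shallow):
        fused = h_current
        for proj, gate, h in zip(self.proj, self.gate, h_shallow):
            fused = fused + gate(h) * proj(h)   # selective transition + additive fusion
        return fused

# toy usage: fuse hidden states from the two preceding layers into layer 3
B, T, D = 2, 8, 64
fusion = DenseHiddenFusion(D, n_prev_layers=2)
out = fusion(torch.randn(B, T, D), [torch.randn(B, T, D), torch.randn(B, T, D)])
print(out.shape)  # torch.Size([2, 8, 64])
```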
2403.00762 Report Point Cloud Mamba: Point Cloud Learning via State Space Model Tao Zhang, Xiangtai Li, Haobo Yuan, Shunping Ji, Shuicheng Yan In this work, for the first time, we demonstrate that Mamba-based point cloud methods can outperform point-based methods. Mamba exhibits strong global modeling capabilities and linear computational complexity, making it highly attractive for point cloud analysis. To enable more effective processing of 3-D point cloud data by Mamba, we propose a novel Consistent Traverse Serialization to convert point clouds into 1-D point sequences while ensuring that neighboring points in the sequence are also spatially adjacent. Consistent Traverse Serialization yields six variants by permuting the order of x, y, and z coordinates, and the synergistic use of these variants aids Mamba in comprehensively observing point cloud data. Furthermore, to assist Mamba in handling point sequences with different orders more effectively, we introduce point prompts to inform Mamba of the sequence's arrangement rules. Finally, we propose positional encoding based on spatial coordinate mapping to inject positional information into point cloud sequences better. Based on these improvements, we construct a point cloud network named Point Cloud Mamba, which combines local and global modeling. Point Cloud Mamba surpasses the SOTA point-based method PointNeXt and achieves new SOTA performance on the ScanObjectNN, ModelNet40, and ShapeNetPart datasets. This paper introduces Point Cloud Mamba (PCM), a novel framework for point cloud learning that leverages the strengths of state space models, specifically Mamba, to achieve global feature modeling with linear computational complexity. While state space models like Mamba have shown promise in sequence modeling, their application to 3D point cloud analysis remained unexplored. This paper bridges that gap by demonstrating the effectiveness of Mamba for point cloud tasks, offering a compelling alternative to computationally expensive Transformer-based methods. PCM employs a novel Consistent Traverse Serialization (CTS) method to convert 3D point clouds into 1D sequences suitable for Mamba. It introduces 'order prompts' to help Mamba discern the arrangement of points in sequences generated by different CTS variants and utilizes a spatial coordinate mapping-based positional encoding scheme. The overall architecture combines local geometric feature extraction with global modeling using Mamba layers. PCM surpasses the state-of-the-art point-based method PointNeXt on ScanObjectNN, ModelNet40, and ShapeNetPart datasets. The Consistent Traverse Serialization strategy, combined with multiple serialization orders, is shown to be crucial for capturing spatial relationships within point clouds. Order prompts and spatial coordinate mapping-based positional encoding significantly contribute to PCM's performance. For large-scale point clouds, the scan-based training of Mamba limits its applicability, necessitating point cloud cropping and creating a discrepancy between training and inference. The throughput of PCM, while showing promise, is currently lower than PointMLP due to the computational overhead of multiple reorderings. 3d point cloud, state space models, mamba, point cloud classification, point cloud segmentation
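For the serialization step described above, a rough approximation of the idea: quantize point coordinates to a grid and sort lexicographically under a chosen axis permutation, with the six permutations of (x, y, z) giving the six sequence variants. Note that the paper's Consistent Traverse Serialization additionally guarantees that consecutive sequence positions are spatially adjacent (e.g. via snake-like traversal), which plain sorting only approximates; function names and the grid size are ours.

```python
import torch
from itertools import permutations

def serialize(points: torch.Tensor, order, grid_size: int = 64) -> torch.Tensor:
    """Serialize an [N, 3] point cloud into a 1-D sequence: quantize to a grid and
    sort lexicographically in the given axis order (e.g. (0, 1, 2) = x-major)."""
    mins, maxs = points.min(0).values, points.max(0).values
    voxels = ((points - mins) / (maxs - mins + 1e-9) * (grid_size - 1)).long()
    key = (voxels[:, order[0]] * grid_size + voxels[:, order[1]]) * grid_size + voxels[:, order[2]]
    return points[key.argsort()]

pts = torch.rand(1024, 3)
variants = [serialize(pts, o) for o in permutations((0, 1, 2))]   # the six axis orders
print(len(variants), variants[0].shape)   # 6 torch.Size([1024, 3])
```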
2403.00729 Report Can Transformers Capture Spatial Relations between Objects? Chuan Wen, Dinesh Jayaraman, Yang Gao Spatial relationships between objects represent key scene information for humans to understand and interact with the world. To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task, and evaluating key design principles. We identify a simple "RelatiViT" architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. The code and datasets are available in \url{https://sites.google.com/view/spatial-relation}. This paper introduces RelatiViT, a novel transformer-based architecture designed for precise and physically grounded spatial relation prediction in computer vision. Recognizing spatial relations between objects is crucial for scene understanding and robot manipulation, but existing methods struggle to surpass naive bounding-box-based baselines. The authors systematically explore different transformer designs, focusing on feature extraction, query localization, context aggregation, and pair interaction. They benchmark these designs on Rel3D and a refined version of SpatialSense, called SpatialSense+, featuring precise relation definitions and annotations. RelatiViT significantly outperforms all existing methods, including naive baselines and adapted visual relation detection models. RelatiViT effectively leverages visual information, outperforming baselines on relations requiring depth, pose, and shape understanding. Ablation studies confirm the importance of feature extraction, context aggregation, and pair interaction in RelatiViT's performance. The current study primarily focuses on pairwise relations and doesn't explicitly address higher-order relationships. Future work could explore incorporating depth information or 3D object representations to further improve performance. spatial relation prediction, vision transformer, computer vision, scene understanding, benchmarking
2403.00712 Report Rethinking Inductive Biases for Surface Normal Estimation Gwangbin Bae, Andrew J. Davison Despite the growing demand for accurate surface normal estimation models, existing methods use general-purpose dense prediction models, adopting the same inductive biases as other tasks. In this paper, we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp - yet, piecewise smooth - predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model, our method shows a stronger generalization ability, despite being trained on an orders of magnitude smaller dataset. The code is available at https://github.com/baegwangbin/DSINE. This paper introduces a new method for single-image surface normal estimation that encodes per-pixel ray direction and models the pairwise relative rotation between nearby pixels. Existing surface normal estimation models rely on general-purpose dense prediction models, neglecting task-specific inductive biases. This limits accuracy, especially for images with out-of-distribution camera intrinsics. The proposed method encodes camera intrinsics via ray direction and utilizes a ray direction-based activation function for visibility. It recasts normal estimation as rotation estimation, learning relative rotations between neighboring pixels for piecewise smooth predictions. The method demonstrates strong generalization ability, outperforming state-of-the-art methods on challenging datasets with diverse scenes and camera intrinsics. It achieves high sample efficiency, requiring an order of magnitude smaller dataset than competing methods. Qualitative results showcase superior detail and sharpness, particularly near object boundaries. The method assumes prior knowledge of camera intrinsics, limiting its applicability to images without such information. Future work explores joint estimation of camera intrinsics and surface normals for improved generalization to in-the-wild images. surface normal estimation, inductive bias, ray direction, rotation estimation, piecewise smooth
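For the per-pixel ray direction input mentioned above, a minimal sketch of how such a map is computed from camera intrinsics: unproject each pixel centre with K^{-1} and normalize. How the paper then injects the rays (the ray-direction-based activation) involves more than this; the function name and example intrinsics are ours.

```python
import torch

def pixel_ray_directions(K: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Per-pixel unit ray directions in the camera frame from a 3x3 intrinsics matrix K.
    The rays make the normal predictor camera-aware and bound the visible normals
    (a visible surface normal n satisfies n . ray < 0)."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)   # [H, W, 3]
    rays = pix @ torch.linalg.inv(K).T                                  # K^{-1} [u, v, 1]^T
    return rays / rays.norm(dim=-1, keepdim=True)

K = torch.tensor([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
rays = pixel_ray_directions(K, H=480, W=640)
print(rays.shape, rays[240, 320])   # centre pixel ray is approximately (0, 0, 1)
```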
2403.00644 Report Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks Yuhao Liu, Zhanghan Ke, Fang Liu, Nanxuan Zhao, Rynson W. H. Lau Diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. However, due to the randomness in the diffusion process, they often struggle with handling diverse low-level tasks that require details preservation. To overcome this limitation, we present a new Diff-Plugin framework to enable a single pre-trained diffusion model to generate high-fidelity results across a variety of low-level tasks. Specifically, we first propose a lightweight Task-Plugin module with a dual branch design to provide task-specific priors, guiding the diffusion process in preserving image content. We then propose a Plugin-Selector that can automatically select different Task-Plugins based on the text instruction, allowing users to edit images by indicating multiple low-level tasks with natural language. We conduct extensive experiments on 8 low-level vision tasks. The results demonstrate the superiority of Diff-Plugin over existing methods, particularly in real-world scenarios. Our ablations further validate that Diff-Plugin is stable, schedulable, and supports robust training across different dataset sizes. This paper proposes Diff-Plugin, a novel framework that enhances pre-trained diffusion models for handling various low-level vision tasks requiring stringent detail preservation. Existing diffusion models struggle with detail preservation in low-level vision tasks due to the randomness in the diffusion process. Diff-Plugin addresses this by incorporating task-specific priors without retraining the entire model. Diff-Plugin consists of a lightweight, dual-branch Task-Plugin module to inject task-specific priors into the diffusion process and a Plugin-Selector that allows users to choose the desired Task-Plugin via text input. Diff-Plugin demonstrates superior performance over existing diffusion and regression-based methods, particularly in real-world scenarios. The framework is flexible and scalable, adapting to new tasks and datasets without affecting existing trained plugins. User studies confirm a preference for Diff-Plugin's output quality and content consistency. A current limitation is the inability to perform local editing. Future work will explore integrating LLMs for region-specific task application. diffusion models, low-level vision, image editing, task-specific priors, text-driven editing
2403.00587 Report Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset Ander Salaberria, Gorka Azkune, Oier Lopez de Lacalle, Aitor Soroa, Eneko Agirre, Frank Keller Existing work has observed that current text-to-image systems do not accurately reflect explicit spatial relations between objects such as 'left of' or 'below'. We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models. We propose an automatic method that, given existing images, generates synthetic captions that contain 14 explicit spatial relations. We introduce the Spatial Relation for Generation (SR4G) dataset, which contains 9.9 million image-caption pairs for training, and more than 60 thousand captions for evaluation. In order to test generalization we also provide an 'unseen' split, where the set of objects in the train and test captions are disjoint. SR4G is the first dataset that can be used to spatially fine-tune text-to-image systems. We show that fine-tuning two different Stable Diffusion models (denoted as SD_SR4G) yields improvements of up to 9 points in the VISOR metric. The improvement holds in the 'unseen' split, showing that SD_SR4G is able to generalize to unseen objects. SD_SR4G improves the state-of-the-art with fewer parameters, and avoids complex architectures. Our analysis shows that improvement is consistent for all relations. The dataset and the code will be publicly available. This paper introduces SR4G, a new synthetic dataset for training and evaluating the ability of text-to-image models to understand and generate images from textual descriptions containing explicit spatial relations. Current text-to-image systems struggle to accurately represent explicit spatial relations, limiting their use in applications like text-based image editing. This is mainly because training datasets lack captions with explicit spatial relations. SR4G leverages object annotations from the COCO dataset and heuristic rules to automatically generate synthetic captions containing 14 explicit spatial relations, paired with real images. Fine-tuning Stable Diffusion models on SR4G leads to significant improvements in spatial relation understanding, as measured by the VISOR metric. The fine-tuned models generalize to unseen objects, indicating a deeper understanding of spatial relations beyond object-specific correlations. SR4G enables fine-tuned Stable Diffusion models to outperform larger and more complex state-of-the-art pipeline models in spatial relation generation. SR4G currently only supports English captions, limiting its applicability to other languages. The dataset focuses on unambiguous spatial relations defined over bounding box information, excluding orientation and 3D relations. text-to-image generation, spatial relations, synthetic datasets, stable diffusion, computer vision
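The heuristic captioning idea above is easy to illustrate. A toy version for COCO-style boxes, covering only four of the fourteen relations and a much cruder rule than the dataset's (which also considers overlap, size, and distance); all names are ours.

```python
def spatial_caption(box_a, box_b, label_a: str, label_b: str) -> str:
    """Derive one explicit spatial relation from two (x, y, w, h) boxes
    (image origin at the top-left) and turn it into a synthetic caption."""
    cxa, cya = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    cxb, cyb = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    dx, dy = cxa - cxb, cya - cyb
    if abs(dx) >= abs(dy):
        relation = "to the left of" if dx < 0 else "to the right of"
    else:
        relation = "above" if dy < 0 else "below"
    return f"a {label_a} {relation} a {label_b}"

# e.g. a dog whose box centre lies left of a car's box centre
print(spatial_caption((50, 200, 100, 80), (400, 210, 150, 90), "dog", "car"))
# -> "a dog to the left of a car"
```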
2403.00522 Report VisionLLaMA: A Unified LLaMA Interface for Vision Tasks Xiangxiang Chu, Jianlin Su, Bo Zhang, Chunhua Shen Large language models are built on top of a transformer-based architecture to process textual inputs. For example, LLaMA stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms in a good portion of downstream tasks of image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding. Our code will be released at https://github.com/Meituan-AutoML/VisionLLaMA. The paper proposes VisionLLaMA, a vision transformer architecture inspired by the LLaMA architecture for large language models, aiming to bridge the architectural gap between vision and language modalities. The success of LLaMA in NLP motivates the exploration of a similar architecture for vision, potentially enabling unified architectures and shared deployment techniques for both modalities. The paper adapts the LLaMA architecture to process 2D images, investigates plain and pyramid transformer variants, and introduces AS2DRoPE (auto-scaled 2D RoPE) to handle variable input resolutions. VisionLLaMA significantly outperforms DiT and SiT, state-of-the-art vision transformers for image generation, across various model sizes and evaluation metrics. In image classification, VisionLLaMA achieves competitive performance compared to DeiT3 and Twins under both supervised and self-supervised training settings. VisionLLaMA demonstrates superiority in downstream tasks like semantic segmentation (ADE20K) and object detection (COCO), outperforming Swin and Twins in terms of mIoU and mAP. The paper primarily focuses on square image inputs, leaving the exploration for arbitrary aspect ratios as future work. Further investigation into combining VisionLLaMA with modality-specific components like PEG is needed to maximize its potential. vision transformer, llama, image generation, image classification, positional encoding
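For the AS2DRoPE idea above, a sketch of the common 2D extension of rotary embeddings: half the channels are rotated by the row index and half by the column index, with indices rescaled toward the resolution seen at pre-training (the "auto-scaled" part). The base resolution, scaling rule, and function names are our assumptions; the exact AS2DRoPE formulation in the paper may differ.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary embedding over the last dim of x ([..., N, D], D even),
    using per-token positions pos ([N])."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # [D/2]
    ang = pos[:, None] * theta[None, :]                                 # [N, D/2]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def as2drope(x: torch.Tensor, h: int, w: int, base_hw: int = 14) -> torch.Tensor:
    """2D RoPE: rotate one channel half by the (scaled) row index, the other by the
    (scaled) column index; scaling keeps positions in the pre-training range."""
    rows = torch.arange(h, dtype=torch.float32).repeat_interleave(w) * (base_hw / h)
    cols = torch.arange(w, dtype=torch.float32).repeat(h) * (base_hw / w)
    d = x.shape[-1]
    return torch.cat([rope_1d(x[..., : d // 2], rows),
                      rope_1d(x[..., d // 2 :], cols)], dim=-1)

q = torch.randn(1, 28 * 28, 64)          # queries for a 28x28 token grid
print(as2drope(q, h=28, w=28).shape)     # torch.Size([1, 784, 64])
```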
2403.00483 Report RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization Mengqi Huang, Zhendong Mao, Mingcong Liu, Qian He, Yongdong Zhang Text-to-image customization, which aims to synthesize text-driven images for the given subjects, has recently revolutionized content creation. Existing works follow the pseudo-word paradigm, i.e., represent the given subjects as pseudo-words and then compose them with the given text. However, the inherent entangled influence scope of pseudo-words with the given text results in a dual-optimum paradox, i.e., the similarity of the given subjects and the controllability of the given text could not be optimal simultaneously. We present RealCustom that, for the first time, disentangles similarity from controllability by precisely limiting subject influence to relevant parts only, achieved by gradually narrowing real text word from its general connotation to the specific subject and using its cross-attention to distinguish relevance. Specifically, RealCustom introduces a novel "train-inference" decoupled framework: (1) during training, RealCustom learns general alignment between visual conditions to original textual conditions by a novel adaptive scoring module to adaptively modulate influence quantity; (2) during inference, a novel adaptive mask guidance strategy is proposed to iteratively update the influence scope and influence quantity of the given subjects to gradually narrow the generation of the real text word. Comprehensive experiments demonstrate the superior real-time customization ability of RealCustom in the open domain, achieving both unprecedented similarity of the given subjects and controllability of the given text for the first time. The project page is https://corleone-huang.github.io/realcustom/. This paper presents RealCustom, a novel text-to-image customization paradigm that disentangles subject similarity from text controllability by limiting subject influence to relevant image regions. Existing pseudo-word based customization methods suffer from a dual-optimum paradox where optimizing for subject similarity often degrades controllability of the text prompt, and vice versa. This limits their ability to achieve high-quality customization in real-time and open-domain scenarios. RealCustom introduces a train-inference decoupled framework. During training, an adaptive scoring module learns general alignment between visual and textual conditions. During inference, an adaptive mask guidance strategy progressively narrows down the generation of a real text word (e.g., "toy") to the specific given subject by iteratively updating its influence scope and quantity based on cross-attention. RealCustom achieves superior simultaneous similarity and controllability compared to state-of-the-art methods. The method enables real-time open-domain customization, generalizing to any given subject without requiring training on specific object datasets. RealCustom exhibits high-quality generation with better aesthetics compared to existing methods. The influence scope of the given subject is limited to the top-k attention region of a single real word, which could be further improved. RealCustom focuses on the single subject customization. Extending it to multiple subjects is an interesting future direction. text-to-image customization, generative models, diffusion models, cross-attention, open-domain customization
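For the influence-scope idea above, a small sketch of how a subject mask could be derived from cross-attention: keep the top-k most-attended spatial positions of the narrowed real text word and use the resulting binary mask to restrict where the subject's visual condition applies. The top-k ratio, tensor shapes, and function name are our assumptions; RealCustom's adaptive mask guidance also iteratively updates the influence quantity.

```python
import torch

def adaptive_subject_mask(cross_attn: torch.Tensor, word_index: int,
                          top_k_ratio: float = 0.25) -> torch.Tensor:
    """cross_attn: [H*W, n_text_tokens] cross-attention map from one diffusion step.
    Returns a binary [H*W] mask selecting the positions most attended by the
    narrowed real word (e.g. "toy")."""
    attn = cross_attn[..., word_index]
    k = max(1, int(top_k_ratio * attn.numel()))
    mask = torch.zeros_like(attn)
    mask[attn.topk(k).indices] = 1.0
    return mask

attn_maps = torch.rand(64 * 64, 77)   # toy example: 64x64 latent, 77 text tokens
print(adaptive_subject_mask(attn_maps, word_index=5).sum())  # 1024 positions kept (25%)
```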
2403.00459 Report Deformable One-shot Face Stylization via DINO Semantic Guidance Yang Zhou, Zichong Chen, Hui Huang This paper addresses the complex issue of one-shot face stylization, focusing on the simultaneous consideration of appearance and structure, where previous methods have fallen short. We explore deformation-aware face stylization that diverges from traditional single-image style reference, opting for a real-style image pair instead. The cornerstone of our method is the utilization of a self-supervised vision transformer, specifically DINO-ViT, to establish a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformers (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space, and ii) a relative structural consistency constraint based on DINO token self-similarities, ensuring diverse generation. Additionally, style-mixing is employed to align the color generation with the reference, minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization, achieving notable efficiency with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate our superiority over state-of-the-art one-shot face stylization methods. Code is available at https://github.com/zichongc/DoesFS This paper introduces a novel deformable one-shot face stylization framework that leverages DINO semantic guidance to achieve both appearance and structure stylization using a single real-style image pair. Existing one-shot face stylization methods primarily focus on appearance transfer and struggle to accurately capture and reproduce structural deformations present in artistic styles, especially those with exaggerated features. The method uses a deformation-aware StyleGAN generator augmented with spatial transformers (STN). It employs DINO feature representations to guide the stylization process with two novel constraints: a directional deformation loss to regulate structural changes and a relative structural consistency constraint to preserve diversity. Style mixing is also employed for color alignment. The method generates high-quality stylized faces with convincing structural deformations, outperforming existing one-shot methods qualitatively and quantitatively. DINO features prove effective in capturing consistent semantic representations across real and stylized face domains. The framework allows for controllable facial deformation through interpolation of the STN warping fields. The reliance on existing generative models to produce paired training data may limit the framework's ability to learn from real-world style examples. Further exploration of DINO features could lead to improved disentanglement of appearance and structure, potentially enhancing stylization control. face stylization, one-shot learning, deformation-aware, dino, stylegan
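For the relative structural consistency constraint above, a hypothetical sketch of the usual way DINO token self-similarities are used as a structure signal: compare the cosine self-similarity matrices of source and generated face tokens, which ties structure without forcing appearance to match. The loss form and names are our assumptions; the paper's exact constraint (and its directional deformation loss) may differ.

```python
import torch
import torch.nn.functional as F

def token_self_similarity(tokens: torch.Tensor) -> torch.Tensor:
    """Cosine self-similarity matrix of ViT patch tokens ([N, D] -> [N, N])."""
    t = F.normalize(tokens, dim=-1)
    return t @ t.T

def relative_structural_loss(tokens_src: torch.Tensor, tokens_gen: torch.Tensor) -> torch.Tensor:
    """Penalize differences between the self-similarity patterns of the source face
    and the generated face, preserving structure independently of appearance."""
    return F.mse_loss(token_self_similarity(tokens_gen), token_self_similarity(tokens_src))

# toy usage with random stand-ins for DINO-ViT patch tokens (196 tokens of dim 768)
src, gen = torch.randn(196, 768), torch.randn(196, 768)
print(relative_structural_loss(src, gen).item())
```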
2403.00437 Report LoMOE: Localized Multi-Object Editing via Multi-Diffusion Goirik Chakrabarty, Aditya Chandrasekar, Ramya Hebbalaguppe, Prathosh AP Recent developments in the field of diffusion models have demonstrated an exceptional capacity to generate high-quality prompt-conditioned image edits. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing single/multiple objects. We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process to overcome this challenge. This framework empowers users to perform various operations on objects within an image, such as adding, replacing, or editing many objects in a complex scene in one pass. Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts compared to the current methods. We also curate and release a dataset dedicated to multi-object editing, named LoMOE-Bench. Our experiments against existing state-of-the-art methods demonstrate the improved effectiveness of our approach in terms of both image editing quality and inference speed. Introduces LoMOE, a zero-shot framework for localized multi-object editing using diffusion models, enabling simultaneous edits to multiple objects in an image guided by masks and text prompts. Addresses the limitations of existing text-based image editing methods that struggle with precise localized edits, particularly in complex scenes with multiple objects. Leverages a multi-diffusion process with cross-attention matching and background preservation losses to guide the editing process within specified regions while maintaining structural consistency and background fidelity. Introduces LoMOE-Bench, a dataset for multi-object editing. Achieves superior performance in neural image quality metrics compared to baseline methods, demonstrating realistic and faithful edits. Enables single-pass multi-object editing, leading to significantly faster inference times compared to iterative approaches. Demonstrates the effectiveness of cross-attention and background preservation losses in achieving a balance between realism and faithfulness in edits. Faces challenges in realism and blending for certain edits, suggesting avenues for improving fidelity and object integration. Currently unable to handle object deletion or swapping within an image, presenting opportunities for future research. image editing, diffusion models, multi-object editing, localized image editing, generative ai
2402.19481 Report DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, Song Han Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive latency for interactive applications. In this paper, we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However, naively implementing such an algorithm breaks the interaction between patches and loses fidelity, while incorporating such an interaction will incur tremendous communication overhead. To overcome this dilemma, we observe the high similarity between the input from adjacent diffusion steps and propose displaced patch parallelism, which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Therefore, our method supports asynchronous communication, which can be pipelined by computation. Extensive experiments show that our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one. Our code is publicly available at https://github.com/mit-han-lab/distrifuser. Introduces DistriFusion, a training-free algorithm leveraging multiple GPUs to accelerate diffusion model inference without sacrificing image quality. Generating high-resolution images with diffusion models is computationally expensive and slow, hindering interactive applications. DistriFusion tackles this by enabling efficient parallel processing on multiple GPUs. DistriFusion splits the input image into patches, assigns each to a GPU, and reuses activations from previous denoising steps (activation displacement) to maintain inter-patch interaction while minimizing communication overhead. Achieves up to 6.1x speedup on eight A100 GPUs compared to a single GPU on Stable Diffusion XL. Maintains comparable image quality to the original model across various metrics (PSNR, LPIPS, FID). Effectively hides communication overhead within computation using asynchronous communication and sparse operations. Speedups are limited for low-resolution images due to underutilized GPUs. May not be suitable for extremely-few-step sampling methods due to rapid denoising state changes. diffusion models, parallel computing, image generation, gpu acceleration, activation displacement
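For the displaced patch parallelism idea above, a single-process sketch of the core trick: each patch is processed with fresh local activations while the surrounding context is filled in from the activation map cached at the previous denoising step, exploiting the similarity of adjacent steps. In the real system each patch lives on its own GPU, the fresh map is exchanged with an asynchronous all-gather hidden behind computation, and the stale context enters through attention/convolution/group-norm rather than a full-map forward pass; all names here are ours.

```python
import torch

def displaced_patch_step(layer, x_patches, stale_full):
    """Apply `layer` to each patch (split along height) using stale context.
    x_patches  : list of per-device input patches for the current step
    stale_full : full activation map cached from the PREVIOUS denoising step
    Returns per-patch outputs plus the reassembled fresh map for the next step's cache.
    (For simplicity each call here runs on the full map and crops; a real device would
    only compute its own patch, with stale neighbours supplying the halo/KV context.)"""
    outputs, offset = [], 0
    fresh_full = stale_full.clone()
    for patch in x_patches:
        h = patch.shape[-2]
        context = stale_full.clone()
        context[..., offset:offset + h, :] = patch            # own region is always fresh
        outputs.append(layer(context)[..., offset:offset + h, :])
        fresh_full[..., offset:offset + h, :] = patch
        offset += h
    return outputs, fresh_full

layer = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
x_t = torch.randn(1, 4, 64, 64)                 # activations at the current step
cache = torch.randn(1, 4, 64, 64)               # activations cached from the previous step
outs, cache = displaced_patch_step(layer, list(x_t.chunk(2, dim=-2)), cache)
print([o.shape for o in outs])                  # two [1, 4, 32, 64] output patches
```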
2402.19479 Report Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks. Introduces Panda-70M, a large-scale video dataset with 70 million video clips and high-quality text captions, generated using multiple cross-modality teacher models and a fine-grained retrieval model for annotation selection. High-quality video-text data is crucial for training robust video-language models, but manual annotation is expensive and time-consuming, limiting the scale and quality of existing datasets. 1. Split 3.8M long videos into semantically coherent clips. 2. Generate multiple candidate captions per clip using eight cross-modality teacher models with different input modalities. 3. Train a fine-grained video-to-text retrieval model on a human-annotated subset to select the best caption as the final annotation. Pretraining on Panda-70M significantly improves performance on video captioning, video and text retrieval, and text-to-video generation tasks. The proposed captioning pipeline, using multiple teacher models and fine-grained retrieval, outperforms single models and generates captions comparable to human annotations. A student captioning model trained on Panda-70M with knowledge distillation outperforms any individual teacher model and benefits from multimodal inputs. The dataset primarily consists of vocal-intensive videos due to the source data. Focus on fine-grained, semantically consistent clips limits content diversity within a single video and average video length. video captioning, video-text retrieval, text-to-video generation, large-scale dataset, multimodal learning
2402.19474 Report The All-Seeing Project V2: Towards General Relation Comprehension of the Open World Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai We present the All-Seeing Project V2: a new model and dataset designed for understanding object relations in images. Specifically, we propose the All-Seeing Model V2 (ASMv2) that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation (ReC) task. Leveraging this unified task, our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them, diminishing the relation hallucination often encountered by Multi-modal Large Language Models (MLLMs). To facilitate training and evaluation of MLLMs in relation understanding, we created the first high-quality ReC dataset (AS-V2) which is aligned with the format of standard instruction tuning data. In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs. Notably, our ASMv2 achieves an overall accuracy of 52.04 on this relation-aware benchmark, surpassing the 43.14 of LLaVA-1.5 by a large margin. We hope that our work can inspire more future research and contribute to the evolution towards artificial general intelligence. Our project is released at https://github.com/OpenGVLab/all-seeing. The paper introduces the All-Seeing Project V2, a model and dataset for enhancing relation comprehension in Multi-modal Large Language Models (MLLMs). Existing MLLMs struggle to accurately comprehend relations between objects in images, leading to hallucinations and reliance on language priors. The authors propose a novel task called Relation Conversation (ReC) that unifies text generation, object localization, and relation comprehension. They also create a high-quality ReC dataset (AS-V2) and a benchmark for evaluating relation comprehension (CRPE). The proposed All-Seeing Model V2 (ASMv2) achieves state-of-the-art performance on Open-ended Scene Graph Generation and various image-level and region-level vision-language tasks. ASMv2 significantly outperforms existing MLLMs on CRPE, demonstrating superior relation comprehension ability. The paper shows that training with relation conversation data significantly improves region-level visual information understanding and relation comprehension. The paper acknowledges the need for more appropriate metrics for evaluating open-ended scene graph generation. Future work could explore more sophisticated methods for handling the imbalanced distribution of predicate labels in scene graph generation. multimodal large language model, relation comprehension, scene graph generation, grounded language understanding, vision-language reasoning
2402.19469 Report Humanoid Locomotion as Next Token Prediction Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories. This paper presents a novel approach to real-world humanoid control by framing it as a next-token prediction problem, similar to language modeling. This method leverages the power of generative modeling with transformers, successfully applied in fields like language processing, to address the challenge of real-world robot control. A causal transformer model is trained to autoregressively predict sensorimotor trajectories, incorporating data from various sources like pre-trained policies, model-based controllers, motion capture, and even YouTube videos. The model enables zero-shot real-world walking on diverse terrains, demonstrated through successful deployment on a Digit humanoid robot in San Francisco. The approach effectively incorporates incomplete trajectory data, such as video footage lacking action labels, leading to performance comparable to or exceeding state-of-the-art reinforcement learning methods. The model exhibits promising scaling properties, with performance improving with larger datasets, longer context lengths, and increased model size. The reliance on simulated data for pre-training may limit the model's ability to handle certain real-world scenarios not well-represented in simulation. The current study focuses on locomotion, and extending this approach to more complex manipulation tasks presents a future challenge. humanoid locomotion, generative modeling, transformer networks, next token prediction, real-world robotics
2402.19427 Report Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training. This paper presents Hawk and Griffin, novel recurrent neural network models for language modeling that address the scalability limitations of traditional RNNs and offer an efficient alternative to Transformers with global attention. Transformers, while dominant, struggle with long sequences due to the quadratic complexity of global attention. Recurrent models offer a solution by compressing sequences into a fixed-size state, but need to match Transformer performance and hardware efficiency. The authors develop the RG-LRU, a novel gated linear recurrent layer, and integrate it into Hawk, a pure RNN model. They also introduce Griffin, a hybrid model combining RG-LRU with local attention. The models are evaluated on language modeling tasks, scaling capabilities, training and inference speed, and long context modeling. Hawk and Griffin exhibit power-law scaling in held-out loss with increasing training FLOPs, achieving competitive performance with Transformers. Both models demonstrate superior inference throughput compared to Transformers, particularly on long sequences, due to their smaller cache size. Hawk and Griffin excel in long context modeling, extrapolating well to sequences longer than training data and efficiently learning copying and retrieval tasks. While showing promise in copying and retrieval tasks after training, pre-trained Hawk and Griffin models lag behind Transformers in these tasks without fine-tuning. Further research is needed to improve the copying and retrieval capabilities of these models when evaluating pre-trained models. language modeling, recurrent neural networks, transformers, long sequence modeling, inference efficiency
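A simplified sketch of a gated linear recurrence in the spirit of Hawk's RG-LRU layer is shown below; the gating formula and constants follow the paper's description only loosely and should be read as an illustration rather than reference code.

```python
import torch
import torch.nn as nn

class GatedLinearRecurrence(nn.Module):
    """Illustrative gated linear recurrent layer (RG-LRU-like).

    Per channel: a_t = a ** (c * r_t) with a = sigmoid(lam), and
    h_t = a_t * h_{t-1} + sqrt(1 - a_t**2) * (i_t * x_t).
    Initialisation and constants are assumptions, not the paper's exact recipe.
    """
    def __init__(self, dim, c=8.0):
        super().__init__()
        self.c = c
        self.lam = nn.Parameter(torch.randn(dim))   # controls the decay rate a
        self.recur_gate = nn.Linear(dim, dim)       # produces r_t
        self.input_gate = nn.Linear(dim, dim)       # produces i_t

    def forward(self, x):                           # x: (batch, seq, dim)
        a = torch.sigmoid(self.lam)
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        outs = []
        for t in range(x.size(1)):
            xt = x[:, t]
            r = torch.sigmoid(self.recur_gate(xt))
            i = torch.sigmoid(self.input_gate(xt))
            a_t = a.pow(self.c * r)                 # per-step, per-channel decay
            h = a_t * h + torch.sqrt(1 - a_t**2 + 1e-6) * (i * xt)
            outs.append(h)
        return torch.stack(outs, dim=1)

y = GatedLinearRecurrence(16)(torch.randn(2, 10, 16))
print(y.shape)  # torch.Size([2, 10, 16])
```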
2402.19150 Report Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model Hao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, Renjing Xu Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, the Typographic Attack, which disrupts vision-language models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), has also been expected to be a security threat to LVLMs. Firstly, we verify typographic attacks on current well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Secondly, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only considers the evaluation of typographic attacks under various multi-modal tasks but also evaluates the effects of typographic attacks, influenced by texts generated with diverse factors. Based on the evaluation results, we investigate the causes why typographic attacks may impact VLMs and LVLMs, leading to three highly insightful discoveries. By the examination of our discoveries and experimental validation in the Typographic Dataset, we reduce the performance degradation from $42.07\%$ to $13.90\%$ when LVLMs confront typographic attacks. This paper investigates the vulnerability of Large Vision-Language Models (LVLMs) to typographic attacks, proposing a comprehensive Typographic Dataset (TypoD) to evaluate this weakness across diverse multi-modal tasks and typographic factors. This research is crucial because LVLMs are increasingly used in real-world applications, and their susceptibility to typographic attacks poses significant security risks. The authors created TypoD, containing images with strategically embedded typographic errors, to assess the performance degradation of LVLMs under different tasks. They analyzed the attention mechanisms of both vision encoders (like CLIP) and LLMs within LVLMs to understand the root cause of this vulnerability. Typographic attacks significantly degrade the performance of LVLMs across various tasks, including object recognition, visual attribute detection, enumeration, and commonsense reasoning. The severity of typographic attacks is positively correlated with the visibility of the embedded typographic text, with larger and more opaque text causing more significant performance drops. Augmenting the prompts given to LVLMs with additional information, particularly detailed image descriptions, can mitigate the impact of typographic attacks by redirecting the model's attention away from the misleading text. The research primarily focuses on open-source LVLMs, and further investigation is needed to assess the robustness of commercial LVLMs against typographic attacks. While the proposed prompt engineering techniques effectively mitigate the impact of typographic attacks, there might be limitations in their generalizability and effectiveness against more sophisticated attack strategies. large vision-language models, typographic attack, multi-modal learning, robustness, vision-language tasks
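As a rough illustration of how a typographic-attack sample is constructed, the sketch below pastes a misleading word onto an image with PIL; the placement and opacity knobs stand in for the "visibility" factors the dataset varies and are assumptions, not the benchmark's exact settings.

```python
from PIL import Image, ImageDraw

def add_typographic_text(image_path, text, out_path, opacity=255):
    """Minimal sketch of a typographic-attack image: overlay a misleading word
    onto the picture. The default bitmap font is used purely for illustration."""
    img = Image.open(image_path).convert("RGBA")
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    draw.text((int(0.05 * img.width), int(0.05 * img.height)),
              text, fill=(255, 255, 255, opacity))
    Image.alpha_composite(img, overlay).convert("RGB").save(out_path)

# Example (hypothetical files): label a cat photo with the word "dog".
# add_typographic_text("cat.jpg", "dog", "cat_typo.jpg")
```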
2402.18956 Report WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts Yong Hyun Ahn, Hyeon Bae Kim, Seong Tae Kim Recent advancements in neural networks have showcased their remarkable capabilities across various domains. Despite these successes, the "black box" problem still remains. Addressing this, we propose a novel framework, WWW, that offers the 'what', 'where', and 'why' of the neural network decisions in human-understandable terms. Specifically, WWW utilizes adaptive selection for concept discovery, employing adaptive cosine similarity and thresholding techniques to effectively explain 'what'. To address the 'where' and 'why', we proposed a novel combination of neuron activation maps (NAMs) with Shapley values, generating localized concept maps and heatmaps for individual inputs. Furthermore, WWW introduces a method for predicting uncertainty, leveraging heatmap similarities to estimate 'how' reliable the prediction is. Experimental evaluations of WWW demonstrate superior performance in both quantitative and qualitative metrics, outperforming existing methods in interpretability. WWW provides a unified solution for explaining 'what', 'where', and 'why', introducing a method for localized explanations from global interpretations and offering a plug-and-play solution adaptable to various architectures. This paper proposes WWW, a novel framework that offers interpretability for neural network decisions by explaining 'what', 'where', and 'why' in human-understandable terms. The "black box" problem of neural networks hinders their wider adoption. WWW addresses this by providing clear explanations of decision-making processes, improving trust and reliability. WWW utilizes adaptive cosine similarity and adaptive selection for concept discovery ('what'), combines neuron activation maps with Shapley values for localized concept maps and heatmaps ('where' and 'why'), and introduces heatmap similarities for uncertainty prediction. WWW demonstrates superior quantitative performance in concept discovery compared to existing methods like CLIP-Dissect, MILAN, and FALCON. WWW provides qualitative explanations by identifying important neurons and concepts, showing robust interpretations across different model layers. Heatmap similarity analysis in WWW effectively predicts uncertainty, potentially enabling the identification of mispredictions. The current implementation of WWW focuses on image classification tasks. Future work includes exploring the generalization of WWW for other data modalities. interpretable machine learning, concept-based explanations, neuron-concept association, uncertainty prediction, explainable ai
2402.18929 Report Navigating Beyond Dropout: An Intriguing Solution Towards Generalizable Image Super Resolution Hongjun Wang, Jiyuan Chen, Yinqiang Zheng, Tieyong Zeng Deep learning has led to a dramatic leap on Single Image Super-Resolution (SISR) performances in recent years. While most existing work assumes a simple and fixed degradation model (e.g., bicubic downsampling), the research of Blind SR seeks to improve model generalization ability with unknown degradation. Recently, Kong et al. pioneered the investigation of a more suitable training strategy for Blind SR using Dropout. Although this method indeed brings substantial generalization improvements via mitigating overfitting, we argue that Dropout simultaneously introduces an undesirable side-effect that compromises the model's capacity to faithfully reconstruct fine details. We show both theoretical and experimental analyses in our paper, and furthermore, we present another easy yet effective training strategy that enhances the generalization ability of the model by simply modulating its first- and second-order feature statistics. Experimental results have shown that our method could serve as a model-agnostic regularization and outperforms Dropout on seven benchmark datasets including both synthetic and real-world scenarios. This paper proposes a simple statistical alignment method as a regularization technique for Blind Super-Resolution (SR) to enhance model generalization against unknown degradations. Current Blind SR models, even trained with diverse degradations, tend to overfit specific degradation types, limiting their ability to generalize to unseen degradation scenarios. The method aligns first and second order feature statistics (mean and covariance) of image pairs with identical content but different degradations. This encourages the model to learn degradation-invariant features, improving its generalization ability. The proposed method consistently outperforms Dropout regularization on seven benchmark datasets, demonstrating its effectiveness. Significant performance improvements are observed, particularly in cases where test degradations deviate from the training distribution, indicating enhanced generalization. The method integrates seamlessly with existing data-driven Blind SR methods and can be applied to various SR models. The choice of mean and covariance as statistical indicators, while showing empirical effectiveness, lacks a strong theoretical justification. Further investigation into the impact of different statistical alignment strategies and their theoretical underpinnings is needed. blind super-resolution, regularization, degradation-invariant features, statistical alignment, generalization
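The alignment of first- and second-order feature statistics lends itself to a compact loss. The sketch below is an assumption-laden reading of the description, not the authors' code: it matches channel means and covariances of features extracted from two differently degraded versions of the same content.

```python
import torch

def stat_alignment_loss(feat_a, feat_b):
    """Align mean and covariance of two feature maps taken from the same
    content under different degradations (illustrative sketch).

    feat_a, feat_b: (B, C, H, W) features from a shared SR backbone.
    """
    b, c = feat_a.shape[:2]
    fa = feat_a.reshape(b, c, -1)
    fb = feat_b.reshape(b, c, -1)

    mu_a, mu_b = fa.mean(dim=2), fb.mean(dim=2)              # (B, C)
    ca = fa - mu_a.unsqueeze(2)
    cb = fb - mu_b.unsqueeze(2)
    cov_a = ca @ ca.transpose(1, 2) / (fa.shape[2] - 1)      # (B, C, C)
    cov_b = cb @ cb.transpose(1, 2) / (fb.shape[2] - 1)

    return torch.mean((mu_a - mu_b) ** 2) + torch.mean((cov_a - cov_b) ** 2)

loss = stat_alignment_loss(torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16))
print(loss.item())
```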
2402.18848 Report SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting Hoon Kim, Minje Jang, Wonjun Yoon, Jisoo Lee, Donghyun Na, Sanghyun Woo We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework. Drawing on the Cook-Torrance reflectance model, we have meticulously configured the architecture design to precisely simulate light-surface interactions. Furthermore, to overcome the limitation of scarce high-quality lightstage data, we have developed a self-supervised pre-training strategy. This novel combination of accurate physical modeling and expanded training dataset establishes a new benchmark in relighting realism. This paper proposes SwitchLight, a novel framework for human portrait relighting that combines a physics-guided architecture based on the Cook-Torrance reflectance model with a self-supervised pre-training framework called Multi-Masked Autoencoder (MMAE). Relighting human portraits is crucial for various applications, including VR/AR and digital content creation. Existing methods either lack realism or struggle with the limited availability of high-quality training data. This work addresses these limitations by enhancing physical accuracy and expanding training data through self-supervision. The SwitchLight architecture comprises multiple neural networks for predicting surface normals, lighting, diffuse rendering, specular attributes, and final relighting. MMAE, inspired by Masked Autoencoder, uses dynamic masking strategies and a generative target to pre-train the model on unlabeled data, improving feature representations for relighting. SwitchLight outperforms state-of-the-art methods in both quantitative metrics and qualitative comparisons, demonstrating enhanced realism in lighting, specular highlights, and skin tones. MMAE pre-training significantly improves performance compared to training solely on labeled data, highlighting the benefits of self-supervision for relighting. Ablation studies validate the advantages of predicting diffuse rendering over direct albedo prediction and demonstrate the effectiveness of MMAE's design choices. The model struggles with removing strong shadows and accurately relighting reflective surfaces or face paint. Future work includes extending the framework to handle video and 3D data. image relighting, human portrait, cook-torrance model, self-supervised learning, masked autoencoder
2402.18842 Report ViewFusion: Towards Multi-View Consistency via Interpolated Denoising Xianghui Yang, Yan Zuo, Sameera Ramasinghe, Loris Bazzani, Gil Avraham, Anton van den Hengel Novel-view synthesis through diffusion models has demonstrated remarkable potential for generating diverse and high-quality images. Yet, the independent process of image generation in these prevailing methods leads to challenges in maintaining multiple-view consistency. To address this, we introduce ViewFusion, a novel, training-free algorithm that can be seamlessly integrated into existing pre-trained diffusion models. Our approach adopts an auto-regressive method that implicitly leverages previously generated views as context for the next view generation, ensuring robust multi-view consistency during the novel-view generation process. Through a diffusion process that fuses known-view information via interpolated denoising, our framework successfully extends single-view conditioned models to work in multiple-view conditional settings without any additional fine-tuning. Extensive experimental results demonstrate the effectiveness of ViewFusion in generating consistent and detailed novel views. This paper introduces Interpolated Denoising Diffusion Model (IDDM), a training-free algorithm that improves multi-view consistency in novel view synthesis using pre-trained diffusion models. Existing diffusion-based novel view synthesis methods often produce inconsistent images across different viewpoints, hindering applications like 3D reconstruction. IDDM addresses this limitation without requiring additional training or fine-tuning. IDDM incorporates an auto-regressive process into the diffusion process. It uses a novel Interpolated Denoising technique that leverages previously generated views as context during the generation of subsequent views, thus improving consistency. IDDM significantly enhances multi-view consistency compared to baseline models, as evidenced by quantitative metrics (LPIPS, SIFT, CLIP) and qualitative comparisons. It enables single-view conditioned diffusion models to operate effectively in multi-view conditioned settings, leading to improved novel-view synthesis and 3D reconstruction quality. The method demonstrates strong performance on out-of-distribution datasets like ABO and GSO, showcasing its generalizability. IDDM's sequential generation process, while improving consistency, requires additional memory and can be more time-consuming than parallel generation methods. The effectiveness of IDDM relies on the pre-trained base model (e.g., Zero-1-to-3); if the base model fails, IDDM might not fully compensate for those shortcomings. novel view synthesis, diffusion models, multi-view consistency, 3d reconstruction, auto-regressive models
2402.18780 Report A Quantitative Evaluation of Score Distillation Sampling Based Text-to-3D Xiaohan Fei, Chethan Parameshwara, Jiawei Mo, Xiaolong Li, Ashwin Swaminathan, CJ Taylor, Paolo Favaro, Stefano Soatto The development of generative models that create 3D content from a text prompt has made considerable strides thanks to the use of the score distillation sampling (SDS) method on pre-trained diffusion models for image generation. However, the SDS method is also the source of several artifacts, such as the Janus problem, the misalignment between the text prompt and the generated 3D model, and 3D model inaccuracies. While existing methods heavily rely on the qualitative assessment of these artifacts through visual inspection of a limited set of samples, in this work we propose more objective quantitative evaluation metrics, which we cross-validate via human ratings, and show analysis of the failure cases of the SDS technique. We demonstrate the effectiveness of this analysis by designing a novel computationally efficient baseline model that achieves state-of-the-art performance on the proposed metrics while addressing all the above-mentioned artifacts. This paper introduces a novel evaluation protocol for text-to-3D generation models and proposes a new baseline method based on Gaussian Splatting. Existing evaluation methods for text-to-3D models lack objectivity and comprehensiveness, hindering systematic progress in the field. The authors propose quantitative metrics to evaluate the frequency of the "Janus problem" (object duplication across viewpoints), text and 3D alignment, and the realism of generated 3D models. They also present a two-stage generation method using MVDream and Gaussian Splatting for efficiency and realism. The proposed method achieves state-of-the-art performance on the introduced metrics, demonstrating its effectiveness in mitigating the Janus problem and generating high-fidelity 3D content. Analysis of existing methods reveals a high prevalence of the Janus problem, highlighting the need for robust evaluation. The study confirms a trade-off between realism and the Janus problem in refinement stages, emphasizing the need for balanced optimization. The paper relies on manual inspection for detecting the Janus problem, calling for future development of automatic evaluation methods. Future work can explore further efficiency improvements and leverage real-world and synthetic data for enhanced diversity and realism in generated 3D content. text-to-3d generation, score distillation sampling, gaussian splatting, evaluation protocol, janus problem
2402.18331 Report FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes Ziying Pan, Kun Wang, Gang Li, Feihong He, Xiwang Li, Yongxuan Lai The class-conditional image generation based on diffusion models is renowned for generating high-quality and diverse images. However, most prior efforts focus on generating images for general categories, e.g., 1000 classes in ImageNet-1k. A more challenging task, large-scale fine-grained image generation, remains the boundary to explore. In this work, we present a parameter-efficient strategy, called FineDiffusion, to fine-tune large pre-trained diffusion models scaling to large-scale fine-grained image generation with 10,000 categories. FineDiffusion significantly accelerates training and reduces storage overhead by only fine-tuning tiered class embedder, bias terms, and normalization layers' parameters. To further improve the image generation quality of fine-grained categories, we propose a novel sampling method for fine-grained image generation, which utilizes superclass-conditioned guidance, specifically tailored for fine-grained categories, to replace the conventional classifier-free guidance sampling. Compared to full fine-tuning, FineDiffusion achieves a remarkable 1.56x training speed-up and requires storing merely 1.77% of the total model parameters, while achieving state-of-the-art FID of 9.776 on image generation of 10,000 classes. Extensive qualitative and quantitative experiments demonstrate the superiority of our method compared to other parameter-efficient fine-tuning methods. The code and more generated results are available at our project website: https://finediffusion.github.io/. This paper presents FineDiffusion, a parameter-efficient fine-tuning strategy for large-scale (10,000+ categories) fine-grained image generation using diffusion models. Large-scale fine-grained image generation is challenging and computationally expensive using traditional diffusion model training. This work offers a faster, more efficient approach. FineDiffusion leverages a pre-trained DiT model and fine-tunes only specific parameters: a proposed TieredEmbedder (encoding hierarchical class labels), bias terms, and normalization layers. A novel sampling method using superclass-conditioned guidance further improves generation. FineDiffusion achieves a 1.56x training speed-up and requires storing only 1.77% of total model parameters compared to full fine-tuning. It achieves state-of-the-art FID of 9.776 on the iNaturalist 2021 mini dataset (10,000 classes). FineDiffusion outperforms other parameter-efficient fine-tuning methods (BitFit, DiffFit) in FID and LPIPS scores on various fine-grained datasets. The current implementation focuses on image generation from class labels; exploring text-guided fine-grained generation is a potential future direction. Investigating the impact of different pre-trained diffusion models and dataset scales on FineDiffusion's performance is of interest. fine-grained image generation, diffusion models, parameter-efficient fine-tuning, classifier-free guidance, hierarchical class embedding
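The superclass-conditioned guidance can be written as a one-line change to classifier-free guidance, replacing the unconditional branch with a superclass-conditioned prediction; the `model` interface below is hypothetical and the guidance weight is an illustrative default.

```python
import torch

def superclass_guided_eps(model, x_t, t, fine_label, super_label, w=4.0):
    """Sampling-time guidance sketch: use the superclass-conditioned noise
    prediction as the reference and push towards the fine-grained class.
    `model` is any label-conditioned noise predictor (assumed interface)."""
    eps_fine = model(x_t, t, fine_label)     # fine-grained class condition
    eps_super = model(x_t, t, super_label)   # superclass condition
    return eps_super + w * (eps_fine - eps_super)

# Toy usage with a stand-in noise predictor.
dummy = lambda x, t, y: torch.randn_like(x)
eps = superclass_guided_eps(dummy, torch.randn(1, 4, 32, 32), t=10,
                            fine_label=7, super_label=1)
print(eps.shape)
```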
2402.18192 Report Misalignment-Robust Frequency Distribution Loss for Image Transformation Zhangkai Ni, Juncheng Wu, Zian Wang, Wenhan Yang, Hanli Wang, Lin Ma This paper aims to address a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution, which heavily rely on precisely aligned paired datasets with pixel-level alignments. However, creating precisely aligned paired images presents significant challenges and hinders the advancement of methods trained on such data. To overcome this challenge, this paper introduces a novel and simple Frequency Distribution Loss (FDL) for computing distribution distance within the frequency domain. Specifically, we transform image features into the frequency domain using Discrete Fourier Transformation (DFT). Subsequently, frequency components (amplitude and phase) are processed separately to form the FDL loss function. Our method is empirically proven effective as a training constraint due to the thoughtful utilization of global information in the frequency domain. Extensive experimental evaluations, focusing on image enhancement and super-resolution tasks, demonstrate that FDL outperforms existing misalignment-robust loss functions. Furthermore, we explore the potential of our FDL for image style transfer that relies solely on completely misaligned data. Our code is available at: https://github.com/eezkni/FDL This paper introduces Frequency Distribution Loss (FDL), a novel loss function designed to enhance the robustness of image transformation models when dealing with misaligned training data. Existing image transformation methods heavily rely on precisely aligned datasets, which are often challenging to obtain, particularly for tasks involving natural distortions like style transfer. FDL leverages the Discrete Fourier Transform (DFT) to transform image features into the frequency domain. It then calculates the Sliced Wasserstein Distance (SWD) between the amplitude and phase components of the predicted and target image features. FDL consistently outperforms existing misalignment-robust loss functions in image enhancement and super-resolution tasks. The method effectively mitigates artifacts and preserves structural details even in the presence of significant geometric misalignments. FDL demonstrates promising results for image style transfer, effectively capturing and transferring structural styles. The choice of feature extractor and the weighting parameter for different frequency components may require task-specific tuning. Future work could explore assigning different attention weights to distinct frequency domain regions for further performance improvement. image transformation, misaligned data, frequency distribution loss, deep learning, computer vision
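A minimal sketch of the loss is given below, under the assumption that it combines an FFT of deep features, a split into amplitude and phase, and a sliced Wasserstein distance over each component; the weighting and the choice of feature extractor are illustrative, not the paper's exact configuration.

```python
import torch

def sliced_wasserstein(a, b, n_proj=64):
    """1D sliced Wasserstein distance between two equal-size point sets (N, C)."""
    proj = torch.randn(a.shape[1], n_proj, device=a.device)
    proj = proj / proj.norm(dim=0, keepdim=True)
    pa, _ = torch.sort(a @ proj, dim=0)
    pb, _ = torch.sort(b @ proj, dim=0)
    return ((pa - pb) ** 2).mean()

def frequency_distribution_loss(feat_pred, feat_target, phase_weight=1.0):
    """Sketch of a frequency-domain distribution loss over (B, C, H, W) features."""
    f_pred = torch.fft.fft2(feat_pred)
    f_tgt = torch.fft.fft2(feat_target)

    def to_points(z):   # (B, C, H, W) -> (B*H*W, C) set of per-position vectors
        return z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])

    amp_loss = sliced_wasserstein(to_points(f_pred.abs()), to_points(f_tgt.abs()))
    pha_loss = sliced_wasserstein(to_points(f_pred.angle()), to_points(f_tgt.angle()))
    return amp_loss + phase_weight * pha_loss

loss = frequency_distribution_loss(torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16))
print(loss.item())
```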
2402.18068 Report SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model Bin Cao, Jianhao Yuan, Yexin Liu, Jian Li, Shuyang Sun, Jing Liu, Bo Zhao In the rapidly evolving area of image synthesis, a serious challenge is the presence of complex artifacts that compromise perceptual realism of synthetic images. To alleviate artifacts and improve quality of synthetic images, we fine-tune Vision-Language Model (VLM) as artifact classifier to automatically identify and classify a wide range of artifacts and provide supervision for further optimizing generative models. Specifically, we develop a comprehensive artifact taxonomy and construct a dataset of synthetic images with artifact annotations for fine-tuning VLM, named SynArtifact-1K. The fine-tuned VLM exhibits superior ability of identifying artifacts and outperforms the baseline by 25.66%. To our knowledge, this is the first time such end-to-end artifact classification task and solution have been proposed. Finally, we leverage the output of VLM as feedback to refine the generative model for alleviating artifacts. Visualization results and user study demonstrate that the quality of images synthesized by the refined diffusion model has been obviously improved. This paper presents SynArtifact-1K, the first synthetic image dataset annotated with artifact categories, descriptions, and coordinates, to address the challenge of artifact classification and alleviation in synthetic images. Existing image synthesis methods often lack the ability to effectively identify and alleviate artifacts, hindering the realism of generated images. This work provides a solution by classifying various types of artifacts and using this information to improve generative models. The authors first create a comprehensive taxonomy of common artifacts. Then, they construct SynArtifact-1K and use it to fine-tune a Vision-Language Model (VLM) for artifact classification. Finally, they leverage the output of the VLM as AI feedback to guide the optimization of a diffusion model through Reinforcement Learning from AI Feedback (RLAIF). The fine-tuned VLM outperforms the baseline by 25.66% in classification accuracy on SynArtifact-1K, demonstrating its effectiveness in artifact identification. The VLM demonstrates promising preliminary results for artifact detection, paving the way for more explainable quality assessment of synthetic images. By integrating the VLM feedback, the refined diffusion model generates higher-quality images with fewer artifacts, as evidenced by visualization and user study. The size of the SynArtifact-1K dataset is limited, and a larger dataset could potentially further improve the performance of both artifact classification and alleviation. The VLM used for artifact detection lacks inherent localization abilities, suggesting potential for future work exploring VLMs with stronger visual grounding. image synthesis, artifact classification, artifact alleviation, vision-language model, reinforcement learning from ai feedback
2402.18039 Report ResLoRA: Identity Residual Mapping in Low-Rank Adaption Shuhua Shi, Shaohan Huang, Minghui Song, Zhoujun Li, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang As one of the most popular parameter-efficient fine-tuning (PEFT) methods, low-rank adaptation (LoRA) is commonly applied to fine-tune large language models (LLMs). However, updating the weights of LoRA blocks effectively and expeditiously is challenging due to the long calculation path in the original model. To address this, we propose ResLoRA, an improved framework of LoRA. By adding residual paths during training and using merging approaches to eliminate these extra paths during inference, our method can achieve better results in fewer training steps without any extra trainable parameters or inference cost compared to LoRA. The experiments on NLG, NLU, and text-to-image tasks demonstrate the effectiveness of our method. To the best of our knowledge, ResLoRA is the first work that combines the residual path with LoRA. The code of our method is available at https://github.com/microsoft/LMOps/tree/main/reslora . This paper introduces ResLoRA, a novel framework that enhances Low-Rank Adaptation (LoRA) for fine-tuning large language models (LLMs) by incorporating residual connections to expedite and stabilize the training process. LoRA, while effective, suffers from limitations in efficient weight updates due to long calculation paths. ResLoRA addresses this issue, aiming for faster convergence and improved performance. ResLoRA integrates residual paths within LoRA blocks during training, exploring three structures: input-shortcut (is), block-shortcut (bs), and middle-shortcut (ms). Merging approaches are then employed to eliminate the extra paths during inference, ensuring no additional computational cost. ResLoRA consistently outperforms standard LoRA and other variants in natural language generation (NLG) and natural language understanding (NLU) tasks, achieving accuracy improvements ranging from 1% to 20%. Experiments demonstrate that incorporating residual paths accelerates training convergence, evidenced by significantly lower loss values compared to standard LoRA. Analysis of trained matrix weights reveals that ResLoRA promotes more complex weight patterns, potentially contributing to its superior performance. While not adding trainable parameters, ResLoRA's training incurs higher computational cost than standard LoRA due to the use of previous blocks in calculations. The merging approaches, while effective, introduce minor accuracy degradation, necessitating the development of more efficient merging strategies. parameter-efficient fine-tuning, large language models, low-rank adaptation, residual networks, deep learning
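The input-shortcut flavor of the residual path can be sketched as a LoRA linear layer whose low-rank branch also sees the previous block's input, shortening the gradient path during training; the interface and scaling below are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class ResLoRALinear(nn.Module):
    """Sketch of LoRA with an input-shortcut residual path: the low-rank
    update also receives the previous block's input (illustrative only)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x, prev_x=None):
        lora_in = x if prev_x is None else x + prev_x   # residual from the previous block
        return self.base(x) + self.scale * (lora_in @ self.A.t()) @ self.B.t()

layer = ResLoRALinear(nn.Linear(32, 32))
x_prev, x = torch.randn(4, 32), torch.randn(4, 32)
print(layer(x, prev_x=x_prev).shape)  # torch.Size([4, 32])
```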
2402.17910 Report Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models Ashkan Taghipour, Morteza Ghahremani, Mohammed Bennamoun, Aref Miri Rekavandi, Hamid Laga, Farid Boussaid While latent diffusion models (LDMs) excel at creating imaginative images, they often lack precision in semantic fidelity and spatial control over where objects are generated. To address these deficiencies, we introduce the Box-it-to-Bind-it (B2B) module - a novel, training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models. B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance. The process encompasses two main steps: i) Object generation, which adjusts the latent encoding to guarantee object generation and directs it within specified bounding boxes, and ii) attribute binding, guaranteeing that generated objects adhere to their specified attributes in the prompt. B2B is designed as a compatible plug-and-play module for existing T2I models, markedly enhancing model performance in addressing the key challenges. We evaluate our technique using the established CompBench and TIFA score benchmarks, demonstrating significant performance improvements compared to existing methods. The source code will be made publicly available at https://github.com/nextaistudio/BoxIt2BindIt. This paper introduces Box-it-to-Bind-it (B2B), a training-free plug-and-play module for enhancing spatial control and semantic accuracy in text-to-image diffusion models. Existing text-to-image models struggle with accurately binding attributes to objects and precisely controlling object placement according to a specified layout. B2B uses a two-step process: 1) Object generation: adjusts latent encoding to guarantee object generation within specified bounding boxes, utilizing LLMs for layout guidance. 2) Attribute binding: ensures generated objects adhere to their specified attributes in the prompt. B2B achieves state-of-the-art results on CompBench and TIFA benchmarks for color and texture binding. It significantly improves spatial reasoning, enabling more precise object placement within specified layouts. B2B's plug-and-play nature is demonstrated by its successful integration with both Stable Diffusion and GLIGEN models, enhancing their performance. The paper acknowledges the potential for further improvement in spatial reasoning. Future work may explore extending B2B to handle more complex relationships between objects and attributes. text-to-image generation, diffusion models, attribute binding, spatial control, layout guidance
2402.17863 Report Vision Transformers with Natural Language Semantics Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific semantic context, making interpretation difficult and failing to effectively encapsulate information. We introduce a novel transformer model, Semantic Vision Transformers (sViT), which leverages recent progress on segmentation models to design novel tokenizer strategies. sViT effectively harnesses semantic information, creating an inductive bias reminiscent of convolutional neural networks while capturing global dependencies and contextual information within images that are characteristic of transformers. Through validation using real datasets, sViT demonstrates superiority over ViT, requiring less training data while maintaining similar or superior performance. Furthermore, sViT demonstrates significant superiority in out-of-distribution generalization and robustness to natural distribution shifts, attributed to its scale invariance semantic characteristic. Notably, the use of semantic tokens significantly enhances the model's interpretability. Lastly, the proposed paradigm facilitates the introduction of new and powerful augmentation techniques at the token (or segment) level, increasing training data diversity and generalization capabilities. Just as sentences are made of words, images are formed by semantic objects; our proposed methodology leverages recent progress in object segmentation and takes an important and natural step toward interpretable and robust vision transformers. This paper introduces Semantic Vision Transformers (sViT), a novel vision transformer model that leverages semantic segmentation for tokenization, enhancing performance and interpretability. Current Vision Transformers (ViT) lack semantic information in their tokens, hindering their interpretability and efficiency, especially compared to NLP transformers that process meaningful words. sViT utilizes the Segment Anything Model (SAM) for semantic segmentation, treating each segment as a token. It introduces positional and scale embeddings based on segment location and size. The model is trained on scene recognition and object-centric datasets. sViT outperforms ViT on non-object-centric datasets, especially with limited data. sViT exhibits superior generalization to object-centric datasets, demonstrating scale invariance. sViT significantly improves interpretability, highlighting semantically meaningful regions. sViT has higher computational cost during inference due to the additional segmentation step. Future work could explore more efficient segmentation models to reduce computational overhead. vision transformer, semantic segmentation, interpretability, out-of-distribution generalization, data augmentation
2402.17766 Report ShapeLLM: Universal 3D Object Understanding for Embodied Interaction Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, He Wang, Li Yi, Kaisheng Ma This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated evaluation benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. To bridge the gap between LLMs and 3D object understanding, particularly in embodied interaction, where precise geometry and interaction knowledge are crucial. Leverages 3D point clouds as inputs, introduces selective multi-view distillation in the 3D encoder (extending ReCon to ReCon++), and employs 3D visual instruction tuning with data constructed using GPT-4V. ReCon++ achieves state-of-the-art performance in 3D object recognition, surpassing previous best records on ScanObjectNN and ModelNet40. ShapeLLM successfully unifies various downstream tasks, including 3D captioning, 3D VQA, embodied task planning & decomposition, and 3D embodied visual grounding. On the newly constructed 3D MM-Vet benchmark, ShapeLLM outperforms previous 3D point cloud-based methods, achieving 49.3% total accuracy. ShapeLLM's training data for embodied interaction is limited to indoor articulated furniture. Real-time deployment requires addressing efficiency concerns, potentially through model compression techniques. multimodal large language model, 3d object understanding, embodied interaction, 3d point cloud processing, visual instruction tuning
2402.17726 Report VRP-SAM: SAM with Visual Reference Prompt Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and perform segmentation of specific objects in the target image. Note that the VRP encoder can support a variety of annotation formats for reference images, including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to perform segmentation of unseen objects and enabling cross-domain segmentation. The source code and models will be available at \url{https://github.com/syp2ysy/VRP-SAM} This paper proposes VRP-SAM, an extension of the Segment Anything Model (SAM) by incorporating a Visual Reference Prompt (VRP) encoder, enabling SAM to perform visual reference segmentation using annotated reference images as prompts. Existing prompt formats in SAM pose challenges for complex scenes and numerous images as they require user familiarity with target objects and custom prompts for each image. VRP-SAM addresses these limitations by using annotated reference images to guide segmentation, improving efficiency and reducing reliance on user input. VRP-SAM introduces a VRP encoder that processes annotated reference images and generates prompt embeddings. These embeddings guide SAM's mask decoder to segment target objects with similar semantics. The VRP encoder utilizes meta-learning by extracting object prototypes from reference images to enhance target object representation in both reference and target images. Learnable queries interact with enhanced features to generate prompt embeddings for the SAM decoder. VRP-SAM achieves state-of-the-art performance in visual reference segmentation with minimal learnable parameters, outperforming previous methods on PASCAL-5i and COCO-20i datasets. VRP-SAM effectively addresses limitations of geometric prompts, demonstrating superior performance by avoiding false-positive prompts often generated by geometric approaches. VRP-SAM shows strong generalization capability, effectively handling unknown objects and cross-domain scenarios, as evidenced by domain shift experiments and visualization on diverse image styles. The current work focuses on few-shot semantic segmentation, with future exploration aimed at extending VRP-SAM to a wider range of vision tasks such as video object segmentation and object tracking. Further investigation is needed to explore the full potential of VRP-SAM in more complex real-world applications and diverse datasets. visual reference segmentation, segment anything model (sam), meta-learning, visual reference prompt, few-shot learning
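The meta-learning step the VRP encoder builds on, extracting an object prototype from the annotated reference and comparing it against target features, can be sketched with masked average pooling and cosine similarity; this is a simplified reading of the described pipeline, not the released model.

```python
import torch
import torch.nn.functional as F

def masked_average_prototype(feat, mask):
    """Object prototype from reference features via masked average pooling.
    feat: (B, C, H, W) reference-image features; mask: (B, 1, h, w) binary
    annotation (mask/scribble/box/point rasterised to a map)."""
    mask = F.interpolate(mask.float(), size=feat.shape[-2:], mode="nearest")
    proto = (feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
    return proto                                              # (B, C)

def prototype_similarity(feat_target, proto):
    """Cosine-similarity map between target features and the prototype,
    highlighting the referenced object in the target image (sketch)."""
    proto = proto[:, :, None, None].expand_as(feat_target)
    return F.cosine_similarity(feat_target, proto, dim=1)     # (B, H, W)

feat_ref, feat_tgt = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
mask = (torch.rand(1, 1, 128, 128) > 0.5)
sim = prototype_similarity(feat_tgt, masked_average_prototype(feat_ref, mask))
print(sim.shape)  # torch.Size([1, 32, 32])
```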
2402.17563 Report Structure-Guided Adversarial Training of Diffusion Models Ling Yang, Haotian Qian, Zhilong Zhang, Jingwei Liu, Bin Cui Diffusion models have demonstrated exceptional efficacy in various generative applications. While existing models focus on minimizing a weighted sum of denoising score matching losses for data distribution modeling, their training primarily emphasizes instance-level optimization, overlooking valuable structural information within each mini-batch, indicative of pair-wise relationships among samples. To address this limitation, we introduce Structure-guided Adversarial training of Diffusion Models (SADM). In this pioneering approach, we compel the model to learn manifold structures between samples in each training batch. To ensure the model captures authentic manifold structures in the data distribution, we advocate adversarial training of the diffusion generator against a novel structure discriminator in a minimax game, distinguishing real manifold structures from the generated ones. SADM substantially improves existing diffusion transformers (DiT) and outperforms existing methods in image generation and cross-domain fine-tuning tasks across 12 datasets, establishing a new state-of-the-art FID of 1.58 and 2.11 on ImageNet for class-conditional image generation at resolutions of 256x256 and 512x512, respectively. Introduces Structure-guided Adversarial training of Diffusion Models (SADM) that compels the model to learn manifold structures between samples in each training batch, enhancing data distribution modeling. Existing diffusion models focus on instance-level optimization, neglecting valuable structural information within mini-batches that indicate pair-wise relationships among samples, hindering accurate data distribution modeling. Employs adversarial training between the diffusion generator and a novel structure discriminator. The discriminator distinguishes real manifold structures from generated ones, encouraging the generator to learn authentic data manifold structures. Significantly improves existing diffusion transformers and surpasses existing methods in image generation and cross-domain fine-tuning across 12 datasets. Achieves state-of-the-art FID scores on ImageNet for class-conditional image generation (1.58 for 256x256 and 2.11 for 512x512 resolutions). Demonstrates potential for rapid adaptation to new domains in cross-domain fine-tuning tasks. Reliance on pre-trained feature extractors for the structure discriminator. Potential limitations in generalizing to highly complex or diverse datasets. diffusion models, generative models, adversarial training, manifold learning, image generation
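One way to picture the "manifold structure" signal is a pairwise-similarity matrix over the mini-batch, scored by a small discriminator trained adversarially against the generator; the encoding and loss below are a guess at the general idea under stated assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def batch_structure(features):
    """Pairwise cosine-similarity matrix of mini-batch features (B, D) -> (B, B),
    used here as a stand-in for the batch's manifold structure."""
    f = nn.functional.normalize(features, dim=1)
    return f @ f.t()

class StructureDiscriminator(nn.Module):
    """Tiny MLP scoring whether a flattened structure matrix comes from real
    or generated samples (adversarial signal for the generator)."""
    def __init__(self, batch_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(batch_size * batch_size, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, struct):
        return self.net(struct.flatten().unsqueeze(0))

disc = StructureDiscriminator(batch_size=8)
real_feats, fake_feats = torch.randn(8, 256), torch.randn(8, 256)
d_loss = (nn.functional.softplus(-disc(batch_structure(real_feats)))
          + nn.functional.softplus(disc(batch_structure(fake_feats)))).mean()
print(d_loss.item())
```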
2402.17485 Report EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions Linrui Tian, Qi Wang, Bang Zhang, Liefeng Bo In this work, we tackle the challenge of enhancing the realism and expressiveness in talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques that often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. Our method ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations. Experimental results demonstrate that EMO is able to produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism. This paper proposes EMO, an expressive audio-driven portrait-video generation framework that generates portrait videos with expressive facial expressions and head poses from a single reference image and audio. Existing talking head generation methods often lack realism and expressiveness, especially when it comes to capturing subtle facial movements and diverse speaking styles. EMO leverages a direct audio-to-video synthesis approach based on diffusion models. It utilizes audio embeddings for motion and expression, reference image features for identity preservation, and weak control mechanisms for stability and consistency. EMO generates high-quality talking head videos with natural head movements and vivid expressions synchronized with the input audio. The framework can generate videos of any duration and adapt to various portrait styles, including realistic, anime, and 3D. Quantitative evaluations on the HDTF dataset show that EMO outperforms state-of-the-art methods in terms of video quality (FVD), frame quality (FID), identity preservation (F-SIM), and expressiveness (E-FID). The method is more computationally expensive than non-diffusion-based approaches. The lack of explicit control signals for body parts can sometimes lead to artifacts. diffusion models, video generation, talking head, audio-to-video synthesis, expressive facial animation
2402.17412 Report DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Models Shyam Marjit, Harshit Singh, Nityanand Mathur, Sayak Paul, Chia-Mu Yu, Pin-Yu Chen In the realm of subject-driven text-to-image (T2I) generative models, recent developments like DreamBooth and BLIP-Diffusion have led to impressive results yet encounter limitations due to their intensive fine-tuning demands and substantial parameter requirements. While the low-rank adaptation (LoRA) module within DreamBooth offers a reduction in trainable parameters, it introduces a pronounced sensitivity to hyperparameters, leading to a compromise between parameter efficiency and the quality of T2I personalized image synthesis. Addressing these constraints, we introduce DiffuseKronA, a novel Kronecker product-based adaptation module that not only significantly reduces the parameter count by 35% and 99.947% compared to LoRA-DreamBooth and the original DreamBooth, respectively, but also enhances the quality of image synthesis. Crucially, DiffuseKronA mitigates the issue of hyperparameter sensitivity, delivering consistent high-quality generations across a wide range of hyperparameters, thereby diminishing the necessity for extensive fine-tuning. Furthermore, a more controllable decomposition makes DiffuseKronA more interpretable and can even achieve up to a 50% reduction with results comparable to LoRA-DreamBooth. Evaluated against diverse and complex input images and text prompts, DiffuseKronA consistently outperforms existing models, producing diverse images of higher quality with improved fidelity and a more accurate color distribution of objects, all the while upholding exceptional parameter efficiency, thus presenting a substantial advancement in the field of T2I generative modeling. Our project page, consisting of links to the code and pre-trained checkpoints, is available at https://diffusekrona.github.io/. Introduces DiffuseKronA, a novel Kronecker product-based adaptation module for fine-tuning text-to-image diffusion models that significantly reduces parameter count while enhancing image synthesis quality. Addresses limitations of existing methods like DreamBooth and LoRA-DreamBooth, which suffer from high parameter requirements, hyperparameter sensitivity, and a trade-off between parameter efficiency and image quality. Leverages the Kronecker product to capture structured relationships in weight matrices, enabling more efficient and expressive parameter updates compared to low-rank decomposition methods. Reduces trainable parameters by 35% compared to LoRA-DreamBooth and 99.947% compared to DreamBooth. Demonstrates enhanced stability across a wide range of hyperparameters, mitigating the need for extensive fine-tuning. Produces higher-quality images with improved fidelity, more accurate color distribution, and better text alignment compared to existing state-of-the-art methods. Optimal Kronecker factor configuration requires manual exploration. Further research can explore applying DiffuseKronA to other diffusion model architectures beyond SDXL. text-to-image generation, diffusion models, parameter-efficient fine-tuning, kronecker product, image synthesis
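A minimal sketch of a Kronecker-product adapter of the kind described in the DiffuseKronA entry above, applied to a single linear layer; the factor shapes, initialization, and scaling are illustrative assumptions rather than the paper's exact SDXL configuration.

```python
import torch
import torch.nn as nn

class KroneckerAdapter(nn.Module):
    """Parameterizes a weight update dW = A kron B, so a (d_out x d_in) update
    is stored with only a1*a2 + (d_out/a1)*(d_in/a2) trainable parameters."""
    def __init__(self, base: nn.Linear, a1: int, a2: int, scale: float = 1.0):
        super().__init__()
        d_out, d_in = base.weight.shape
        assert d_out % a1 == 0 and d_in % a2 == 0
        self.base = base                                    # frozen pretrained layer
        self.A = nn.Parameter(torch.zeros(a1, a2))          # zero-init => dW starts at 0
        self.B = nn.Parameter(torch.randn(d_out // a1, d_in // a2) * 0.01)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = torch.kron(self.A, self.B) * self.scale   # (d_out, d_in)
        return self.base(x) + x @ delta_w.t()

# Example: adapting a 768x768 projection with two small factors.
layer = KroneckerAdapter(nn.Linear(768, 768), a1=8, a2=8)
y = layer(torch.randn(2, 768))
```

Unlike a rank-r LoRA update B @ A, the Kronecker product yields a full-rank structured update from even smaller factors, which is the source of the parameter savings quoted above.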
2402.17403 Report Sora Generates Videos with Stunning Geometrical Consistency Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, Ming-Ming Cheng The recently developed Sora model [1] has exhibited remarkable capabilities in video generation, sparking intense discussions regarding its ability to simulate real-world phenomena. Despite its growing popularity, there is a lack of established metrics to evaluate its fidelity to real-world physics quantitatively. In this paper, we introduce a new benchmark that assesses the quality of the generated videos based on their adherence to real-world physics principles. We employ a method that transforms the generated videos into 3D models, leveraging the premise that the accuracy of 3D reconstruction is heavily contingent on the video quality. From the perspective of 3D reconstruction, we use the fidelity of the geometric constraints satisfied by the constructed 3D models as a proxy to gauge the extent to which the generated videos conform to real-world physics rules. Project page: https://sora-geometrical-consistency.github.io/ This paper introduces a novel benchmark to evaluate the physical realism, specifically the geometric consistency, of videos generated by the state-of-the-art text-to-video model, Sora. Existing metrics for evaluating video generation models fail to capture the adherence of generated content to real-world physics, a crucial aspect of realism, especially in light of models like Sora demonstrating such capabilities. The authors leverage the principles of 3D reconstruction, using the quality of 3D models generated from the videos as a proxy for their geometric consistency. They employ traditional computer vision techniques like Structure-from-Motion (SfM) and Gaussian Splatting, along with metrics based on feature matching and reprojection errors. Sora-generated videos exhibit significantly higher geometric consistency compared to videos generated by Pika Labs and Gen-2, evidenced by better 3D reconstruction quality and more accurate feature matching. Sora maintains this geometric consistency over longer video durations, indicating its ability to preserve physical and geometric properties over time. Visualizations of point clouds, Gaussian Splatting renderings, and stereo matching results further confirm the superior geometric fidelity of Sora-generated videos. The study primarily focuses on geometric consistency and acknowledges the need to incorporate additional physics-based metrics like texture authenticity and object interaction logic in future work. The use of traditional computer vision techniques for 3D reconstruction might be complemented by exploring deep learning-based methods like NeRF for potentially more robust evaluations. video generation, geometric consistency, text-to-video synthesis, 3d reconstruction, sora
2402.17323 Report SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection Junsu Kim, Hoseong Cho, Jihyeon Kim, Yihalem Yimolal Tiruneh, Seungryul Baek In the field of class incremental learning (CIL), generative replay has become increasingly prominent as a method to mitigate the catastrophic forgetting, alongside the continuous improvements in generative models. However, its application in class incremental object detection (CIOD) has been significantly limited, primarily due to the complexities of scenes involving multiple labels. In this paper, we propose a novel approach called stable diffusion deep generative replay (SDDGR) for CIOD. Our method utilizes a diffusion-based generative model with pre-trained text-to-diffusion networks to generate realistic and diverse synthetic images. SDDGR incorporates an iterative refinement strategy to produce high-quality images encompassing old classes. Additionally, we adopt an L2 knowledge distillation technique to improve the retention of prior knowledge in synthetic images. Furthermore, our approach includes pseudo-labeling for old objects within new task images, preventing misclassification as background elements. Extensive experiments on the COCO 2017 dataset demonstrate that SDDGR significantly outperforms existing algorithms, achieving a new state-of-the-art in various CIOD scenarios. The source code will be made available to the public. This paper introduces Stable Diffusion Deep Generative Replay (SDDGR), a novel method leveraging pre-trained text-to-image diffusion models to generate synthetic images for mitigating catastrophic forgetting in class incremental object detection (CIOD). Existing CIOD methods struggle to retain knowledge of previous classes when learning new ones, hindering their ability to handle complex, multi-label scenes. SDDGR addresses this by utilizing the power of diffusion models for realistic image synthesis and knowledge preservation. SDDGR utilizes a pre-trained text-to-image diffusion model with grounding inputs (classes and bounding boxes) to generate images of past objects. It then uses iterative refinement and L2 knowledge distillation to improve image quality and transfer knowledge to the updated model. Additionally, it employs pseudo-labeling on new task images to prevent misclassification of old objects as background. SDDGR outperforms existing CIOD methods, achieving state-of-the-art accuracy on the COCO dataset in both two-phase and multi-phase learning scenarios. Ablation studies show the significance of each component (refinement, distillation, pseudo-labeling) in boosting performance and reducing forgetting. The method demonstrates robustness to variations in hyperparameters like generated image count and refinement threshold. The method assumes access to a powerful pre-trained diffusion model, which may not always be readily available. Future work could explore different prompt engineering techniques and diffusion model architectures for further performance improvement in CIOD. class incremental learning, object detection, generative replay, diffusion models, catastrophic forgetting
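A minimal sketch of the pseudo-labeling step described in the SDDGR entry above, which keeps confident detections of old classes on new-task images so they are not learned as background; the detector interface, score threshold, and class list are illustrative assumptions.

```python
import torch

def pseudo_label_old_classes(old_detector, image, old_class_ids, score_thresh=0.7):
    """Run the frozen previous-phase detector and keep its confident boxes for
    old classes; these pseudo-labels are merged with the new task's ground truth."""
    with torch.no_grad():
        # Hypothetical interface returning (N, 4) boxes, (N,) labels, (N,) scores.
        boxes, labels, scores = old_detector(image)
    keep = (scores > score_thresh) & torch.isin(labels, torch.tensor(old_class_ids))
    return boxes[keep], labels[keep]
```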
2402.17298 Report ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks Yang Liu, Xiaomin Yu, Gongyu Zhang, Christos Bergeles, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin In this study, we address the challenging task of bridging the modality gap between learning from language and inference for visual tasks, including Visual Question Answering (VQA), Image Captioning (IC) and Visual Entailment (VE). We train models for these tasks in a zero-shot cross-modal transfer setting, a domain where the previous state-of-the-art method relied on the fixed scale noise injection, often compromising the semantic content of the original modality embedding. To combat it, we propose a novel method called Adaptive ranged cosine Similarity injected noise (ArcSin). First, we introduce an innovative adaptive noise scale that effectively generates the textual elements with more variability while preserving the original text feature's integrity. Second, a similarity pool strategy is employed, expanding the domain generalization potential by broadening the overall noise scale. This dual strategy effectively widens the scope of the original domain while safeguarding content integrity. Our empirical results demonstrate that these models closely rival those trained on images in terms of performance. Specifically, our method exhibits substantial improvements over the previous state-of-the-art, achieving gains of 1.9 and 1.1 CIDEr points in S-Cap and M-Cap, respectively. Additionally, we observe increases of 1.5 percentage points (pp), 1.4 pp, and 1.4 pp in accuracy for VQA, VQA-E, and VE, respectively, pushing the boundaries of what is achievable within the constraints of image-trained model benchmarks. The code will be released. This paper introduces ArcSin, a novel adaptive noise injection technique for language-driven visual tasks that effectively bridges the modality gap between text and image data. Bridging the modality gap is crucial for zero-shot cross-modal transfer learning in vision-language tasks, enabling models to understand and interpret visual information using only textual data, which is abundant and cost-effective to acquire. ArcSin utilizes adaptive ranged noise injection based on cosine similarity and feature magnitude to expand the text feature domain while preserving semantic content. It also employs a similarity pool strategy to further broaden the noise scale and enhance domain generalization. ArcSin outperforms previous state-of-the-art methods in zero-shot cross-modal transfer learning for various vision-language tasks, including image captioning, visual question answering, and visual entailment. The adaptive noise injection technique proves more effective than fixed-scale noise injection, demonstrating the importance of content preservation during domain generalization. The performance improvement is consistent across various contrastive and language backbone models, highlighting the robustness and generalizability of the proposed method. ArcSin may struggle with disentangling and interpreting intricate visual details or differentiating between similar foreground and background elements. Future work will focus on enhancing the comprehension of complex visual features through exclusively text-based learning. cross-modal transfer learning, vision-language tasks, modality gap, noise injection, zero-shot learning
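A minimal sketch of similarity-bounded noise injection in the spirit of the ArcSin entry above: the noise magnitude is capped so the perturbed text embedding keeps at least a chosen cosine similarity with the original. The orthogonalization trick and sampling rule are a simplified stand-in, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def inject_bounded_noise(text_feat: torch.Tensor, min_cos: float = 0.9) -> torch.Tensor:
    """Add Gaussian noise orthogonal to each embedding, scaled so the result
    keeps cosine similarity >= min_cos with the original (content-preserving
    domain expansion)."""
    u = F.normalize(text_feat, dim=-1)
    noise = torch.randn_like(text_feat)
    noise = noise - (noise * u).sum(-1, keepdim=True) * u   # orthogonal component
    # For orthogonal noise, cos = 1 / sqrt(1 + r^2) with r = ||noise|| / ||feat||,
    # so the largest admissible norm ratio is tan(arccos(min_cos)).
    max_ratio = torch.tan(torch.arccos(torch.tensor(min_cos)))
    scale = torch.rand(text_feat.shape[:-1] + (1,)) * max_ratio
    return text_feat + scale * text_feat.norm(dim=-1, keepdim=True) \
                     * F.normalize(noise, dim=-1)

feats = torch.randn(4, 512)
print(F.cosine_similarity(feats, inject_bounded_noise(feats), dim=-1))  # all >= 0.9
```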
2402.17292 Report DivAvatar: Diverse 3D Avatar Generation with a Single Prompt Weijing Tao, Biwen Lei, Kunhao Liu, Shijian Lu, Miaomiao Cui, Xuansong Xie, Chunyan Miao Text-to-Avatar generation has recently made significant strides due to advancements in diffusion models. However, most existing work remains constrained by limited diversity, producing avatars with subtle differences in appearance for a given text prompt. We design DivAvatar, a novel framework that generates diverse avatars, empowering 3D creatives with a multitude of distinct and richly varied 3D avatars from a single text prompt. Different from most existing work that exploits scene-specific 3D representations such as NeRF, DivAvatar finetunes a 3D generative model (i.e., EVA3D), allowing diverse avatar generation from simple noise sampling at inference time. DivAvatar has two key designs that help achieve generation diversity and visual quality. The first is a noise sampling technique during the training phase, which is critical in generating diverse appearances. The second is a semantic-aware zoom mechanism and a novel depth loss, the former producing appearances of high textual fidelity by separate fine-tuning of specific body parts and the latter improving geometry quality greatly by smoothing the generated mesh in the feature space. Extensive experiments show that DivAvatar is highly versatile in generating avatars of diverse appearances. DivAvatar: a novel framework for generating diverse 3D avatars from a single text prompt. Most existing text-to-avatar methods lack diversity, producing avatars with subtle differences. DivAvatar addresses this by enabling the generation of a variety of distinct and realistic avatars, crucial for inclusivity and efficiency in virtual environments. DivAvatar finetunes a pretrained 3D generative model (EVA3D) with a novel noise sampling technique during training, a semantic-aware zoom mechanism for textual fidelity, and a feature-based depth loss for geometry refinement. It leverages the inherent diversity of GANs and incorporates a diffusion prior (SDS) for text-guided generation. DivAvatar generates significantly more diverse avatars compared to existing methods like Stable Dreamfusion and AvatarCraft. The noise sampling technique is crucial for achieving diversity, while the semantic zoom and depth loss improve texture fidelity and geometry quality respectively. The method allows for flexible control over diversity levels by adjusting the probability of random noise sampling during training. Generated textures lack photorealistic details, requiring additional mesh optimization. Limited diversity observed for specific uniforms, possibly due to the training dataset bias of the underlying generative model (EVA3D). text-to-3d, avatar generation, generative models, diversity, diffusion models
2402.17245 Report Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, Suhail Doshi In this work, we share three insights for achieving state-of-the-art aesthetic quality in text-to-image generative models. We focus on three critical aspects for model improvement: enhancing color and contrast, improving generation across multiple aspect ratios, and improving human-centric fine details. First, we delve into the significance of the noise schedule in training a diffusion model, demonstrating its profound impact on realism and visual fidelity. Second, we address the challenge of accommodating various aspect ratios in image generation, emphasizing the importance of preparing a balanced bucketed dataset. Lastly, we investigate the crucial role of aligning model outputs with human preferences, ensuring that generated images resonate with human perceptual expectations. Through extensive analysis and experiments, Playground v2.5 demonstrates state-of-the-art performance in terms of aesthetic quality under various conditions and aspect ratios, outperforming both widely-used open-source models like SDXL and Playground v2, and closed-source commercial systems such as DALLE 3 and Midjourney v5.2. Our model is open-source, and we hope the development of Playground v2.5 provides valuable guidelines for researchers aiming to elevate the aesthetic quality of diffusion-based image generation models. Presents Playground v2.5, a text-to-image model with state-of-the-art aesthetic quality achieved by focusing on color and contrast enhancement, multi-aspect ratio generation, and human-centric detail refinement. Addresses the limitations of existing models in producing visually compelling images that align with human preferences, crucial for real-world applications and user satisfaction. Employs the EDM framework for enhanced noise scheduling and color vibrancy, implements a balanced dataset for multi-aspect ratio generation, and utilizes a human-in-the-loop approach for aligning outputs with human preferences. Outperforms state-of-the-art models, including Midjourney 5.2 and DALL·E 3, in aesthetic quality based on user studies. Generates high-quality images across various aspect ratios, overcoming limitations of previous models. Exhibits superior performance in rendering human-centric details, such as facial features and overall lighting. Future work will focus on improving text-to-image alignment and model variation capabilities. Exploration of new architectures for enhanced image generation and editing. text-to-image generation, diffusion models, aesthetic quality, human preference alignment, multi-aspect ratio generation
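A minimal sketch of the EDM-style noise schedule credited in the Playground v2.5 entry above for improved color and contrast: per-sample noise levels are drawn from a log-normal distribution and the denoising loss is reweighted accordingly. The constants are the Karras et al. (2022) defaults and are assumptions, not Playground's tuned values.

```python
import torch

def sample_edm_sigmas(batch: int, p_mean: float = -1.2, p_std: float = 1.2):
    """Draw per-sample noise levels sigma ~ LogNormal(p_mean, p_std)."""
    return torch.exp(p_mean + p_std * torch.randn(batch))

def edm_loss_weight(sigma: torch.Tensor, sigma_data: float = 0.5):
    """EDM weighting lambda(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2."""
    return (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2

sigmas = sample_edm_sigmas(4)
print(sigmas, edm_loss_weight(sigmas))
```

The point of the schedule is that very high noise levels (where only global color and contrast survive) are actually visited during training, unlike the truncated schedules the report identifies as a cause of washed-out samples.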
2402.17214 Report CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization Hao-Yang Peng, Jia-Peng Zhang, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu In the field of digital content creation, generating high-quality 3D characters from single images is challenging, especially given the complexities of various body poses and the issues of self-occlusion and pose ambiguity. In this paper, we present CharacterGen, a framework developed to efficiently generate 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model. This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of our approach, facilitating the creation of detailed 3D models from multi-view images. We also adopt a texture-back-projection strategy to produce high-quality texture maps. Additionally, we have curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate our model. Our approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation. Presents CharacterGen, an efficient framework for generating high-quality 3D character models in a canonical pose from single images, overcoming challenges posed by diverse body poses and self-occlusion. Generating high-quality 3D characters from single images is crucial for various applications, but existing methods struggle with diverse poses and self-occlusion. This work offers a solution to these problems and streamlines the creation process. CharacterGen uses a two-stage approach: 1) an image-conditioned multi-view diffusion model to canonicalize input poses to a standard 'A-pose' while generating consistent multi-view images, and 2) a transformer-based sparse-view reconstruction model to create the 3D character from these images. Generates high-quality 3D characters in a canonical pose, suitable for rigging and animation. Successfully addresses the challenges of self-occlusion and pose ambiguity in character generation. Outperforms existing methods in terms of generation quality and speed, as evidenced by quantitative and qualitative comparisons. May not perfectly capture information from extreme poses or uncommon viewpoints. Could further enhance texture quality by incorporating non-photorealistic rendering techniques. 3d character generation, multi-view diffusion model, pose canonicalization, sparse-view reconstruction, texture refinement
2402.17177 Report Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and shows potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation. Sora is a text-to-video generative AI model that can produce high-quality videos up to one minute long from text instructions. Sora is a breakthrough in AI-powered vision generation with the potential to revolutionize various fields, including film-making, education, and accessibility. Sora employs a pre-trained diffusion transformer model trained on a massive dataset of text-video pairs. It utilizes spacetime latent patches to compress and process video data efficiently. It also leverages techniques like caption improvement for enhanced instruction following and prompt engineering for guiding video generation. Sora can generate high-quality videos of up to 1 minute in length from text prompts, including complex scenes with multiple characters and intricate backgrounds. It demonstrates emergent abilities in simulating aspects of the physical world and digital environments without explicit 3D modeling. Sora allows for flexible video generation, accommodating variable durations, resolutions, and aspect ratios. Challenges remain in accurately simulating complex physical interactions and maintaining spatial and temporal consistency in intricate scenes. Sora currently has a limitation in generating videos longer than one minute, restricting its application in scenarios requiring extended content. text-to-video generation, ai-powered vision, diffusion models, transformer models, generative ai
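A minimal sketch of the spacetime-latent-patch idea mentioned in the review entry above: a latent video tensor is cut into small space-time blocks that become transformer tokens. The patch sizes and latent shape are illustrative assumptions; Sora's actual configuration has not been disclosed.

```python
import torch

def spacetime_patchify(latent: torch.Tensor, pt: int = 2, ph: int = 2, pw: int = 2):
    """latent: (C, T, H, W) -> sequence of flattened spacetime patches with
    shape (num_patches, C * pt * ph * pw), one token per space-time block."""
    c, t, h, w = latent.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    x = latent.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)        # (T', H', W', C, pt, ph, pw)
    return x.reshape(-1, c * pt * ph * pw)

tokens = spacetime_patchify(torch.randn(4, 16, 32, 32))
print(tokens.shape)  # torch.Size([2048, 32]) -> 8 * 16 * 16 patches of 32 values
```

Because the token count simply follows the latent's duration and spatial extent, the same tokenization accommodates the variable durations, resolutions, and aspect ratios highlighted above.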
2402.17139 Report Video as the New Language for Real-World Decision Making Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, Dale Schuurmans Both text and video data are abundant on the internet and support large-scale self-supervised learning through next token or frame prediction. However, they have not been equally leveraged: language models have had significant real-world impact, whereas video generation has remained largely limited to media entertainment. Yet video data captures important information about the physical world that is difficult to express in language. To address this gap, we discuss an under-appreciated opportunity to extend video generation to solve tasks in the real world. We observe how, akin to language, video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks. Moreover, we demonstrate how, like language models, video generation can serve as planners, agents, compute engines, and environment simulators through techniques such as in-context learning, planning and reinforcement learning. We identify major impact opportunities in domains such as robotics, self-driving, and science, supported by recent work that demonstrates how such advanced capabilities in video generation are plausibly within reach. Lastly, we identify key challenges in video generation that mitigate progress. Addressing these challenges will enable video generation models to demonstrate unique value alongside language models in a wider array of AI applications. This paper argues that video generation will be as impactful for the physical world as language modeling is for the digital world, serving as planners, agents, compute engines, and simulators. Video captures crucial physical world information difficult to express in text, offering potential benefits to robotics, self-driving, and science. The authors analyze how video generation, like language modeling, provides a unified representation and task interface, enabling techniques like in-context learning, planning, and reinforcement learning. Video generation can solve diverse vision tasks, answer questions with detailed actions, and exhibit visual reasoning capabilities. Action-conditioned video generation can simulate complex game environments and generate novel ones from image prompts. Video generation serves as a simulator for robotics, self-driving (with domain randomization), and scientific processes, enabling policy optimization and mitigating hardware limitations. Current video datasets have limited coverage and lack sufficient annotations. Lack of a single best model architecture for video generation hinders progress, requiring exploration of hybrid approaches. video generation, language modeling, embodied ai, simulation, real-world applications
2402.17128 Report OSCaR: Object State Captioning and State Change Representation Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR. This paper introduces a new task and benchmark, OSCaR, for understanding object states and their changes using natural language, leveraging a GPT-4V-assisted data generation pipeline. Understanding object state change is crucial for AI agents to reason, learn, and interact with the physical world, bridging the gap between human and machine perception. The authors collected egocentric videos from EPIC-KITCHENS and Ego4D, used GPT-4V and human annotations to generate captions, questions, and conversations about object states, and fine-tuned LLaVA on this data. OSCaR outperforms previous state-of-the-art models in text generation metrics (BLEU, ROUGE) for describing object states. Human evaluation shows that OSCaR achieves near-parity with GPT-4V in caption quality. The benchmark includes open-world evaluations on objects unseen during training, highlighting the challenge of generalizability in object state understanding. The study lacks audio integration, limiting its applicability to scenarios where sound is crucial. Tracking long-term state transitions remains a challenge due to the limitations of current models in capturing long-term information. object state understanding, egocentric vision, multimodal large language models, gpt-4v, benchmarking
2402.17113 Report Transparent Image Layer Diffusion using Latent Transparency Lvmin Zhang, Maneesh Agrawala We present LayerDiffuse, an approach enabling large-scale pretrained latent diffusion models to generate transparent images. The method allows generation of single transparent images or of multiple transparent layers. The method learns a "latent transparency" that encodes alpha channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it with the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open source image generators, or be adapted to various conditional control systems to achieve applications like foreground/background-conditioned layer generation, joint layer generation, structural control of layer contents, etc. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report the quality of our generated transparent images is comparable to real commercial transparent assets like Adobe Stock. Presents LayerDiffuse, enabling large-scale pretrained latent diffusion models to generate single or multiple transparent image layers. Addresses the lack of research in layered/transparent content generation despite its high demand in visual content editing. Encodes transparency as a "latent transparency" offset in the latent space of a pretrained model, preserving its quality. Trains with 1M transparent image layer pairs collected via human-in-the-loop. Generates high-quality transparent images with diverse content and effects (glass, hair, fire, etc.). Produces harmonious compositions of multiple layers with consistent illumination and geometry. Integrates with control models (e.g., ControlNet) for enhanced functionality (e.g., structure control). Trade-off exists between generating "clean" transparent elements and achieving "harmonious blending". Generating backgrounds for clean transparent elements without specific illumination or shadow can be challenging. transparent image generation, layered image generation, latent diffusion models, stable diffusion, image synthesis
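A minimal sketch of the "latent transparency" offset idea described in the LayerDiffuse entry above: alpha information is folded into a small additive offset on a frozen VAE latent, and an auxiliary decoder recovers the alpha channel. The module shapes here are hypothetical placeholders, not the released LayerDiffuse networks.

```python
import torch
import torch.nn as nn

class LatentTransparency(nn.Module):
    def __init__(self, latent_ch: int = 4):
        super().__init__()
        # Encodes an RGBA image (resized to latent resolution) into a latent-sized offset.
        self.offset_encoder = nn.Conv2d(4, latent_ch, 3, padding=1)
        # Decodes the adjusted latent back into an alpha channel.
        self.alpha_decoder = nn.Conv2d(latent_ch, 1, 3, padding=1)

    def add_transparency(self, latent: torch.Tensor, rgba_small: torch.Tensor) -> torch.Tensor:
        """latent: frozen-VAE latent (B, C, h, w); rgba_small: RGBA at latent resolution.
        The offset is kept small so the pretrained latent distribution is barely disturbed."""
        return latent + self.offset_encoder(rgba_small)

    def extract_alpha(self, adjusted_latent: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.alpha_decoder(adjusted_latent))

lt = LatentTransparency()
z = torch.randn(1, 4, 64, 64)
rgba = torch.rand(1, 4, 64, 64)
alpha = lt.extract_alpha(lt.add_transparency(z, rgba))
```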
2402.16991 Report A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data Antonio Sclocchi, Alessandro Favero, Matthieu Wyart Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organised in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underlying structure. We study this phenomenon in a hierarchical generative model of data. We find that the backward diffusion process acting after a time $t$ is governed by a phase transition at some threshold time, where the probability of reconstructing high-level features, like the class of an image, suddenly drops. Instead, the reconstruction of low-level features, such as specific details of an image, evolves smoothly across the whole diffusion process. This result implies that at times beyond the transition, the class has changed but the generated sample may still be composed of low-level elements of the initial image. We validate these theoretical insights through numerical experiments on class-unconditional ImageNet diffusion models. Our analysis characterises the relationship between time and scale in diffusion models and puts forward generative models as powerful tools to model combinatorial data properties. This paper studies how reversing time in denoising diffusion models reveals the hierarchical and compositional nature of data, particularly in image generation. Understanding this interplay between time and feature hierarchy in diffusion models can shed light on their remarkable success, including generalization abilities and data efficiency. The authors use a combination of theoretical analysis with a hierarchical generative model (Random Hierarchy Model) and empirical experiments on ImageNet. They analyze the denoising dynamics, specifically the probability of reconstructing features at different hierarchical levels as a function of time and noise. A phase transition exists in the denoising process where the probability of maintaining the original image class sharply drops at a specific time/noise level. Low-level features of an image can change even at early denoising times, while the class remains stable. Beyond the class transition, the model may still utilize low-level features from the original image to compose a new image belonging to a different class. The theoretical analysis assumes a simplified noise model and mean-field approximation. Future work can explore these phenomena in other data domains like text, using diffusion language models. diffusion models, generative models, hierarchical data, compositionality, phase transition
2402.16936 Report Disentangled 3D Scene Generation with Layout Learning Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A. Efros, Aleksander Holynski We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch - each representing its own object - along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page at https://dave.ml/layoutlearning/ This paper introduces a novel method, called layout learning, for generating 3D scenes that are disentangled into their component objects using pretrained text-to-image models. Disentangling objects in 3D scenes is crucial for enabling object-level manipulation and editing, facilitating more controllable and interactive 3D content creation. The method optimizes multiple NeRFs, each representing a different object, along with a set of layouts defining their spatial arrangements. These NeRFs are jointly trained to produce realistic scenes evaluated by a pretrained text-to-image model. Layout learning successfully generates 3D scenes where individual NeRFs correspond to distinct objects, enabling object-level manipulation. Quantitative evaluation using CLIP scores demonstrates that the generated scenes exhibit high visual quality and object disentanglement, nearing supervised per-object rendering performance. The method facilitates several applications, including conditional scene generation around a given object, arranging 3D assets into semantically valid configurations, and decomposing existing scenes into objects. The model may encounter difficulties with object segmentation, occasionally grouping objects that always appear together or struggling with scenes containing many small objects. Despite measures to ensure diversity, learned layouts can converge to overly similar configurations, limiting the variability of object arrangements. text-to-3d, disentanglement, unsupervised learning, object discovery, 3d scene generation
2402.16889 Report Generative Models are Self-Watermarked: Declaring Model Authentication through Re-Generation Aditya Desu, Xuanli He, Qiongkai Xu, Wei Lu As machine- and AI-generated content proliferates, protecting the intellectual property of generative models has become imperative, yet verifying data ownership poses formidable challenges, particularly in cases of unauthorized reuse of generated data. The challenge of verifying data ownership is further amplified by using Machine Learning as a Service (MLaaS), which often functions as a black-box system. Our work is dedicated to detecting data reuse from even an individual sample. Traditionally, watermarking has been leveraged to detect AI-generated content. However, unlike watermarking techniques that embed additional information as triggers into models or generated content, potentially compromising output quality, our approach identifies latent fingerprints inherently present within the outputs through re-generation. We propose an explainable verification procedure that attributes data ownership through re-generation, and further amplifies these fingerprints in the generative models through iterative data re-generation. This methodology is theoretically grounded and demonstrates viability and robustness using recent advanced text and image generative models. Our methodology is significant as it goes beyond protecting the intellectual property of APIs and addresses important issues such as the spread of misinformation and academic misconduct. It provides a useful tool to ensure the integrity of sources and authorship, expanding its application in different scenarios where authenticity and ownership verification are essential. The paper proposes a novel approach for verifying data ownership in generative models, particularly in black-box settings, by leveraging inherent model fingerprints through re-generation. The increasing use of generative AI models raises concerns about unauthorized data reuse and plagiarism. Existing watermarking techniques can impact output quality, and classification-based methods may lack robustness. The proposed method addresses these challenges by utilizing the unique characteristics of generative models for verification. The methodology involves two stages: Generation and Verification. The Generation stage uses iterative re-generation to amplify model fingerprints in outputs. The Verification stage compares the distance between the original data and re-generated versions using authentic and contrasting models. This is grounded in fixed-point theory, ensuring convergence and distinct fingerprint separation. Iterative re-generation effectively enhances model fingerprints, leading to converging distances between consecutive re-generations. Authentic models consistently exhibit smaller re-generation distances compared to contrasting models, facilitating ownership verification. The method achieves high precision and recall in verifying data ownership across various text and image generation models and tasks. Robustness against sophisticated paraphrasing attacks is limited. The effectiveness may be compromised by significant alterations to the generated content. generative models, data ownership verification, model fingerprints, re-generation, intellectual property protection
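A minimal sketch of verification by re-generation as described in the entry above: each candidate model repeatedly re-generates the sample, and the model that behaves most like a fixed point on it (smallest drift) is attributed ownership. The `regenerate` round trip and the L2 distance are generic placeholders for the paper's model-specific pipelines.

```python
import torch

def regeneration_distance(sample, model, regenerate, k: int = 3):
    """Apply k rounds of re-generation and return the distance between the
    original sample and its final re-generated version."""
    x = sample
    for _ in range(k):
        x = regenerate(model, x)   # e.g. encode/decode or caption -> generate round trip
    return torch.dist(sample, x).item()

def verify_owner(sample, candidate_models, regenerate):
    """The authentic model is expected to act close to a fixed point on its own
    outputs, so it should yield the smallest re-generation distance."""
    dists = {name: regeneration_distance(sample, m, regenerate)
             for name, m in candidate_models.items()}
    return min(dists, key=dists.get), dists
```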
2402.16843 Report Multi-LoRA Composition for Image Generation Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, Weizhu Chen Low-Rank Adaptation (LoRA) is extensively utilized in text-to-image models for the accurate rendition of specific elements like distinct characters or unique styles in generated images. Nonetheless, existing methods face challenges in effectively composing multiple LoRAs, especially as the number of LoRAs to be integrated grows, thus hindering the creation of complex imagery. In this paper, we study multi-LoRA composition through a decoding-centric perspective. We present two training-free methods: LoRA Switch, which alternates between different LoRAs at each denoising step, and LoRA Composite, which simultaneously incorporates all LoRAs to guide more cohesive image synthesis. To evaluate the proposed approaches, we establish ComposLoRA, a new comprehensive testbed as part of this research. It features a diverse range of LoRA categories with 480 composition sets. Utilizing an evaluation framework based on GPT-4V, our findings demonstrate a clear improvement in performance with our methods over the prevalent baseline, particularly evident when increasing the number of LoRAs in a composition. This paper introduces two novel training-free methods, LoRA Switch and LoRA Composite, for composing multiple Low-Rank Adaptations (LoRAs) in text-to-image generation, improving the accuracy and quality of composing multiple user-specified elements in generated images. Existing LoRA composition methods struggle with effectively integrating multiple elements, especially as the number of LoRAs increases, limiting the controllability and complexity of generated images. This paper addresses these limitations by focusing on the denoising process of diffusion models. LoRA Switch alternates between activating different LoRAs at each denoising step, ensuring each element receives dedicated attention. LoRA Composite leverages all LoRAs simultaneously, drawing inspiration from classifier-free guidance to provide balanced guidance throughout image generation. Both LoRA Switch and LoRA Composite outperform the conventional LoRA merging approach, particularly when composing a higher number of LoRAs. LoRA Switch demonstrates superior performance in composition quality, while LoRA Composite excels in overall image quality. The study reveals a style dependency, with LoRA Switch excelling in realistic styles and LoRA Composite proving more effective in anime styles. Composable image generation, particularly with multiple elements, remains challenging despite the improvements offered by the proposed methods. The evaluation using GPT-4V, while effective, reveals a positional bias that necessitates averaging scores across different input orders to ensure fairness. image generation, composable image generation, low-rank adaptation (lora), diffusion models, multimodal evaluation
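A minimal sketch of the two decoding-centric strategies in the entry above. The `set_active_adapters` helper and the denoiser call signature are hypothetical stand-ins for however the host pipeline toggles LoRAs; only the control flow is the point.

```python
import torch

def lora_switch_step(unet, x_t, t, cond, loras, step_idx):
    """LoRA Switch: exactly one adapter is active per denoising step,
    cycling through the list across steps."""
    unet.set_active_adapters([loras[step_idx % len(loras)]])  # hypothetical helper
    return unet(x_t, t, cond)                                 # hypothetical signature

def lora_composite_step(unet, x_t, t, cond, uncond, loras, scale=7.5):
    """LoRA Composite: compute a classifier-free-guidance direction with each
    adapter active individually, then average them into one guided estimate."""
    guidance = torch.zeros_like(x_t)
    for adapter in loras:
        unet.set_active_adapters([adapter])                   # hypothetical helper
        eps_c = unet(x_t, t, cond)
        eps_u = unet(x_t, t, uncond)
        guidance += eps_u + scale * (eps_c - eps_u)
    return guidance / len(loras)
```

Neither strategy merges LoRA weights, which is why both remain training-free and scale more gracefully as the number of composed elements grows.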
2402.16828 Report Training Neural Networks from Scratch with Parallel Low-Rank Adapters Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although methods like low-rank adaptation (LoRA) have reduced the cost of model finetuning, its application in model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. Our approach includes extensive experimentation on vision transformers using various vision datasets, demonstrating that LTE is competitive with standard pre-training. This paper proposes LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm, to enable the pre-training of large neural networks from scratch using low-rank adapters, addressing the limitations of standard LoRA in this context. Training large models is challenging due to hardware constraints (compute, memory, communication). LTE aims to mitigate these issues by utilizing parallel low-rank updates, making large model pre-training feasible on low-memory devices. LTE employs multiple low-rank adapter heads trained in parallel on different data shards with infrequent synchronization. It leverages low-precision storage for main weights and efficient communication of only the LoRA parameters. LTE achieves comparable performance to standard pre-training across various vision tasks and datasets. Infrequent merging of LoRA heads is crucial for performance, striking a balance between accuracy and communication cost. Parallel LTE heads explore diverse subspaces, contributing to its effectiveness. LTE currently exhibits slower convergence in the final stages of training on ImageNet-1K compared to standard training. Further optimization is needed to determine the ideal number of ranks/heads and explore heterogeneous LoRA parameterization. Future work will focus on smarter merging strategies for improved efficiency with larger local steps. model pre-training, low-rank adaptation, parallel training, low-memory devices, federated learning
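A minimal sketch of LTE's merge step: each worker trains its own low-rank (A, B) pair on its data shard for some local steps, then the averaged products are folded into the main weights and the deltas are reset. Shapes, rank, and the reset convention are illustrative assumptions.

```python
import torch

def lte_merge(W: torch.Tensor, heads):
    """Fold the averaged low-rank updates B @ A of all parallel heads into the
    frozen main weight W, then zero the B factors so each worker resumes local
    training from the freshly merged weights."""
    with torch.no_grad():
        W += torch.stack([B @ A for A, B in heads]).mean(0)
        for _, B in heads:
            B.zero_()
    return W

# Example: 4 workers, rank-8 heads on a 256x256 weight. Between merges each
# worker optimizes only its own (A, B) on its own shard, so only the small
# LoRA factors ever need to be communicated.
d, r, n_heads = 256, 8, 4
W = torch.zeros(d, d)
heads = [(torch.randn(r, d) * 0.01, torch.randn(d, r) * 0.01) for _ in range(n_heads)]
W = lte_merge(W, heads)
```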
2402.16806 Report Multi-Human Mesh Recovery with Transformers Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy Conventional approaches to human mesh recovery predominantly employ a region-based strategy. This involves initially cropping out a human-centered region as a preprocessing step, with subsequent modeling focused on this zoomed-in image. While effective for single figures, this pipeline poses challenges when dealing with images featuring multiple individuals, as different people are processed separately, often leading to inaccuracies in relative positioning. Despite the advantages of adopting a whole-image-based approach to address this limitation, early efforts in this direction have fallen short in performance compared to recent region-based methods. In this work, we advocate for this under-explored area of modeling all people at once, emphasizing its potential for improved accuracy in multi-person scenarios through considering all individuals simultaneously and leveraging the overall context and interactions. We introduce a new model with a streamlined transformer-based design, featuring three critical design choices: multi-scale feature incorporation, focused attention mechanisms, and relative joint supervision. Our proposed model demonstrates a significant performance improvement, surpassing state-of-the-art region-based and whole-image-based methods on various benchmarks involving multiple individuals. This paper introduces a novel whole-image-based human mesh recovery (HMR) method that addresses limitations of conventional region-based approaches in multi-person scenarios. Region-based HMR methods, while effective for single figures, struggle to accurately capture relative positioning in multi-person images due to independent processing. Whole-image-based methods offer a solution by processing all individuals simultaneously, but have lagged behind in performance. The proposed model employs a streamlined transformer-based architecture with multi-scale feature incorporation, focused attention mechanisms using deformable attention, and a novel relative joint loss function to supervise relative joint locations. The method outperforms state-of-the-art region-based and whole-image-based methods on multiple multi-person benchmarks (CHI3D, Hi4D, BEDLAM). Significant improvements are observed in the joint PA-MPJPE metric, highlighting its superior ability to model relative human positions. Ablation studies confirm the importance of multi-scale features, focused attention, and relative joint loss for achieving superior performance. The method, while showing promise, still faces limitations in handling mesh penetration during close interactions, a common challenge for regression-based methods. Future work could explore incorporating contact optimization strategies to further enhance the fidelity of reconstructions in such scenarios. human mesh recovery, multi-person pose estimation, whole-image-based modeling, transformer networks, deformable attention
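A minimal sketch of a relative-joint supervision term in the spirit of the entry above: it penalizes errors in the pairwise offsets between people's root joints, information a per-person crop pipeline never sees. The root-joint choice and plain L2 penalty are illustrative assumptions.

```python
import torch

def relative_joint_loss(pred_joints: torch.Tensor, gt_joints: torch.Tensor,
                        root_idx: int = 0) -> torch.Tensor:
    """pred_joints, gt_joints: (P, J, 3) 3D joints for P people in one image.
    Compares all pairwise root-to-root offsets between prediction and ground truth."""
    pred_root = pred_joints[:, root_idx]                      # (P, 3)
    gt_root = gt_joints[:, root_idx]
    pred_rel = pred_root[:, None, :] - pred_root[None, :, :]  # (P, P, 3)
    gt_rel = gt_root[:, None, :] - gt_root[None, :, :]
    return (pred_rel - gt_rel).norm(dim=-1).mean()

loss = relative_joint_loss(torch.randn(3, 24, 3), torch.randn(3, 24, 3))
```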
2402.16641 Report Towards Open-ended Visual Quality Comparison Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, Xiaohong Liu, Guangtao Zhai, Shiqi Wang, Weisi Lin Comparative settings (e.g. pairwise choice, listwise ranking) have been adopted by a wide range of subjective studies for image quality assessment (IQA), as they inherently standardize the evaluation criteria across different observers and offer more clear-cut responses. In this work, we extend the edge of emerging large multi-modality models (LMMs) to further advance visual quality comparison into open-ended settings, that 1) can respond to open-range questions on quality comparison; 2) can provide detailed reasonings beyond direct answers. To this end, we propose Co-Instruct. To train this first-of-its-kind open-source open-ended visual quality comparer, we collect the Co-Instruct-562K dataset, from two sources: (a) LLM-merged single image quality description, (b) GPT-4V "teacher" responses on unlabeled data. Furthermore, to better evaluate this setting, we propose MICBench, the first benchmark on multi-image comparison for LMMs. We demonstrate that Co-Instruct not only achieves on average 30% higher accuracy than state-of-the-art open-source LMMs, but also outperforms GPT-4V (its teacher), on both existing related benchmarks and the proposed MICBench. Our model is published at https://huggingface.co/q-future/co-instruct. This paper introduces Co-Instruct, a novel instruction-tuning dataset designed for open-ended visual quality comparison, along with Co-Instruct-Comparer, an LMM model trained on this dataset, and MICBench, a benchmark for evaluating multi-image quality comparison in LMMs. This work addresses the limitations of existing IQA methods that struggle with open-ended questions and detailed reasoning, particularly in comparative settings. It leverages the strengths of LMMs to provide more human-like and informative quality assessments. The authors construct the Co-Instruct-562K dataset by merging single-image quality descriptions using LLMs (Merge2Compare) and leveraging GPT-4V responses on unlabeled images (Teach2Compare). They propose Co-Instruct-Comparer, an LMM with reduced visual tokens and an image-text interleaved format, trained on Co-Instruct-562K. Finally, they introduce MICBench, a benchmark with 2,000 MCQs, to evaluate multi-image quality comparison. Co-Instruct-Comparer surpasses all existing LMMs, including GPT-4V, on various visual quality comparison benchmarks. The model achieves human-level accuracy on Q-Bench-PAIR-A1 and even outperforms non-expert humans on specific settings. Analysis reveals that Co-Instruct-Comparer's detailed reasoning capabilities match GPT-4V while significantly exceeding other LMMs. GPT evaluation, used in Q-Bench-PAIR-A2, might be biased towards longer text outputs, potentially underestimating Co-Instruct-Comparer's performance. Future research could explore better evaluation metrics and datasets for fine-grained comparisons, particularly for highly similar image pairs. large multi-modality models (lmms), visual quality assessment, visual quality comparison, visual question answering, benchmarking
2402.16627 Report Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, Bin Cui Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations. Our code is available at https://github.com/YangLing0818/ContextDiff This paper introduces ContextDiff, a novel conditional diffusion model designed to enhance text-guided visual generation and editing by incorporating cross-modal context into both forward and reverse processes. Existing text-guided visual diffusion models primarily incorporate text-visual relationships only in the reverse process, limiting their ability to precisely convey textual semantics in generated visuals. ContextDiff aims to address this limitation by leveraging cross-modal context for improved semantic alignment. ContextDiff utilizes a relational network (e.g., cross-attention) to model cross-modal interactions between text and visual data. This context is then propagated to all timesteps of both the forward and reverse diffusion processes, acting as a context-aware trajectory adapter. The method is generalized and theoretically derived for both DDPMs and DDIMs, benefiting both cross-modal generation and editing tasks. ContextDiff achieves state-of-the-art performance in text-to-image generation, outperforming dominant diffusion models like Stable Diffusion, DALL-E 2, and Imagen. In text-to-video editing, ContextDiff surpasses existing methods in textual alignment and temporal consistency, as evidenced by quantitative metrics and user studies. The context-aware adapter in ContextDiff generalizes well to other text-guided video diffusion models, consistently improving their generation quality. The theoretical analysis focuses on optimal estimation and doesn't fully address convergence behavior due to the complexity of neural network optimization. Future work could explore incorporating more sophisticated relational networks or alternative cross-modal interaction modeling techniques. diffusion models, text-to-image generation, text-to-video editing, cross-modal learning, contextualized diffusion
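A minimal sketch of a cross-modal context adapter in the spirit of the ContextDiff entry above: cross-attention between noisy-sample tokens and text tokens produces an offset that can be added, suitably scaled by the noise schedule, to both the forward and reverse trajectories. The dimensions and attention layout are assumptions, not the paper's exact relational network.

```python
import torch
import torch.nn as nn

class ContextAdapter(nn.Module):
    """Maps (sample tokens, text tokens) to a context offset of the sample's shape."""
    def __init__(self, dim: int = 320, text_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, x_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        """x_tokens: (B, N, dim) flattened noisy-sample tokens;
        text_tokens: (B, L, text_dim) encoder outputs of the text condition."""
        offset, _ = self.attn(x_tokens, text_tokens, text_tokens)
        return offset

adapter = ContextAdapter()
offset = adapter(torch.randn(2, 64, 320), torch.randn(2, 77, 768))
# The same offset (propagated to every timestep) shifts both the forward-process
# mean and the reverse prediction, which is what keeps the two processes consistent.
```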
2402.16607 Report GVA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos Xinqi Liu, Chenming Wu, Jialun Liu, Xing Liu, Jinbo Wu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang In this paper, we present a novel method that facilitates the creation of vivid 3D Gaussian avatars from monocular video inputs (GVA). Our innovation lies in addressing the intricate challenges of delivering high-fidelity human body reconstructions and aligning 3D Gaussians with human skin surfaces accurately. The key contributions of this paper are twofold. Firstly, we introduce a pose refinement technique to improve hand and foot pose accuracy by aligning normal maps and silhouettes. Precise pose is crucial for correct shape and appearance reconstruction. Secondly, we address the problems of unbalanced aggregation and initialization bias that previously diminished the quality of 3D Gaussian avatars, through a novel surface-guided re-initialization method that ensures accurate alignment of 3D Gaussian points with avatar surfaces. Experimental results demonstrate that our proposed method achieves high-fidelity and vivid 3D Gaussian avatar reconstruction. Extensive experimental analyses validate the performance qualitatively and quantitatively, demonstrating that it achieves state-of-the-art performance in photo-realistic novel view synthesis while offering fine-grained control over the human body and hand pose. Project page: https://3d-aigc.github.io/GVA/. This paper proposes GVA, a novel method to reconstruct high-fidelity, hand-controllable 3D Gaussian avatars from monocular videos. Existing methods struggle with accurate hand and foot pose estimation, leading to limitations in avatar expressiveness and controllability, especially for hand movements. The method uses a two-stage pose refinement technique aligning normal maps and silhouettes for accurate pose estimation. It then introduces a surface-guided re-initialization mechanism to address unbalanced aggregation and initialization bias in 3D Gaussian point distribution. The method achieves high-fidelity avatar reconstruction with detailed hand movements, as demonstrated on the ZJU-MoCap, People-Snapshot, and GVA-Snapshot datasets. It outperforms existing state-of-the-art methods in both qualitative and quantitative evaluations, showing better accuracy in shape, appearance, and perceptual quality. The ablation study confirms the effectiveness of each proposed component, including pose refinement and surface-guided re-initialization. The method currently lacks facial expression control and struggles with very loose clothing. Future work will explore incorporating learnable blendshapes for facial expressions and physics-based deformation priors for handling loose garments. 3d gaussian avatar, monocular reconstruction, pose refinement, surface-guided re-initialization, hand controllable
2402.16506 Report Stochastic Conditional Diffusion Models for Semantic Image Synthesis Juyeon Ko, Inho Kong, Dogyun Park, Hyunwoo J. Kim Semantic image synthesis (SIS) is a task to generate realistic images corresponding to semantic maps (labels). It can be applied to diverse real-world practices such as photo editing or content creation. However, in real-world applications, SIS often encounters noisy user inputs. To address this, we propose Stochastic Conditional Diffusion Model (SCDM), which is a robust conditional diffusion model that features novel forward and generation processes tailored for SIS with noisy labels. It enhances robustness by stochastically perturbing the semantic label maps through Label Diffusion, which diffuses the labels with discrete diffusion. Through the diffusion of labels, the noisy and clean semantic maps become similar as the timestep increases, eventually becoming identical at $t=T$. This facilitates the generation of an image close to a clean image, enabling robust generation. Furthermore, we propose a class-wise noise schedule to differentially diffuse the labels depending on the class. We demonstrate that the proposed method generates high-quality samples through extensive experiments and analyses on benchmark datasets, including a novel experimental setup simulating human errors during real-world applications. This paper proposes Stochastic Conditional Diffusion Model (SCDM), a robust conditional diffusion model for semantic image synthesis (SIS) that can handle noisy user inputs (labels). SIS is important for real-world applications like photo editing, but user-provided labels are often noisy, leading to poor image generation. SCDM addresses this challenge by improving robustness to noisy labels. SCDM introduces 'Label Diffusion,' a discrete diffusion process that stochastically perturbs semantic labels during training. This makes the model robust to discrepancies between clean training labels and noisy user-provided labels during inference. Additionally, SCDM uses a class-wise noise schedule to preserve information for small and rare objects. SCDM outperforms existing GAN-based and diffusion-based SIS models on noisy label benchmarks, showing significant FID improvements. SCDM demonstrates strong performance even with highly corrupted labels, generating images similar to those produced with clean labels. The proposed class-wise noise schedule significantly improves the generation quality of small and rare objects. The reliance on pre-trained segmentation models for mIoU evaluation might not perfectly reflect the true semantic correspondence. Exploring different noise schedule hyperparameters could further improve performance. diffusion models, semantic image synthesis, conditional generation, robustness, noisy labels
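A minimal sketch of the label-perturbation idea, assuming a simple linear corruption schedule with per-class multipliers standing in for the paper's class-wise noise schedule; the actual discrete-diffusion transition matrices differ.

```python
import torch

def label_diffusion(labels, t, T, num_classes, class_rate=None):
    """Toy 'Label Diffusion': each pixel's class label is resampled uniformly at
    random with a probability that grows with the timestep t, scaled per class.
    As t grows, clean and noisy semantic maps become increasingly similar in
    distribution, which is what makes generation robust to noisy labels."""
    if class_rate is None:
        class_rate = torch.ones(num_classes)            # per-class schedule multipliers
    p_flip = ((t / T) * class_rate[labels]).clamp(max=1.0)
    flip = torch.rand(labels.shape) < p_flip
    random_labels = torch.randint(0, num_classes, labels.shape)
    return torch.where(flip, random_labels, labels)

# usage: a 4x4 semantic map with 5 classes, heavier noise on class 0, lighter on rare classes 3-4
lbl = torch.randint(0, 5, (1, 4, 4))
noisy = label_diffusion(lbl, t=600, T=1000, num_classes=5,
                        class_rate=torch.tensor([1.5, 1.0, 1.0, 0.5, 0.5]))
print(noisy)
```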
2402.16421 Report Outline-Guided Object Inpainting with Diffusion Models Markus Pobitzer, Filip Janicki, Mattia Rigotti, Cristiano Malossi Instance segmentation datasets play a crucial role in training accurate and robust computer vision models. However, obtaining accurate mask annotations to produce high-quality segmentation datasets is a costly and labor-intensive process. In this work, we show how this issue can be mitigated by starting with small annotated instance segmentation datasets and augmenting them to effectively obtain a sizeable annotated dataset. We achieve that by creating variations of the available annotated object instances in a way that preserves the provided mask annotations, thereby resulting in new image-mask pairs to be added to the set of annotated images. Specifically, we generate new images using a diffusion-based inpainting model to fill out the masked area with a desired object class by guiding the diffusion through the object outline. We show that the object outline provides a simple, but also reliable and convenient training-free guidance signal for the underlying inpainting model that is often sufficient to fill out the mask with an object of the correct class without further text guidance and preserve the correspondence between generated images and the mask annotations with high precision. Our experimental results reveal that our method successfully generates realistic variations of object instances, preserving their shape characteristics while introducing diversity within the augmented area. We also show that the proposed method can naturally be combined with text guidance and other image augmentation techniques. This paper presents a novel data augmentation method for instance segmentation datasets leveraging diffusion-based inpainting guided by object outlines. Annotating instance segmentation datasets is expensive and time-consuming. This method offers a way to augment existing datasets and potentially improve model performance. The method erodes object masks to create outlines, then uses a diffusion-based inpainting model (Stable Diffusion) to generate variations of the original object within the outline, optionally guided by text prompts (object class). Augmenting a few-shot instance segmentation dataset with generated images improved segmentation average precision (AP). The method achieved state-of-the-art Fréchet Inception Distance (FID) scores, indicating high-quality generated images. Object outlines proved to be a strong guidance signal for the inpainting model, enabling realistic and diverse object variations. The method can fail when objects are severely occluded or out-of-distribution. Further research is needed to understand the complex relationship between scene context, object outline, and generated image quality. image inpainting, data augmentation, instance segmentation, diffusion models, stable diffusion
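A rough sketch of the augmentation recipe, assuming OpenCV for the mask erosion and the off-the-shelf Stable Diffusion inpainting pipeline from the diffusers library (requires a GPU and the listed checkpoint). The paper's exact outline-guidance mechanism may differ; here the eroded interior is inpainted while a thin band of original pixels around the object constrains the generated shape. File paths and the outline width are hypothetical.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def outline_guided_variation(image_path, mask_path, class_name, outline_px=8):
    """Generate a shape-preserving variation of a masked object instance: erode
    the instance mask so that an outline band of original pixels is kept, and
    inpaint only the interior, which nudges the model toward an object of the
    same class and silhouette (so the existing mask annotation stays valid)."""
    image = Image.open(image_path).convert("RGB").resize((512, 512))
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    mask = cv2.resize(mask, (512, 512), interpolation=cv2.INTER_NEAREST)

    kernel = np.ones((outline_px, outline_px), np.uint8)
    interior = cv2.erode((mask > 0).astype(np.uint8) * 255, kernel)   # outline band stays untouched
    inpaint_mask = Image.fromarray(interior)

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    out = pipe(prompt=f"a photo of a {class_name}", image=image,
               mask_image=inpaint_mask).images[0]
    return out   # new image-mask pair: generated image + the original annotation
```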
2402.16370 Report DEYO: DETR with YOLO for End-to-End Object Detection Haodong Ouyang The training paradigm of DETRs is heavily contingent upon pre-training their backbone on the ImageNet dataset. However, the limited supervisory signals provided by the image classification task and one-to-one matching strategy result in an inadequately pre-trained neck for DETRs. Additionally, the instability of matching in the early stages of training engenders inconsistencies in the optimization objectives of DETRs. To address these issues, we have devised an innovative training methodology termed step-by-step training. Specifically, in the first stage of training, we employ a classic detector, pre-trained with a one-to-many matching strategy, to initialize the backbone and neck of the end-to-end detector. In the second stage of training, we freeze the backbone and neck of the end-to-end detector, necessitating the training of the decoder from scratch. Through the application of step-by-step training, we have introduced the first real-time end-to-end object detection model that utilizes a purely convolutional structure encoder, DETR with YOLO (DEYO). Without reliance on any supplementary training data, DEYO surpasses all existing real-time object detectors in both speed and accuracy. Moreover, the comprehensive DEYO series can complete its second-phase training on the COCO dataset using a single 8GB RTX 4060 GPU, significantly reducing the training expenditure. Source code and pre-trained models are available at https://github.com/ouyanghaodong/DEYO. The paper introduces DEYO, the first real-time end-to-end object detector that uses a purely convolutional encoder, and a novel step-by-step training method for DETRs that eliminates the need for pre-training on additional datasets like ImageNet. Existing DETR models rely on pre-training on datasets like ImageNet, limiting flexibility and increasing development costs. Additionally, limited supervisory signals in DETR training lead to inadequately pre-trained necks and unstable optimization. The step-by-step training method involves two stages: 1) Pre-train a classic detector (YOLOv8) with a one-to-many matching strategy to initialize the backbone and neck. 2) Freeze the backbone and neck and train the decoder from scratch. DEYO utilizes this method and incorporates a YOLO backbone and neck with a lightweight convolutional encoder and a Transformer-based decoder. DEYO surpasses state-of-the-art real-time object detectors in both speed and accuracy without any additional training data. The step-by-step training method significantly improves performance compared to conventional DETR training. DEYO demonstrates superior performance in dense scenarios, achieving 92.3 AP and 43.3 mMR on the CrowdHuman dataset. The neck of YOLOv8 and the model scaling strategy are not fully optimized for DEYO, leading to diminishing performance gains with increasing model size. The mismatch between the output dimensions of YOLOv8's neck and the hidden dimensions of DEYO's decoder needs further investigation. object detection, detr, yolo, step-by-step training, real-time
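A minimal sketch of the second training stage: freeze the YOLO-initialized backbone and neck and optimize only the decoder. The attribute names `backbone` and `neck` on the detector object are illustrative assumptions, not DEYO's actual code.

```python
import torch

def configure_stage_two(detector, lr=1e-4):
    """Stage 2 of step-by-step training (schematic): the backbone and neck,
    initialized from a one-to-many pre-trained detector such as YOLOv8, are
    frozen; only the transformer decoder and its prediction heads are trained."""
    for module in (detector.backbone, detector.neck):   # hypothetical attribute names
        module.eval()                                   # also freezes BatchNorm running stats
        for p in module.parameters():
            p.requires_grad_(False)

    trainable = [p for p in detector.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=1e-4)
```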
2402.16366 Report SPC-NeRF: Spatial Predictive Compression for Voxel Based Radiance Field Zetian Song, Wenhong Duan, Yuhuai Zhang, Shiqi Wang, Siwei Ma, Wen Gao Representing the Neural Radiance Field (NeRF) with the explicit voxel grid (EVG) is a promising direction for improving NeRFs. However, the EVG representation is not efficient for storage and transmission because of the prohibitive memory cost. Current methods for compressing EVG mainly inherit the methods designed for neural network compression, such as pruning and quantization, which do not take full advantage of the spatial correlation of voxels. Inspired by mature digital image compression techniques, this paper proposes SPC-NeRF, a novel framework applying spatial predictive coding in EVG compression. The proposed framework can remove spatial redundancy efficiently for better compression performance. Moreover, we model the bitrate and design a novel form of the loss function, where we can jointly optimize compression ratio and distortion to achieve higher coding efficiency. Extensive experiments demonstrate that our method can achieve 32% bit saving compared to the state-of-the-art method VQRF on multiple representative test datasets, with comparable training time. SPC-NeRF, a novel framework for compressing Explicit Voxel Grid (EVG) represented Neural Radiance Fields (NeRFs) using spatial predictive coding, leading to significant bitrate reduction without substantial quality loss. EVG NeRFs offer fast training and rendering but have large memory footprints, hindering storage and transmission. Existing compression methods often ignore spatial correlation among voxels. 1) Importance pruning and identification of critical voxels. 2) Construction of a reference graph for spatial prediction. 3) Scalar quantization and prediction on the feature grid. 4) Joint rate-distortion optimization during finetuning with entropy modeling. 5) Two-step finetuning with coarse and fine quantization for critical voxels. SPC-NeRF achieves 32% bit saving compared to VQRF on the Synthetic-NeRF dataset. The method demonstrates over 100x compression on uncompressed DVGO with negligible quality degradation. SPC-NeRF generates a smooth, approximate logarithmic rate-distortion curve by adjusting a single trade-off factor. The current implementation only uses complete voxels for prediction, limiting potential compression gains. Future work can explore more efficient entropy coding and block-based prediction modes to further reduce bitrate. neural radiance fields, nerf compression, explicit voxel grid, spatial predictive coding, rate-distortion optimization
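A toy illustration of the spatial-predictive-coding step: each voxel feature is coded as a quantized residual against an already-coded reference neighbour, and a crude differentiable rate proxy is traded off against distortion. The reference-graph construction, the actual entropy model, and the two-step finetuning are omitted; everything here is schematic.

```python
import torch

def spatially_coded_features(features, reference_idx, step=0.05):
    """Replace each voxel feature by (reference prediction + quantized residual),
    using a straight-through estimator so gradients still reach `features`.
    Returns the coded features and a differentiable rate proxy for the residuals."""
    prediction = features[reference_idx].detach()                 # predict from a coded neighbour
    residual = features - prediction
    q_residual = residual + (torch.round(residual / step) * step - residual).detach()
    rate_proxy = torch.log1p(residual.abs() / step).mean()        # crude code-length surrogate
    return prediction + q_residual, rate_proxy

# joint rate-distortion objective; a placeholder stands in for the rendering loss
feats = torch.randn(1000, 12, requires_grad=True)
refs = torch.randint(0, 1000, (1000,))
coded, rate = spatially_coded_features(feats, refs)
distortion = coded.pow(2).mean()                                  # placeholder for the rendering loss
(distortion + 0.01 * rate).backward()
```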
2402.16359 Report Feedback Efficient Online Fine-Tuning of Diffusion Models Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Sergey Levine, Tommaso Biancalani Diffusion models excel at modeling complex data distributions, including those of images, proteins, and small molecules. However, in many cases, our goal is to model parts of the distribution that maximize certain properties: for example, we may want to generate images with high aesthetic quality, or molecules with high bioactivity. It is natural to frame this as a reinforcement learning (RL) problem, in which the objective is to fine-tune a diffusion model to maximize a reward function that corresponds to some property. Even with access to online queries of the ground-truth reward function, efficiently discovering high-reward samples can be challenging: they might have a low probability in the initial distribution, and there might be many infeasible samples that do not even have a well-defined reward (e.g., unnatural images or physically impossible molecules). In this work, we propose a novel reinforcement learning procedure that efficiently explores on the manifold of feasible samples. We present a theoretical analysis providing a regret guarantee, as well as empirical validation across three domains: images, biological sequences, and molecules. This paper proposes SEIKO, a feedback-efficient online fine-tuning approach for diffusion models, tailored for scenarios where querying the reward function is expensive. Fine-tuning diffusion models with RL often requires numerous expensive queries to the true reward function. SEIKO minimizes this by efficiently exploring the space of valid samples to quickly discover high-reward designs. SEIKO interleaves reward learning and diffusion model updates. It leverages KL regularization to preserve information from a pre-trained diffusion model, ensuring exploration within the manifold of feasible designs. Additionally, it employs an uncertainty model to guide exploration towards novel, potentially high-reward regions. SEIKO demonstrates superior feedback efficiency compared to non-adaptive baselines and naive online fine-tuning methods, highlighting the importance of adaptive data collection and KL regularization. Empirical validation across image generation (aesthetic quality), protein sequence design (fluorescence), and molecule generation (QED score) confirms SEIKO's ability to efficiently discover high-reward designs. Theoretical analysis provides a regret guarantee for SEIKO, demonstrating its provable feedback efficiency. The current work focuses on Markovian diffusion models. Extending the approach to non-Markovian settings could be explored. Future research can investigate the application of SEIKO to diffusion models specifically designed for biological or chemical applications, such as those generating molecular graphs or protein structures. diffusion models, reinforcement learning, online learning, feedback efficiency, generative models
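The procedure is easiest to see on a toy analogue. In the sketch below the diffusion model is replaced by a 2-D Gaussian sampler with a learnable mean, the expensive reward oracle is a known quadratic, the uncertainty model is a small bootstrap ensemble, and the KL term reduces to a squared distance between means; only the structure of the loop (interleaved feedback collection, optimistic reward, KL regularization toward the pre-trained model) mirrors the method, and all names and constants are illustrative.

```python
import torch

def true_reward(x):                            # expensive oracle, queried only on feedback batches
    return -(x - 2.0).pow(2).sum(dim=-1)

def fit_reward_ensemble(X, y, n_models=5, steps=200):
    """Bootstrap ensemble of linear reward models; the spread across members
    serves as the uncertainty estimate."""
    models = []
    for _ in range(n_models):
        idx = torch.randint(0, len(X), (len(X),))
        w = torch.zeros(X.shape[1] + 1, requires_grad=True)
        opt = torch.optim.Adam([w], lr=0.05)
        for _ in range(steps):
            pred = X[idx] @ w[:-1] + w[-1]
            loss = (pred - y[idx]).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        models.append(w.detach())
    return models

def optimistic_reward(models, x):
    preds = torch.stack([x @ w[:-1] + w[-1] for w in models])
    return preds.mean(dim=0) + preds.std(dim=0)        # mean + uncertainty bonus (UCB-style)

mu_pretrained = torch.zeros(2)                         # "pre-trained model"
mu = torch.zeros(2, requires_grad=True)                # model being fine-tuned
opt = torch.optim.Adam([mu], lr=0.05)
X_fb, y_fb = torch.empty(0, 2), torch.empty(0)

for rnd in range(5):                                   # online feedback rounds
    with torch.no_grad():
        batch = mu + torch.randn(20, 2)                # samples from the current model
    X_fb = torch.cat([X_fb, batch])
    y_fb = torch.cat([y_fb, true_reward(batch)])       # the only oracle queries
    ensemble = fit_reward_ensemble(X_fb, y_fb)
    for _ in range(100):                               # fine-tune: optimistic reward + KL to pre-trained
        samples = mu + torch.randn(64, 2)
        kl = 0.5 * (mu - mu_pretrained).pow(2).sum()
        loss = -optimistic_reward(ensemble, samples).mean() + 0.1 * kl
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(rnd, float(true_reward(mu.detach().unsqueeze(0))))
```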
2402.16124 Report AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation Yasheng Sun, Wenqing Chu, Hang Zhou, Kaisiyuan Wang, Hideki Koike While considerable progress has been made in achieving accurate lip synchronization for 3D speech-driven talking face generation, the task of incorporating expressive facial detail synthesis aligned with the speaker's speaking status remains challenging. Our goal is to directly leverage the inherent style information conveyed by human speech for generating an expressive talking face that aligns with the speaking status. In this paper, we propose AVI-Talking, an Audio-Visual Instruction system for expressive Talking face generation. This system harnesses the robust contextual reasoning and hallucination capability offered by Large Language Models (LLMs) to instruct the realistic synthesis of 3D talking faces. Instead of directly learning facial movements from human speech, our two-stage strategy involves the LLMs first comprehending audio information and generating instructions implying expressive facial details seamlessly corresponding to the speech. Subsequently, a diffusion-based generative network executes these instructions. This two-stage process, coupled with the incorporation of LLMs, enhances model interpretability and provides users with flexibility to comprehend instructions and specify desired operations or modifications. Extensive experiments showcase the effectiveness of our approach in producing vivid talking faces with expressive facial movements and consistent emotional status. Presents AVI-Talking, an Audio-Visual Instruction system for generating expressive 3D talking faces by leveraging the inherent style information in human speech. Addresses the challenge of incorporating expressive facial details aligned with speaking status in 3D talking face generation, which previous methods struggle to achieve. A two-stage strategy is employed: 1) LLMs comprehend audio and generate instructions for expressive facial details, 2) A diffusion-based network synthesizes talking faces following these instructions. Generates vivid 3D talking faces with expressive facial movements and consistent emotional status. Outperforms previous state-of-the-art methods in subjective user studies on aspects of lip sync quality, movement expressiveness, and expression consistency. Demonstrates the ability to generate diverse facial expressions for a given speech input and handle out-of-distribution instructions to some extent. Model's performance depends on the quality and diversity of the training dataset, potentially leading to insensitivity to certain speaking styles. Effectiveness of instruction following is limited to instructions similar to the dataset distribution. talking face generation, 3d facial animation, audio-visual instruction, large language models (llms), diffusion models
2402.16013 Report Semi-supervised Open-World Object Detection Sahal Shaji Mullappilly, Abhishek Singh Gehlot, Rao Muhammad Anwer, Fahad Shahbaz Khan, Hisham Cholakkal Conventional open-world object detection (OWOD) problem setting first distinguishes known and unknown classes and then later incrementally learns the unknown objects when introduced with labels in the subsequent tasks. However, the current OWOD formulation heavily relies on the external human oracle for knowledge input during the incremental learning stages. Such run-time reliance on a human oracle makes this formulation less realistic for real-world deployment. To address this, we introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD), that reduces the annotation cost by casting the incremental learning stages of OWOD in a semi-supervised manner. We demonstrate that the performance of the state-of-the-art OWOD detector dramatically deteriorates in the proposed SS-OWOD setting. Therefore, we introduce a novel SS-OWOD detector, named SS-OWFormer, that utilizes a feature-alignment scheme to better align the object query representations between the original and augmented images to leverage the large unlabeled and few labeled data. We further introduce a pseudo-labeling scheme for unknown detection that exploits the inherent capability of decoder object queries to capture object-specific information. We demonstrate the effectiveness of our SS-OWOD problem setting and approach for remote sensing object detection, proposing carefully curated splits and baseline performance evaluations. Our experiments on 4 datasets including MS COCO, PASCAL, Objects365 and DOTA demonstrate the effectiveness of our approach. Our source code, models and splits are available here - https://github.com/sahalshajim/SS-OWFormer Introduces Semi-supervised Open-World Object Detection (SS-OWOD) setting and SS-OWFormer, a novel transformer-based detector, to reduce annotation reliance in open-world object detection. Existing OWOD methods depend heavily on human oracles for labeling unknown objects, which is costly and impractical in real-world applications. SS-OWFormer utilizes a feature alignment scheme to align object queries between original and augmented images, leveraging unlabeled data. It also employs an object query-guided pseudo-labeling scheme for improved unknown object detection. SS-OWFormer with 10% labeled data outperforms state-of-the-art OW-DETR with 50% labeled data on COCO. SS-OWFormer achieves a 4.8% absolute gain in unknown recall over OW-DETR. Demonstrated effectiveness of SS-OWOD and SS-OWFormer on remote sensing object detection with curated splits and baseline evaluations. SS-OWFormer lacks an explicit mechanism for forgetting previously seen categories. Performance can be further improved for challenging scenarios in satellite imagery with overlapping objects. open-world object detection, semi-supervised learning, transformer, pseudo-labeling, remote sensing
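A minimal sketch of the feature-alignment signal on unlabeled images: the decoder's object-query embeddings from an image and from an augmented view of it are pulled together. Matching queries one-to-one by index and using a cosine consistency loss are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def query_alignment_loss(queries_orig, queries_aug):
    """Consistency loss between decoder object-query embeddings computed for an
    image and for an augmented view of the same (possibly unlabeled) image.
    Queries are matched one-to-one by index here for simplicity."""
    q1 = F.normalize(queries_orig, dim=-1)     # (B, num_queries, D)
    q2 = F.normalize(queries_aug, dim=-1)
    return (1.0 - (q1 * q2).sum(dim=-1)).mean()

# usage with random stand-ins for the two decoder outputs
q_orig = torch.randn(4, 100, 256, requires_grad=True)
q_aug = torch.randn(4, 100, 256)
loss = query_alignment_loss(q_orig, q_aug)
loss.backward()
```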
2402.15870 Report Spec-Gaussian: Anisotropic View-Dependent Appearance for 3D Gaussian Splatting Ziyi Yang, Xinyu Gao, Yangtian Sun, Yihua Huang, Xiaoyang Lyu, Wen Zhou, Shaohui Jiao, Xiaojuan Qi, Xiaogang Jin The recent advancements in 3D Gaussian splatting (3D-GS) have not only facilitated real-time rendering through modern GPU rasterization pipelines but have also attained state-of-the-art rendering quality. Nevertheless, despite its exceptional rendering quality and performance on standard datasets, 3D-GS frequently encounters difficulties in accurately modeling specular and anisotropic components. This issue stems from the limited ability of spherical harmonics (SH) to represent high-frequency information. To overcome this challenge, we introduce Spec-Gaussian, an approach that utilizes an anisotropic spherical Gaussian (ASG) appearance field instead of SH for modeling the view-dependent appearance of each 3D Gaussian. Additionally, we have developed a coarse-to-fine training strategy to improve learning efficiency and eliminate floaters caused by overfitting in real-world scenes. Our experimental results demonstrate that our method surpasses existing approaches in terms of rendering quality. Thanks to ASG, we have significantly improved the ability of 3D-GS to model scenes with specular and anisotropic components without increasing the number of 3D Gaussians. This improvement extends the applicability of 3D-GS to handle intricate scenarios with specular and anisotropic surfaces. Introduced "Spec-Gaussian", a novel 3D Gaussian splatting approach featuring an anisotropic view-dependent appearance using an ASG appearance field, and a coarse-to-fine training mechanism to eliminate floaters in rendered scenes. Addresses limitations of 3D Gaussian Splatting (3D-GS) in modeling specular and anisotropic components, which are common in real-world scenes and crucial for photorealistic rendering. Replaces spherical harmonics in 3D-GS with an anisotropic spherical Gaussian (ASG) appearance field to model high-frequency information. Employs a hybrid approach with anchor Gaussians to reduce computational and storage overhead. Introduces a coarse-to-fine training scheme to learn global information and reduce overfitting, minimizing floaters. Achieves state-of-the-art rendering quality on multiple benchmarks, including NeRF, NSVF, and anisotropic scenes. Significantly improves 3D-GS's ability to model complex specular reflections and anisotropic materials, exceeding NeRF-based methods in some cases. Maintains fast rendering speeds comparable to other 3D-GS-based methods while improving visual quality. Faces challenges in handling reflections due to the lack of explicit geometry in 3D-GS. Reliance on ground truth geometry for better reflection modeling can lead to a decline in overall rendering quality. 3d gaussian splatting, neural rendering, anisotropy, specular highlights, real-time rendering
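For reference, the anisotropic spherical Gaussian used in place of spherical harmonics has the standard form ASG(v) = a * max(v·z, 0) * exp(-lam*(v·x)^2 - mu*(v·y)^2) over an orthonormal frame (x, y, z). A small sketch follows; the per-Gaussian storage layout and the shapes are illustrative, not the released implementation.

```python
import torch

def asg(view_dirs, frame, amplitude, lam, mu):
    """Evaluate an Anisotropic Spherical Gaussian (ASG) for a batch of view
    directions:  ASG(v) = a * max(v·z, 0) * exp(-lam*(v·x)^2 - mu*(v·y)^2),
    where (x, y, z) is an orthonormal frame (z = lobe axis; x, y = tangent axes
    controlling the two anisotropic bandwidths lam and mu).
    view_dirs: (N, 3) unit vectors; frame: (3, 3) rows x, y, z; amplitude: (C,)."""
    x_axis, y_axis, z_axis = frame[0], frame[1], frame[2]
    s = (view_dirs @ z_axis).clamp(min=0.0)                        # smooth lobe term
    g = torch.exp(-lam * (view_dirs @ x_axis) ** 2
                  - mu * (view_dirs @ y_axis) ** 2)                # anisotropic falloff
    return s[:, None] * g[:, None] * amplitude                     # (N, C) view-dependent feature

# usage: one ASG lobe with an RGB amplitude and a few random view directions
dirs = torch.nn.functional.normalize(torch.randn(5, 3), dim=-1)
frame = torch.eye(3)                                               # rows are x, y, z axes
color = asg(dirs, frame, amplitude=torch.tensor([0.8, 0.6, 0.4]), lam=10.0, mu=2.0)
print(color.shape)   # torch.Size([5, 3])
```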
2402.15784 Report IRConStyle: Image Restoration Framework Using Contrastive Learning and Style Transfer Dongqi Fan, Xin Zhao, Liang Chang Recently, the contrastive learning paradigm has achieved remarkable success in high-level tasks such as classification, detection, and segmentation. However, contrastive learning applied in low-level tasks, like image restoration, is limited, and its effectiveness is uncertain. This raises a question: Why does the contrastive learning paradigm not yield satisfactory results in image restoration? In this paper, we conduct in-depth analyses and propose three guidelines to address the above question. In addition, inspired by style transfer and based on contrastive learning, we propose a novel module for image restoration called ConStyle, which can be efficiently integrated into any U-Net structure network. By leveraging the flexibility of ConStyle, we develop a general restoration network for image restoration. ConStyle and the general restoration network together form an image restoration framework, namely IRConStyle. To demonstrate the capability and compatibility of ConStyle, we replace the general restoration network with transformer-based, CNN-based, and MLP-based networks, respectively. We perform extensive experiments on various image restoration tasks, including denoising, deblurring, deraining, and dehazing. The results on 19 benchmarks demonstrate that ConStyle can be integrated with any U-Net-based network and significantly enhance performance. For instance, ConStyle NAFNet significantly outperforms the original NAFNet on SOTS outdoor (dehazing) and Rain100H (deraining) datasets, with PSNR improvements of 4.16 dB and 3.58 dB with 85% fewer parameters. This paper analyzes the limitations of contrastive learning (CL) in image restoration (IR) and proposes a novel plug-and-play module called ConStyle, integrated into a general IR framework (IRConStyle) to enhance IR performance. Contrastive learning, highly successful in high-level vision tasks, has shown limited effectiveness in low-level tasks like IR. This paper addresses this gap by analyzing the reasons behind this limitation and proposing a novel framework to leverage CL for improved IR. The paper proposes three guidelines for enhancing CL in IR: using additional data structures for storing samples, utilizing the encoder's latent features, and adopting a suitable pretext task. It introduces ConStyle, a module inspired by style transfer, and integrates it into a general U-Net based restoration network (IRConStyle). Experiments are conducted by replacing the restoration network with transformer-based, CNN-based, and MLP-based networks for various IR tasks. ConStyle significantly improves the performance of existing IR models on various benchmarks, including denoising, deblurring, dehazing, and deraining. ConStyle NAFNet, for instance, achieves significant PSNR improvements over the original NAFNet on SOTS outdoor (dehazing) and Rain100H (deraining) datasets, with 4.16 dB and 3.58 dB improvements respectively, while using 85% fewer parameters. Ablation studies demonstrate the effectiveness of the proposed guidelines and the individual components of ConStyle. A remaining limitation is the computational complexity increase caused by replacing different upsampling and downsampling methods within the restoration network. Exploring other pretext tasks for better utilization of CL in the IR domain remains future work. image restoration, contrastive learning, style transfer, deep learning, computer vision
2402.15648 Report MambaIR: A Simple Baseline for Image Restoration with State-Space Model Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, Shu-Tao Xia Recent years have seen significant advancements in image restoration, largely attributed to the development of modern deep neural networks, such as CNNs and Transformers. However, existing restoration backbones often face the dilemma between global receptive fields and efficient computation, hindering their application in practice. Recently, the Selective Structured State Space Model, especially the improved version Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers a way to resolve the above dilemma. However, the standard Mamba still faces certain challenges in low-level vision such as local pixel forgetting and channel redundancy. In this work, we introduce a simple but effective baseline, named MambaIR, which introduces both local enhancement and channel attention to improve the vanilla Mamba. In this way, our MambaIR takes advantage of the local pixel similarity and reduces the channel redundancy. Extensive experiments demonstrate the superiority of our method, for example, MambaIR outperforms SwinIR by up to 0.45dB on image SR, using similar computational cost but with a global receptive field. Code is available at \url{https://github.com/csguoh/MambaIR}. This paper introduces MambaIR, a novel image restoration model based on the Mamba state-space model, aiming to address the trade-off between computational efficiency and global receptive fields in existing methods. Current image restoration methods, employing CNNs or Transformers, struggle to simultaneously achieve global receptive fields for high-quality reconstruction and efficient computation for practical application. MambaIR, leveraging the strengths of the Mamba model, provides a solution to overcome this limitation. MambaIR consists of three stages: shallow feature extraction, deep feature extraction using stacked Residual State Space Blocks (RSSBs), and high-quality image reconstruction. RSSB, as the core component, incorporates local convolution to mitigate local pixel forgetting and channel attention to reduce channel redundancy in the standard Mamba model. MambaIR consistently outperforms SwinIR, a state-of-the-art Transformer-based method, on various image super-resolution benchmarks, achieving up to 0.45dB PSNR improvement with similar computational cost. The ablation study validates the effectiveness of local enhancement and channel attention in RSSB, highlighting their contribution to MambaIR's superior performance. MambaIR exhibits strong performance on image denoising tasks, both for synthetic Gaussian noise and real-world noise, demonstrating its robustness and generalization ability. The current implementation of MambaIR primarily focuses on single-image restoration tasks, and extending it to video restoration could be a potential future direction. Further exploration of more efficient and effective unfolding strategies in the 2D Selective Scan Module (2D-SSM) could further enhance MambaIR's performance. image restoration, state space model, mamba, global receptive field, efficient computation
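A simplified sketch of the block structure. The real 2D selective-scan (Mamba) operator needs dedicated kernels (e.g., from the mamba_ssm package), so a large depthwise convolution stands in for it below; what the sketch does show is the local-enhancement convolution and channel attention wrapped around the token mixer, plus the scaled residual connection.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True), nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.mlp(x)            # reweight channels to reduce redundancy

class SimplifiedRSSB(nn.Module):
    """Residual block with: a placeholder global token mixer standing in for the
    2D selective scan, a 3x3 convolution for local enhancement (mitigating
    'local pixel forgetting'), and channel attention, all wrapped in a
    learnable-scale residual connection."""
    def __init__(self, dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.global_mixer = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # stand-in for the 2D-SSM
        self.local_conv = nn.Conv2d(dim, dim, 3, padding=1)
        self.ca = ChannelAttention(dim)
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        y = x.flatten(2).transpose(1, 2)                   # (B, H*W, C) for LayerNorm
        y = self.norm(y).transpose(1, 2).reshape(b, c, h, w)
        y = self.global_mixer(y) + self.local_conv(y)      # global mixing + local enhancement
        y = self.ca(y)
        return x + self.scale * y

x = torch.randn(1, 64, 48, 48)
print(SimplifiedRSSB()(x).shape)   # torch.Size([1, 64, 48, 48])
```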
2402.15555 Report Deep Networks Always Grok and Here is Why Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk Grokking, or delayed generalization, is a phenomenon where generalization in a deep neural network (DNN) occurs long after achieving near zero training error. Previous studies have reported the occurrence of grokking in specific controlled settings, such as DNNs initialized with large-norm parameters or transformers trained on algorithmic datasets. We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training of a convolutional neural network (CNN) on CIFAR10 or a Resnet on Imagenette. We introduce the new concept of delayed robustness, whereby a DNN groks adversarial examples and becomes robust, long after interpolation and/or generalization. We develop an analytical explanation for the emergence of both delayed generalization and delayed robustness based on a new measure of the local complexity of a DNN's input-output mapping. Our local complexity measures the density of the so-called 'linear regions' (aka, spline partition regions) that tile the DNN input space, and serves as a utile progress measure for training. We provide the first evidence that for classification problems, the linear regions undergo a phase transition during training whereafter they migrate away from the training samples (making the DNN mapping smoother there) and towards the decision boundary (making the DNN mapping less smooth there). Grokking occurs post phase transition as a robust partition of the input space emerges thanks to the linearization of the DNN mapping around the training points. Website: https://bit.ly/grok-adversarial This paper demonstrates that grokking, a phenomenon where deep neural networks (DNNs) exhibit delayed generalization, is more widespread than previously thought and occurs in various practical settings. The paper also introduces the concept of 'delayed robustness,' where DNNs achieve robustness to adversarial examples long after generalization. Understanding grokking is crucial as it challenges the conventional understanding of DNN training and generalization. This work provides a novel perspective on grokking by linking it to the dynamics of DNNs' input space partitioning during training. The authors leverage the interpretation of DNNs as continuous piecewise affine spline operators. They introduce 'local complexity,' a new progress measure that quantifies the density of linear regions in the DNN's input space partition. By analyzing the evolution of local complexity throughout training, the authors reveal a consistent pattern leading to grokking. DNNs exhibit 'delayed robustness,' achieving robustness to adversarial examples long after generalization occurs. Local complexity, a measure of non-linearity density in the DNN's input space, follows a double descent pattern during training, with grokking occurring during the final descent phase. During the final descent phase, the DNN's linear regions migrate away from training data points and concentrate around the decision boundary, forming a 'robust partition'. The study primarily relies on empirical analysis, lacking a complete theoretical justification for the observed double descent behavior in local complexity. Future work could explore the connection between region migration and other phenomena like neural collapse, as well as investigate the impact of different optimizers and sharpness-aware minimization techniques on the training dynamics. grokking, delayed generalization, adversarial robustness, deep neural networks, spline theory
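The local-complexity idea lends itself to a compact illustration: for a ReLU network, count how many distinct activation patterns (i.e., linear regions) are hit by random points in a small neighbourhood of an input. This is a proxy in the spirit of the measure, not the paper's estimator.

```python
import torch
import torch.nn as nn

def local_complexity(mlp, x0, radius=0.05, n_samples=512):
    """Proxy for local complexity of a ReLU network around x0: the number of
    distinct ReLU activation patterns (linear regions visited) among random
    points in an L-infinity ball of the given radius."""
    pts = x0 + radius * (2 * torch.rand(n_samples, x0.numel()) - 1)
    patterns = set()
    with torch.no_grad():
        for p in pts:
            h, code = p, []
            for layer in mlp:
                h = layer(h)
                if isinstance(layer, nn.ReLU):
                    code.append((h > 0).to(torch.uint8))   # which units are active
            patterns.add(bytes(torch.cat(code).tolist()))
    return len(patterns)

mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
x0 = torch.randn(2)
print(local_complexity(mlp, x0))   # higher = denser linear-region partition near x0
```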
2402.15504 Report Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition Chun-Hsiao Yeh, Ta-Ying Cheng, He-Yen Hsieh, Chuan-En Lin, Yi Ma, Andrew Markham, Niki Trigoni, H. T. Kung, Yubei Chen Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models. First, current personalization techniques fail to reliably extend to multiple concepts -- we hypothesize this to be due to the mismatch between complex scenes and simple text descriptions in the pre-training dataset (e.g., LAION). Second, given an image containing multiple personalized concepts, there lacks a holistic metric that evaluates performance on not just the degree of resemblance of personalized concepts, but also whether all concepts are present in the image and whether the image accurately reflects the overall text description. To address these issues, we introduce Gen4Gen, a semi-automated dataset creation pipeline utilizing generative models to combine personalized concepts into complex compositions along with text-descriptions. Using this, we create a dataset called MyCanvas, that can be used to benchmark the task of multi-concept personalization. In addition, we design a comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) for better quantifying the performance of multi-concept, personalized text-to-image diffusion methods. We provide a simple baseline built on top of Custom Diffusion with empirical prompting strategies for future researchers to evaluate on MyCanvas. We show that by improving data quality and prompting strategies, we can significantly increase multi-concept personalized image generation quality, without requiring any modifications to model architecture or training algorithms. This paper introduces Gen4Gen, a semi-automated pipeline for creating personalized image datasets with complex multi-concept compositions and detailed text descriptions, named MyCanvas. MyCanvas, along with a novel evaluation metric, addresses limitations in existing personalized text-to-image generation methods and benchmarks. Existing personalization techniques struggle with multiple concepts, particularly when semantically similar, due to limitations in pre-training datasets like LAION. Existing benchmarks also lack a holistic approach to evaluate multi-concept personalization. Gen4Gen leverages object detectors, LLMs, inpainting models, and MLLMs to compose user-provided concept images into new scenes with aligned descriptions. It utilizes prompt engineering to enhance training and proposes a novel metric combining Composition-Personalization-CLIP (CP-CLIP) and Text-Image Alignment CLIP (TI-CLIP) scores. MyCanvas significantly improves multi-concept personalization performance in models like Custom Diffusion and DreamBooth. Proposed prompting strategies further enhance generation quality, particularly in complex compositions. The study highlights the importance of high-quality, well-aligned datasets for personalized image generation. Gen4Gen's reliance on LLMs and diffusion inpainting can sometimes lead to unrealistic compositions or artifacts, requiring manual filtering. Future work could explore automating the filtering process and leveraging richer multi-modal understanding in MLLMs for better composition guidance. text-to-image generation, personalization, dataset creation, multi-concept composition, evaluation metric
2402.15429 Report ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation Yi Zhang, Yun Tang, Wenjie Ruan, Xiaowei Huang, Siddartha Khastgir, Paul Jennings, Xingyu Zhao Text-to-Image (T2I) Diffusion Models (DMs) have shown impressive abilities in generating high-quality images based on simple text descriptions. However, as is common with many Deep Learning (DL) models, DMs are subject to a lack of robustness. While there are attempts to evaluate the robustness of T2I DMs as a binary or worst-case problem, they cannot answer how robust in general the model is whenever an adversarial example (AE) can be found. In this study, we first introduce a probabilistic notion of T2I DMs' robustness; and then establish an efficient framework, ProTIP, to evaluate it with statistical guarantees. The main challenges stem from: i) the high computational cost of the generation process; and ii) determining if a perturbed input is an AE involves comparing two output distributions, which is fundamentally harder compared to other DL tasks like classification where an AE is identified upon misprediction of labels. To tackle the challenges, we employ sequential analysis with efficacy and futility early stopping rules in the statistical testing for identifying AEs, and adaptive concentration inequalities to dynamically determine the "just-right" number of stochastic perturbations whenever the verification target is met. Empirical experiments validate the effectiveness and efficiency of ProTIP over common T2I DMs. Finally, we demonstrate an application of ProTIP to rank commonly used defence methods. This paper introduces ProTIP, the first probabilistic robustness verification framework for text-to-image diffusion models against stochastic perturbations. Existing robustness evaluations of these models are binary or worst-case, failing to quantify overall robustness and posing scalability issues for large models. ProTIP employs sequential analysis with early stopping rules for efficient identification of adversarial examples, and adaptive concentration inequalities to dynamically determine the necessary number of perturbations. ProTIP accurately estimates probabilistic robustness, converging to the approximated ground truth with sufficient perturbations. Early stopping rules significantly reduce computation by up to 4 times compared to fixed-sample methods. Adaptive sample sizing in ProTIP proves more efficient than using a predetermined sample size with Hoeffding's inequality. Ground truth robustness is approximated due to the infeasibility of exhaustive testing. Exploration of more sophisticated text perturbation methods beyond character-level is left for future work. diffusion models, probabilistic robustness, safe ai, text-to-image generation, adversarial examples
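A schematic of the statistical core: draw stochastic perturbations one at a time, maintain an anytime-valid Hoeffding-style confidence interval for the probability that a perturbed input is not adversarial, and stop as soon as the interval clears (or falls below) the verification target. The adversarial-example check, which in the paper compares output distributions, is abstracted behind a user-supplied predicate; the toy usage below is purely illustrative.

```python
import math
import random

def verify_probabilistic_robustness(is_adversarial, perturb, prompt,
                                    target=0.9, delta=0.05, max_samples=2000):
    """Estimate p = P(perturbed prompt is NOT adversarial) with an adaptive
    Hoeffding-style confidence interval (union bound over sample sizes), and
    stop early once the interval lies entirely above or below the target.
    `is_adversarial(prompt, perturbed)` and `perturb(prompt)` are user-supplied."""
    successes, n = 0, 0
    while n < max_samples:
        n += 1
        successes += 0 if is_adversarial(prompt, perturb(prompt)) else 1
        p_hat = successes / n
        eps = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))  # anytime-valid width
        if p_hat - eps >= target:
            return True, p_hat, n       # verified robust with confidence 1 - delta
        if p_hat + eps <= target:
            return False, p_hat, n      # target provably not met
    return p_hat >= target, p_hat, n    # budget exhausted: report the point estimate

# toy usage: character-level perturbation and a random stand-in for the AE check
result = verify_probabilistic_robustness(
    is_adversarial=lambda p, q: random.random() < 0.05,
    perturb=lambda p: p[:-1] + random.choice("abcdefgh"),
    prompt="a photo of a red bicycle")
print(result)
```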
2402.15194 Report Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, Sergey Levine Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins. While diffusion models are trained to represent the distribution in the training dataset, we often are more concerned with other properties, such as the aesthetic quality of the generated images or the functional properties of generated proteins. Diffusion models can be finetuned in a goal-directed way by maximizing the value of some reward function (e.g., the aesthetic quality of an image). However, these approaches may lead to reduced sample diversity, significant deviations from the training data distribution, and even poor sample quality due to the exploitation of an imperfect reward function. The last issue often occurs when the reward function is a learned model meant to approximate a ground-truth "genuine" reward, as is the case in many practical applications. These challenges, collectively termed "reward collapse," pose a substantial obstacle. To address this reward collapse, we frame the finetuning problem as entropy-regularized control against the pretrained diffusion model, i.e., directly optimizing entropy-enhanced rewards with neural SDEs. We present theoretical and empirical evidence that demonstrates our framework is capable of efficiently generating diverse samples with high genuine rewards, mitigating the overoptimization of imperfect reward models. This paper introduces ELEGANT, a novel method for fine-tuning diffusion models using entropy-regularized control, addressing limitations of existing techniques. Fine-tuning diffusion models with reward functions often leads to reward collapse, sacrificing sample diversity and quality due to over-optimization of imperfect reward signals. ELEGANT frames fine-tuning as entropy-regularized control against a pre-trained diffusion model, learning both the drift term and initial distribution using neural SDEs to generate samples from a target distribution balancing reward maximization and proximity to the original data. ELEGANT effectively mitigates reward collapse, generating high-reward samples that are diverse and stay close to the training data distribution. Compared to KL-penalized RL fine-tuning, ELEGANT demonstrates superior performance in terms of reward, KL divergence, and diversity across image and biological sequence generation tasks. The paper provides theoretical results demonstrating the effectiveness of ELEGANT in targeting the desired distribution and maintaining bridges with the pre-trained diffusion model. The effectiveness of ELEGANT relies on the accuracy of neural SDE solvers and the expressiveness of neural networks used for value function estimation. Future work includes exploring the application of ELEGANT to more specialized diffusion models in biology and chemistry. diffusion models, fine-tuning, entropy regularization, stochastic control, reward collapse
2402.15120 Report Fine-tuning CLIP Text Encoders with Two-step Paraphrasing Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks. This paper introduces ParaCLIP, a fine-tuning approach using synthetic paraphrases to enhance the representation robustness of CLIP's text encoder against linguistic variations in input queries. Current CLIP models struggle with linguistic variations like paraphrases, hindering their effectiveness in real-world applications with diverse user queries. The method involves a two-step paraphrase generation process from web-scale image captions using LLMs. Then, CLIP's text encoder is fine-tuned with these paraphrases while freezing the image encoder. ParaCLIP significantly outperforms baseline CLIP models on tasks like paraphrased retrieval, Visual Genome Relation and Attribution, and semantic textual similarity. The approach demonstrates the effectiveness of leveraging synthetic paraphrases for improving CLIP's robustness to linguistic variations. ParaCLIP maintains competitive performance on standard tasks like zero-shot image classification and text/image retrieval. The method can sometimes degrade performance on certain vision and vision-language tasks, potentially due to the sensitivity of the InfoNCE loss to batch size variations. Future work includes investigating factors contributing to performance degradation and exploring the approach's potential for addressing limitations in compositional understanding. clip, paraphrase, fine-tuning, vision-language model, text-to-image retrieval
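A sketch of what the fine-tuning objective could look like under this setup (frozen image tower, trainable text encoder, generated paraphrases): an InfoNCE term for the original captions plus a term pulling paraphrase embeddings toward the same frozen image embeddings. The exact losses and weighting in ParaCLIP may differ; the toy encoder below merely stands in for CLIP's text transformer.

```python
import torch
import torch.nn.functional as F

def paraphrase_finetune_loss(text_encoder, captions_tok, paraphrases_tok,
                             image_embeds, temperature=0.07):
    """Schematic objective for fine-tuning a CLIP-style text encoder on
    generated paraphrases while the image encoder stays frozen:
    (i) text-image InfoNCE for the original captions and
    (ii) the same contrastive term for their paraphrases."""
    t_cap = F.normalize(text_encoder(captions_tok), dim=-1)       # (B, D)
    t_par = F.normalize(text_encoder(paraphrases_tok), dim=-1)    # (B, D)
    v = F.normalize(image_embeds, dim=-1).detach()                # frozen image tower

    labels = torch.arange(v.size(0), device=v.device)
    loss_cap = F.cross_entropy(t_cap @ v.t() / temperature, labels)
    loss_par = F.cross_entropy(t_par @ v.t() / temperature, labels)
    return loss_cap + loss_par

# usage with a toy text encoder (an embedding bag stands in for CLIP's transformer)
enc = torch.nn.EmbeddingBag(1000, 64)
caps = torch.randint(0, 1000, (8, 16))
paras = torch.randint(0, 1000, (8, 16))
imgs = torch.randn(8, 64)
print(paraphrase_finetune_loss(enc, caps, paras, imgs).item())
```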
2402.14797 Report Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5x faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/. Introduces Snap Video, a scalable, video-first text-to-video generation model that leverages a compressed video representation and joint spatiotemporal modeling to achieve state-of-the-art performance in terms of generation quality, temporal consistency, and motion complexity. Existing text-to-video generation models, often adapted from image models, struggle with motion fidelity, scalability, and maintaining visual quality in videos. This work addresses these limitations by proposing a video-centric approach. The authors propose a modified EDM diffusion framework tailored for high-resolution videos and introduce a scalable transformer-based architecture inspired by FITs, which learns a compressed video representation for efficient joint spatiotemporal modeling. They train their model on a large internal dataset of images and videos. The proposed FIT-based architecture trains 3.31 times faster than U-Nets and performs inference 4.49 times faster, while achieving better generation quality. Snap Video outperforms previous state-of-the-art models on UCF101 and MSR-VTT benchmarks, particularly in metrics evaluating motion quality. User studies show a strong preference for Snap Video over other state-of-the-art methods in terms of photorealism, text alignment, and motion quality. The model exhibits limitations in text rendering accuracy, object count control, complex positional understanding, and handling negations in prompts. Further research can explore higher resolution generation, improved text rendering, and mitigating potential biases present in the training data. text-to-video generation, diffusion models, transformer, compressed video representation, joint spatiotemporal modeling
2402.14792 Report Consolidating Attention Features for Multi-view Image Editing Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry. This paper introduces a novel method for consistent multi-view image editing, enabling significant articulations and shape changes in objects while preserving visual consistency across different views. Existing multi-view editing methods often struggle with maintaining consistency when dealing with complex geometric changes, particularly in tasks involving significant shape modifications. The method leverages a query feature space neural radiance field (QNeRF) trained on the internal query features of edited images generated by a diffusion model. QNeRF consolidates these queries, enhancing consistency during a progressive, iterative denoising process. The proposed method achieves superior visual quality and multi-view consistency compared to alternative approaches like IN2N and TokenFlow. Evaluations based on KID and FID metrics demonstrate that the method retains higher fidelity to the original scene with fewer visual artifacts. User study results show a strong preference for the proposed method, indicating better alignment with the desired edits and higher visual quality in the generated 3D representations. The method inherits limitations of text-to-image models, such as struggling with complex structures like hands and generating inconsistent fine details, particularly in high-frequency textures. The black-box optimization of QNeRF may lead to averaging outlier data, suggesting potential improvements through robust statistics techniques or alternative 3D representations like Gaussian Splats. multi-view image editing, neural radiance fields (nerf), diffusion models, self-attention, 3d consistency
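The query-injection step reduces to a soft blend between the self-attention queries produced during denoising and the 3D-consistent queries rendered by QNeRF for the same view. A minimal sketch, with an illustrative timestep-dependent blending schedule that is an assumption rather than the paper's:

```python
import torch

def inject_consistent_queries(q_generated, q_rendered, t, t_max, gamma_max=0.8):
    """Softly blend the self-attention queries from the current denoising step
    with 3D-consistent queries rendered by the query-space NeRF for this view.
    The schedule (stronger injection at earlier, noisier timesteps) is an
    illustrative choice only."""
    gamma = gamma_max * (t / t_max)          # t counts down from t_max to 0
    return (1.0 - gamma) * q_generated + gamma * q_rendered

q_gen = torch.randn(2, 4096, 320)            # (views, tokens, dim) from a UNet self-attn layer
q_nerf = torch.randn(2, 4096, 320)           # rendered by QNeRF for the same two views
q_mix = inject_consistent_queries(q_gen, q_nerf, t=800, t_max=1000)
print(q_mix.shape)
```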
2402.14780 Report Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot motion customization, we propose Customize-A-Video that models the motion from a single reference video and adapting it to new subjects and scenes with both spatial and temporal varieties. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling from the reference videos. To disentangle the spatial and temporal information during the training pipeline, we introduce a novel concept of appearance absorbers that detach the original appearance from the single reference video prior to motion learning. Our proposed method can be easily extended to various downstream tasks, including custom video generation and editing, video appearance customization, and multiple motion combination, in a plug-and-play fashion. Our project page can be found at https://anonymous-314.github.io. This paper proposes Customize-A-Video, a novel one-shot motion customization method for videos. It leverages the motion learned from a single reference video and applies it to new subjects and scenes with both spatial and temporal variations. Existing text-to-video generation models struggle with precise motion control, while video editing methods often lack temporal variability in motion transfer. This method addresses the need for one-shot motion customization with plausible variations. The method utilizes Temporal LoRA (T-LoRA) applied to temporal attention layers of pre-trained T2V diffusion models. To disentangle spatial and temporal information, an 'Appearance Absorber' module is introduced. This module, trained on unordered video frames, detaches the original appearance from the reference video before motion learning. Customize-A-Video successfully transfers motion from a single reference video to new subjects and scenes with variations in motion intensity, position, and camera view. The proposed T-LoRA effectively captures temporal motion dynamics, outperforming LoRA applications on non-temporal layers. Appearance Absorbers, such as Spatial LoRA (S-LoRA) and Textual Inversion, successfully decompose spatial information, leading to better motion modeling. The standalone finetuning of spatial layers using appearance absorbers may lead to domain shift if overfitting occurs. The model may struggle to learn and transfer motions intrinsically tied to static poses, as these are primarily captured by appearance absorbers. motion customization, text-to-video generation, diffusion models, temporal lora, appearance absorber
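A minimal sketch of Temporal LoRA: wrap the linear projections inside temporal attention layers with frozen-base low-rank adapters while everything else stays frozen. Which submodules count as 'temporal attention' depends on the T2V backbone, so the name-matching predicate below is an assumption.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r=8, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero-init: wrapper starts equal to base
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

def inject_temporal_lora(model: nn.Module, r=8):
    """Freeze the whole T2V model, then wrap the Linear projections of its
    temporal attention layers with LoRA adapters (the only trainable weights)."""
    for p in model.parameters():
        p.requires_grad_(False)
    targets = [m for name, m in model.named_modules()
               if "temporal" in name and "attn" in name]            # assumed naming scheme
    for module in targets:
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear):
                setattr(module, child_name, LoRALinear(child, r=r))
    return model
```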
2402.14767 Report DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models Yuhang Cao, Pan Zhang, Xiaoyi Dong, Dahua Lin, Jiaqi Wang We present DualFocus, a novel framework for integrating macro and micro perspectives within multi-modal large language models (MLLMs) to enhance vision-language task performance. Current MLLMs typically singularly focus on inputs at a predefined resolution, resulting in deficiencies in detailed questions involving local regions. We introduced a DualFocus mechanism where the model concentrates on the image from a macro perspective, responds to the question, and identifies suitable sub-regions to zoom in for subsequent micro perspective analysis. Via the integration of answers from both macro and micro perspectives, the model is adept at addressing tasks that encompass global, detailed, and combined considerations. To endow MLLMs with the DualFocus mechanism, we curated a tailored dataset derived from the Visual Genome (VG) and adapted it to align with the training regimen of DualFocus. Through comparative studies across different model sizes and benchmarks, we demonstrate DualFocus's superiority in balancing detailed examination with holistic insight, significantly reducing hallucination instances in MLLMs and improving their performance in various vision-language tasks. DualFocus, a novel framework that integrates macro and micro perspectives within multi-modal large language models (MLLMs) to enhance vision-language task performance. Current MLLMs struggle to balance detailed examination with holistic insight, often failing on questions requiring understanding of both global context and local details. DualFocus first analyzes the entire image to grasp the macro context. It then identifies and zooms into important sub-regions for detailed examination, combining insights from both perspectives using Perplexity (PPL) for answer selection. DualFocus consistently improves performance across different MLLM architectures (LLaVA, Qwen-VL) and various benchmarks (SEED, MMBench, GQA, TextVQA). It significantly enhances accuracy in tasks requiring detailed perception, like instance attributes and text understanding. DualFocus effectively mitigates hallucination in MLLMs, as demonstrated by improved performance on the POPE benchmark. The current implementation relies on a two-stage training process, which could be streamlined in future work. The effectiveness of DualFocus across a broader range of visual-language tasks, such as image captioning, remains to be explored. multi-modal learning, large language models, visual question answering, fine-grained visual recognition, hallucination mitigation
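A small sketch of the perplexity-based answer selection: score the macro-view answer and the zoomed-in micro-view answer by the exponentiated mean negative log-likelihood of their tokens under the model, and keep the lower-perplexity one. The logits/answer-id interface is a simplification of an actual MLLM decoding loop.

```python
import math
import torch
import torch.nn.functional as F

def answer_perplexity(logits, answer_ids):
    """Perplexity of an answer continuation, i.e. exp(mean negative
    log-likelihood of the answer tokens). logits: (T, V), aligned so that
    logits[i] predicts answer_ids[i]."""
    nll = F.cross_entropy(logits, answer_ids, reduction="mean")
    return math.exp(nll.item())

def dual_focus_select(macro, micro):
    """Pick the final answer by comparing perplexities of the macro-view and
    zoomed-in micro-view answers (each given as a (logits, answer_ids) pair)."""
    ppl_macro = answer_perplexity(*macro)
    ppl_micro = answer_perplexity(*micro)
    return ("macro", ppl_macro) if ppl_macro <= ppl_micro else ("micro", ppl_micro)

# toy usage with random logits standing in for the two decoding passes
vocab = 32000
macro = (torch.randn(6, vocab), torch.randint(0, vocab, (6,)))
micro = (torch.randn(9, vocab), torch.randint(0, vocab, (9,)))
print(dual_focus_select(macro, micro))
```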
2402.14654 Report Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, Thomas Lucas We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and spatial location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person centers, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and spatial location using a new cross-attention module called the Human Perception Head (HPH), with one query per detected center token, attending to the entire set of features. As direct prediction of SMPL-X parameters yields suboptimal results, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating this dataset into training further enhances predictions, particularly for hands, enabling us to achieve state-of-the-art performance. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously. We train models with various backbone sizes and input resolutions. In particular, using a ViT-S backbone and $448\times448$ input images already yields a fast and competitive model with respect to state-of-the-art methods, while considering larger models and higher resolutions further improves performance. This paper introduces Multi-HMR, the first single-shot method for multi-person whole-body human mesh recovery from a single RGB image, which accurately estimates expressive 3D meshes (body, face and hands) and 3D positions in the scene, optionally adapting to camera information. Recovering whole-body human meshes from monocular images is important for various applications, including virtual/augmented reality, human-robot interaction, and human understanding from images and videos. Existing methods are limited to either single-person whole-body or multi-person body-only estimations. Multi-HMR employs a Vision Transformer (ViT) backbone to extract image features and uses a CenterNet-like framework for human detection at the patch level. A novel Human Perception Head (HPH), based on cross-attention, then predicts SMPL-X parameters and depth for each detected individual. Optionally, camera intrinsics can be incorporated via Fourier-encoded ray directions. Multi-HMR outperforms state-of-the-art methods in multi-person body-only mesh recovery, achieving significant gains on benchmarks like 3DPW, MuPoTs, CMU Panoptic, and AGORA. It achieves competitive performance in whole-body mesh recovery compared to single-person methods, demonstrating its ability to accurately estimate hand and facial poses alongside body pose. The model effectively leverages camera intrinsics for accurate 3D position estimation, outperforming previous approaches in human depth estimation on several benchmarks. The patch-level detection may lead to collisions when multiple person-centers fall within the same patch, limiting detection accuracy in crowded scenes. The use of a relative rotation representation for the SMPL-X pose can lead to error accumulation, particularly in extreme body parts like hands and feet. human mesh recovery, whole-body pose estimation, single-shot detection, vision transformer, cross-attention
2402.14650 Report GaussianPro: 3D Gaussian Splatting with Progressive Propagation Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, Xuejin Chen The advent of 3D Gaussian Splatting (3DGS) has recently brought about a revolution in the field of neural rendering, facilitating high-quality renderings at real-time speed. However, 3DGS heavily depends on the initialized point cloud produced by Structure-from-Motion (SfM) techniques. When tackling large-scale scenes that unavoidably contain texture-less surfaces, SfM techniques always fail to produce enough points on these surfaces and cannot provide a good initialization for 3DGS. As a result, 3DGS suffers from difficult optimization and low-quality renderings. In this paper, inspired by classical multi-view stereo (MVS) techniques, we propose GaussianPro, a novel method that applies a progressive propagation strategy to guide the densification of the 3D Gaussians. Compared to the simple split and clone strategies used in 3DGS, our method leverages the priors of the existing reconstructed geometries of the scene and patch matching techniques to produce new Gaussians with accurate positions and orientations. Experiments on both large-scale and small-scale scenes validate the effectiveness of our method, which significantly surpasses 3DGS on the Waymo dataset, exhibiting an improvement of 1.15 dB in terms of PSNR. GaussianPro, a novel progressive propagation strategy to guide Gaussian densification in 3D Gaussian Splatting (3DGS) for improved rendering quality and compactness, especially in texture-less regions. 3DGS relies on initialized point clouds from SfM, which often fails in texture-less regions, leading to difficulties in optimization and low-quality renderings. The method utilizes a hybrid 3D-2D representation and iteratively propagates depth and normal information from neighboring pixels via patch matching. New Gaussians are initialized based on pixels with significant depth differences between rendered and propagated depth maps. Additionally, a planar loss is incorporated to regularize the geometry of Gaussians. Significant improvement over 3DGS on the Waymo dataset, with a 1.15 dB PSNR increase. Comparable results to state-of-the-art methods on the MipNeRF360 dataset, with improvements in weak-texture regions. Robustness against sparse training images, outperforming 3DGS with different training view ratios. Lacks specific modeling for dynamic objects, leading to potential artifacts. Future work includes incorporating dynamic Gaussian techniques to handle dynamic objects. 3d gaussian splatting, neural rendering, novel view synthesis, progressive propagation, gaussian densification
2402.14586 Report FrameNeRF: A Simple and Efficient Framework for Few-shot Novel View Synthesis Yan Xing, Pan Wang, Ligang Liu, Daolun Li, Li Zhang We present a novel framework, called FrameNeRF, designed to apply off-the-shelf fast high-fidelity NeRF models with fast training speed and high rendering quality for few-shot novel view synthesis tasks. The training stability of fast high-fidelity models is typically constrained to dense views, making them unsuitable for few-shot novel view synthesis tasks. To address this limitation, we utilize a regularization model as a data generator to produce dense views from sparse inputs, facilitating subsequent training of fast high-fidelity models. Since these dense views are pseudo ground truth generated by the regularization model, original sparse images are then used to fine-tune the fast high-fidelity model. This process helps the model learn realistic details and correct artifacts introduced in earlier stages. By leveraging an off-the-shelf regularization model and a fast high-fidelity model, our approach achieves state-of-the-art performance across various benchmark datasets. FrameNeRF, a novel framework that leverages off-the-shelf regularization and fast high-fidelity NeRF models for few-shot novel view synthesis. Fast high-fidelity NeRF models struggle with few-shot scenarios due to overfitting. This work introduces a framework to utilize their strengths in rendering quality and training speed for few-shot tasks. Three stage training process: 1) Train a regularization model on sparse views and generate dense pseudo-ground-truth images. 2) Train a fast high-fidelity model on these dense views. 3) Fine-tune the high-fidelity model on the original sparse views. Achieves state-of-the-art performance on Blender, LLFF, and DTU datasets for few-shot novel view synthesis. Demonstrates the effectiveness of the three-stage training process through ablation studies. Shows flexibility in choosing sub-modules and their impact on handling artifacts and reconstructing details. The choice of sub-modules (regularization and high-fidelity models) impacts the performance and requires careful selection. The framework's reliance on existing models might limit its performance improvement compared to developing novel, specialized models. novel view synthesis, neural radiance fields (nerf), few-shot learning, regularization, 3d reconstruction
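The three-stage schedule above is simple enough to state as pseudocode. The sketch below assumes generic reg_model/fast_model objects with fit() and render() methods; the concrete regularization and fast high-fidelity models are interchangeable, as the paper emphasizes.

```python
def framenerf_pipeline(sparse_views, reg_model, fast_model, pseudo_poses):
    """Three-stage FrameNeRF-style schedule (sketch; the model objects and their
    fit/render methods are assumed interfaces, not a specific library)."""
    # Stage 1: fit a few-shot regularized NeRF on the sparse inputs.
    reg_model.fit(sparse_views)
    # Stage 2: render dense pseudo-ground-truth views and train the fast
    # high-fidelity model on them.
    dense_views = [reg_model.render(pose) for pose in pseudo_poses]
    fast_model.fit(dense_views)
    # Stage 3: fine-tune on the original sparse views to recover real detail
    # and correct artifacts inherited from the pseudo views.
    fast_model.fit(sparse_views)
    return fast_model
```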
2402.14577 Report Debiasing Text-to-Image Diffusion Models Ruifei He, Chuhui Xue, Haoru Tan, Wenqing Zhang, Yingchen Yu, Song Bai, Xiaojuan Qi Learning-based Text-to-Image (TTI) models like Stable Diffusion have revolutionized the way visual content is generated in various domains. However, recent research has shown that nonnegligible social bias exists in current state-of-the-art TTI systems, which raises important concerns. In this work, we target resolving the social bias in TTI diffusion models. We begin by formalizing the problem setting and use the text descriptions of bias groups to establish an unsafe direction for guiding the diffusion process. Next, we simplify the problem into a weight optimization problem and attempt a Reinforcement solver, Policy Gradient, which shows sub-optimal performance with slow convergence. Further, to overcome limitations, we propose an iterative distribution alignment (IDA) method. Despite its simplicity, we show that IDA shows efficiency and fast convergence in resolving the social bias in TTI diffusion models. Our code will be released. This paper proposes an iterative distribution alignment (IDA) method to resolve social bias (gender and ethnicity) in text-to-image diffusion models. Current text-to-image models exhibit significant social biases, raising ethical concerns regarding the generation of millions of biased synthetic data. The method utilizes text descriptions of bias groups to guide the diffusion process, iteratively adjusting weights assigned to these descriptions to achieve a balanced distribution in generated images. IDA successfully reduces gender and ethnic bias in generated images, achieving a more balanced representation. The method demonstrates fast convergence, typically requiring only 1-3 iterations to achieve significant debiasing. IDA effectively mitigates gender bias across various occupations, even those with extreme initial biases. The algorithm needs to be re-run for each new prompt, potentially limiting its practical application. While effective, the method lacks a formal explanation for its success, warranting further investigation. text-to-image synthesis, diffusion models, social bias, debiasing, ethics in ai
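A rough picture of the iterative reweighting loop is given below. The update rule shown is an assumption for illustration (the entry above does not state the exact IDA formula); generate_counts is a hypothetical callable that reports how a generated batch splits across bias groups, e.g. via an attribute classifier.

```python
import numpy as np

def iterative_distribution_alignment(generate_counts, n_groups, iters=3, step=1.0):
    """IDA-flavoured reweighting sketch: adjust per-group guidance weights until
    the observed group frequencies in generated images approach uniform."""
    weights = np.full(n_groups, 1.0 / n_groups)
    target = np.full(n_groups, 1.0 / n_groups)
    for _ in range(iters):
        counts = np.asarray(generate_counts(weights), dtype=float)
        observed = counts / max(counts.sum(), 1.0)
        # Upweight under-represented groups, downweight over-represented ones.
        weights *= (target / np.clip(observed, 1e-6, None)) ** step
        weights /= weights.sum()
    return weights
```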
2402.14401 Report Diffusion Model Based Visual Compensation Guidance and Visual Difference Analysis for No-Reference Image Quality Assessment Zhaoyang Wang, Bo Hu, Mingyang Zhang, Jie Li, Leida Li, Maoguo Gong, Xinbo Gao Existing free-energy guided No-Reference Image Quality Assessment (NR-IQA) methods still struggle to balance learning pixel-level feature information with capturing high-level feature information, and the efficient utilization of the obtained high-level features remains a challenge. As a novel class of state-of-the-art (SOTA) generative models, diffusion models can model intricate relationships, enabling a comprehensive understanding of images and better learning of both high-level and low-level visual features. In view of these properties, we pioneer the exploration of diffusion models in the domain of NR-IQA. Firstly, we devise a new diffusion restoration network that leverages the produced enhanced image and noise-containing images, incorporating nonlinear features obtained during the denoising process of the diffusion model, as high-level visual information. Secondly, two visual evaluation branches are designed to comprehensively analyze the obtained high-level feature information. These include the visual compensation guidance branch, grounded in the transformer architecture and noise embedding strategy, and the visual difference analysis branch, built on the ResNet architecture and the residual transposed attention block. Extensive experiments are conducted on seven public NR-IQA datasets, and the results demonstrate that the proposed model outperforms SOTA methods for NR-IQA. This paper proposes DiffV^2IQA, a novel NR-IQA model that leverages a diffusion model for image restoration and introduces two visual evaluation branches for enhanced quality assessment. Existing NR-IQA methods struggle to balance pixel-level and high-level feature learning, particularly in authentic distortion scenarios. This work addresses these limitations by employing the intricate modeling capabilities of diffusion models. The method employs a diffusion restoration network to generate an enhanced image and noise-containing images. Two branches then analyze this information: a visual compensation guidance branch (ViT-based with noise embedding) and a visual difference analysis branch (ResNet-based with a novel RTAB module). DiffV^2IQA outperforms SOTA NR-IQA methods on several synthetic distortion datasets (LIVE, CSIQ, TID2013, Kadid10k). The model demonstrates strong generalization ability, achieving top performance in cross-database evaluations. Ablation studies validate the contribution of each component, highlighting the importance of the diffusion model, noise embedding, and the dual-branch evaluation strategy. The pre-training requirement of the diffusion restoration network adds complexity and introduces dataset dependency. The iterative nature of the diffusion model increases inference time. no-reference image quality assessment, diffusion model, transformer, visual compensation guidance, visual difference analysis
2402.14327 Report Subobject-level Image Tokenization Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung Transformer-based vision models typically tokenize images into fixed-size square patches as input units, which lacks the adaptability to image content and overlooks the inherent pixel grouping structure. Inspired by the subword tokenization widely adopted in language models, we propose an image tokenizer at a subobject level, where the subobjects are represented by semantically meaningful image segments obtained by segmentation models (e.g., segment anything models). To implement a learning system based on subobject tokenization, we first introduce a Direct Segment Anything Model (DirectSAM) that efficiently produces comprehensive segmentations of subobjects, then embed the subobjects into compact latent vectors and feed them into a large language model for vision-language learning. Empirical results demonstrate that our subobject-level tokenization significantly facilitates efficient learning of translating images into object and attribute descriptions compared to the traditional patch-level tokenization. Codes and models are open-sourced at https://github.com/ChenDelong1999/subobjects. This paper introduces "subobject"-level image tokenization for vision-language learning, leveraging semantically meaningful image segments instead of fixed-size patches. Current Transformer-based vision models rely on patch-level tokenization, ignoring semantic boundaries and leading to inefficient learning. The authors propose DirectSAM for efficient subobject segmentation and a Sequence-to-sequence AutoEncoder (SeqAE) for embedding subobjects into compact vectors. These embeddings are then integrated into a Large Language Model (LLM) for vision-language tasks. Subobject-level tokenization significantly accelerates vision-language learning compared to patch-level tokenization. Models with subobject tokenization achieve higher accuracy in object counting. Subobject-based models demonstrate superior performance in recognizing visual attributes like size, material, and shape. The current implementation relies on synthetic datasets for evaluation. Exploration of different subobject segmentation methods and their impact on downstream tasks. image tokenization, vision-language learning, subobject segmentation, large language models, segment anything model
2402.14316 Report Place Anything into Any Video Ziling Liu, Jinyu Yang, Mingqi Gao, Feng Zheng Controllable video editing has demonstrated remarkable potential across diverse applications, particularly in scenarios where capturing or re-capturing real-world videos is either impractical or costly. This paper introduces a novel and efficient system named Place-Anything, which facilitates the insertion of any object into any video solely based on a picture or text description of the target object or element. The system comprises three modules: 3D generation, video reconstruction, and 3D target insertion. This integrated approach offers an efficient and effective solution for producing and editing high-quality videos by seamlessly inserting realistic objects. Through a user study, we demonstrate that our system can effortlessly place any object into any video using just a photograph of the object. Our demo video can be found at https://youtu.be/afXqgLLRnTE. Please also visit our project page https://place-anything.github.io to get access. Introduces "Place-Anything," a novel system for inserting objects into any video using only a picture or text description, enabling easy video editing and creation without 3D modeling expertise. Addresses the challenge of expensive and time-consuming video editing by enabling users to easily insert virtual objects into videos using simple inputs like photos or text descriptions, opening possibilities for various applications like product advertisements and VR experiences. Uses a three-module approach: (1) 3D model generation from image/text using a diffusion-based Gaussian model; (2) Video reconstruction to estimate camera parameters and depth maps via optical flow and bundle adjustment; (3) 3D target insertion, projecting the selected region to 3D space and rendering the 3D model into the video. Generates 3D models with high visual fidelity to input images or text. Accurately inserts 3D objects even in textureless regions by leveraging optical flow for precise tracking. Successfully infers camera parameters and seamlessly integrates 3D models into diverse video footage. Current implementation requires user intervention to select object placement region. Further exploration of automatic object placement and interaction with the environment. video editing, 3d model generation, object insertion, computer vision, deep learning
2402.14253 Report MVD$^2$: Efficient Multiview 3D Reconstruction for Multiview Diffusion Xin-Yang Zheng, Hao Pan, Yu-Xiao Guo, Xin Tong, Yang Liu As a promising 3D generation technique, multiview diffusion (MVD) has received a lot of attention due to its advantages in terms of generalizability, quality, and efficiency. By finetuning pretrained large image diffusion models with 3D data, the MVD methods first generate multiple views of a 3D object based on an image or text prompt and then reconstruct 3D shapes with multiview 3D reconstruction. However, the sparse views and inconsistent details in the generated images make 3D reconstruction challenging. We present MVD$^2$, an efficient 3D reconstruction method for multiview diffusion (MVD) images. MVD$^2$ aggregates image features into a 3D feature volume by projection and convolution and then decodes volumetric features into a 3D mesh. We train MVD$^2$ with 3D shape collections and MVD images prompted by rendered views of 3D shapes. To address the discrepancy between the generated multiview images and ground-truth views of the 3D shapes, we design a simple-yet-efficient view-dependent training scheme. MVD$^2$ improves the 3D generation quality of MVD and is fast and robust to various MVD methods. After training, it can efficiently decode 3D meshes from multiview images within one second. We train MVD$^2$ with Zero-123++ and the Objaverse-LVIS 3D dataset and demonstrate its superior performance in generating 3D models from multiview images generated by different MVD methods, using both synthetic and real images as prompts. This paper presents MVD$^2$, an efficient multiview 3D reconstruction method specifically designed to address the challenges of sparse views and inconsistent details in images generated by Multiview Diffusion (MVD) models. Existing 3D reconstruction techniques struggle with the unique characteristics of MVD-generated images, leading to low-quality 3D models. MVD$^2$ aims to improve the quality and efficiency of 3D generation using MVD. MVD$^2$ employs a lightweight neural network that aggregates image features from multiple views into a 3D feature volume. It then decodes this volume into a differentiable 3D mesh. To address inconsistencies, a view-dependent training scheme is introduced, prioritizing pixel-level alignment at the reference view and structural similarity at other views. MVD$^2$ significantly improves the quality of 3D reconstruction from MVD images, outperforming methods like NeuS in metrics such as SSIM and LPIPS. The method is highly efficient, capable of decoding a 3D mesh from MVD images within one second. Demonstrating strong generalizability, MVD$^2$ effectively reconstructs 3D shapes from images generated by various MVD models, including those conditioned on text and images. Limitations: Struggles with reconstructing unseen geometry if hidden in all input views. Performance degrades with significant inconsistencies between input MVD images. Future Work: Explore higher grid resolutions for finer detail reconstruction. Address inconsistencies and inpainting challenges in texture mapping. 3d reconstruction, multiview diffusion, view synthesis, deep learning, computer vision
2402.14167 Report T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching Zizheng Pan, Bohan Zhuang, De-An Huang, Weili Nie, Zhiding Yu, Chaowei Xiao, Jianfei Cai, Anima Anandkumar Sampling from diffusion probabilistic models (DPMs) is often expensive for high-quality image generation and typically requires many steps with a large model. In this paper, we introduce sampling Trajectory Stitching T-Stitch, a simple yet efficient technique to improve the sampling efficiency with little or no generation degradation. Instead of solely using a large DPM for the entire sampling trajectory, T-Stitch first leverages a smaller DPM in the initial steps as a cheap drop-in replacement of the larger DPM and switches to the larger DPM at a later stage. Our key insight is that different diffusion models learn similar encodings under the same training data distribution and smaller models are capable of generating good global structures in the early steps. Extensive experiments demonstrate that T-Stitch is training-free, generally applicable for different architectures, and complements most existing fast sampling techniques with flexible speed and quality trade-offs. On DiT-XL, for example, 40% of the early timesteps can be safely replaced with a 10x faster DiT-S without performance drop on class-conditional ImageNet generation. We further show that our method can also be used as a drop-in technique to not only accelerate the popular pretrained stable diffusion (SD) models but also improve the prompt alignment of stylized SD models from the public model zoo. Code is released at https://github.com/NVlabs/T-Stitch Introduces T-Stitch, a technique to accelerate diffusion model sampling by using smaller models in early denoising steps and larger models in later steps. Sampling from large diffusion models is computationally expensive, limiting practical applications. Leverages the observation that different diffusion models learn similar latent representations, allowing direct stitching of models at different timesteps. Allocates smaller models to early steps and larger models to later steps. Achieves up to 1.7x speedup with negligible performance drop on DiT models for ImageNet generation. Demonstrates general applicability across architectures (DiT, U-Net) and samplers (DDPM, DDIM, DPM-Solver). Shows compatibility and improvement with Stable Diffusion, including acceleration and enhanced prompt alignment for stylized models. Relies on the availability of a smaller model trained on the same data distribution. Introduces a slight increase in memory usage due to loading an additional model. diffusion models, sampling acceleration, trajectory stitching, model compression, text-to-image generation
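The core mechanism, swapping the denoiser partway along the sampling trajectory, fits in a few lines. The sketch below assumes both models share an epsilon-prediction interface and leaves the actual sampler update as a user-supplied step_fn; it is a schematic of T-Stitch, not the released implementation.

```python
import torch

@torch.no_grad()
def stitched_denoising(small, large, x, timesteps, step_fn, switch_frac=0.4):
    """Trajectory-stitching sketch: run a cheaper denoiser for the first
    `switch_frac` of the timesteps (where global structure is formed), then
    switch to the large one for the remaining steps. `step_fn(x, eps, t)`
    stands in for whatever sampler update (DDIM, DPM-Solver, ...) is in use."""
    n_small = int(len(timesteps) * switch_frac)
    for i, t in enumerate(timesteps):
        model = small if i < n_small else large
        eps = model(x, t)       # assumed shared epsilon-prediction interface
        x = step_fn(x, eps, t)  # one denoising update
    return x
```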
2402.14000 Report Real-time 3D-aware Portrait Editing from a Single Image Qingyan Bai, Zifan Shi, Yinghao Xu, Hao Ouyang, Qiuyu Wang, Ceyuan Yang, Xuan Wang, Gordon Wetzstein, Yujun Shen, Qifeng Chen This work presents 3DPE, a practical method that can efficiently edit a face image following given prompts, like reference images or text descriptions, in a 3D-aware manner. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and superior editing capability, respectively. Such a design brings two compelling advantages over existing approaches. First, our system achieves real-time editing with a feedforward network (i.e., ~0.04s per image), over 100x faster than the second competitor. Second, thanks to the powerful priors, our module could focus on the learning of editing-related variations, such that it manages to handle various types of editing simultaneously in the training phase and further supports fast adaptation to user-specified customized types of editing during inference (e.g., with ~5min fine-tuning per style). The code, the model, and the interface will be made publicly available to facilitate future research. Presents 3DPE, a real-time 3D-aware portrait editing method that uses image or text prompts for editing face images in a 3D-consistent manner. Real-time 3D portrait editing is crucial for AR/VR, 3D telepresence, and video conferencing, but existing methods are either slow or lack 3D consistency. Distills knowledge from a 3D portrait generator (Live3D) and a text-guided image editing model (InstructPix2Pix) into a lightweight module, allowing for real-time editing while maintaining 3D consistency. Achieves real-time editing speed of 40ms per image on a standard GPU. Exhibits superior 3D consistency, accurate texture alignment, and better identity preservation compared to baselines. Supports fast adaptation to user-specified editing prompts in just 5 minutes using 10 image pairs. Novel view rendering can have inconsistencies in details due to reliance on a super-resolution module. Video editing can have flickering artifacts as the model is designed for image editing. Future work can focus on addressing these limitations and exploring higher-quality 3D representations. 3d-aware portrait editing, real-time editing, knowledge distillation, single image editing, customized prompt adaptation
2402.13929 Report SDXL-Lightning: Progressive Adversarial Diffusion Distillation Shanchuan Lin, Anran Wang, Xiao Yang We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights. This paper introduces SDXL-Lightning, a novel progressive adversarial diffusion distillation method that achieves state-of-the-art one-step/few-step 1024px text-to-image generation. Diffusion models are computationally expensive due to the iterative sampling procedure. This work significantly reduces the required steps for fast, high-quality image generation. This work combines progressive distillation with a novel adversarial objective that utilizes the diffusion model's U-Net encoder as the discriminator backbone. It also introduces several techniques for stable training, schedule modification, and mode coverage relaxation. The proposed method achieves superior image quality compared to other state-of-the-art distillation methods like SDXL-Turbo and LCM, especially in high-resolution details. The method allows for flexible control over the generated images, demonstrated through compatibility with ControlNet for conditioning on canny edges and depth maps. The authors open-source SDXL-Lightning, offering both full UNet weights and lightweight LoRA modules for plug-and-play use with other base models. The current method requires separate checkpoints for each inference step setting, unlike some other approaches that utilize a single checkpoint. The authors believe the UNet architecture might not be optimal for one-step generation, suggesting exploration of more efficient architectures as future work. diffusion models, text-to-image generation, model distillation, adversarial training, sdxl
2402.13729 Report Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation Kihong Kim, Haneol Lee, Jihye Park, Seyeon Kim, Kwanghee Lee, Seungryong Kim, Jaejun Yoo Generating high-quality videos that synthesize desired realistic content is a challenging task due to their intricate high-dimensionality and complexity. Several recent diffusion-based methods have shown comparable performance by compressing videos to a lower-dimensional latent space, using traditional video autoencoder architectures. However, such methods, which employ standard frame-wise 2D and 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which can capture spatio-temporal dependencies more effectively. The HVDM is trained by a hybrid video autoencoder which extracts a disentangled representation of the video including (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving the video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control). This paper presents HVDM, a novel hybrid video diffusion model for high-quality video generation. HVDM leverages a hybrid video autoencoder combining 2D triplane projections for global context and 3D wavelet representations for local volume information, enhancing video encoding and generation. Generating high-quality videos is challenging due to their high dimensionality and complexity. Existing methods struggle to balance efficiency and the ability to capture spatio-temporal dependencies effectively. HVDM addresses these challenges by combining the strengths of 2D and 3D representations in a novel autoencoder architecture. HVDM employs a hybrid video autoencoder that extracts a disentangled representation: (1) global context via 2D projected latents from triplane representations, (2) local volume information via 3D CNNs with wavelet decomposition, and (3) frequency information for improved reconstruction. A diffusion model trained on this latent space generates videos. HVDM achieves state-of-the-art video generation quality on benchmarks like UCF101, SkyTimelapse, and TaiChi, outperforming existing methods in both quantitative metrics and qualitative visual fidelity. The hybrid autoencoder effectively captures both global context and local details, leading to more realistic and coherent video generation. The use of wavelet decomposition and frequency matching loss contributes to preserving finer details and improving reconstruction quality. The paper acknowledges limitations in applying the model to large-scale text-to-video generation tasks due to computational resources. Future work will explore diffusion model architectures specifically designed for the hybrid latent space and investigate more efficient wavelet filter banks for video. video generation, diffusion models, video autoencoders, triplane representation, wavelet transform
2402.13616 Report YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao Today's deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore a fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, large amount of information will be lost. This paper will delve into the important issues of data loss when data is transmitted through deep networks, namely information bottleneck and reversible functions. We proposed the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture -- Generalized Efficient Layer Aggregation Network (GELAN), based on gradient path planning is designed. GELAN's architecture confirms that PGI has gained superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO dataset based object detection. The results show that GELAN only uses conventional convolution operators to achieve better parameter utilization than the state-of-the-art methods developed based on depth-wise convolution. PGI can be used for variety of models from lightweight to large. It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained using large datasets, the comparison results are shown in Figure 1. The source codes are at: https://github.com/WongKinYiu/yolov9. Proposed YOLOv9, a new object detection system leveraging Programmable Gradient Information (PGI) and a novel Generalized Efficient Layer Aggregation Network (GELAN) architecture. Addresses information loss during feedforward in deep networks (information bottleneck), enabling reliable gradient generation and efficient training even for lightweight models. Introduces PGI, comprising a main branch for inference, an auxiliary reversible branch for reliable gradient generation, and multi-level auxiliary information to handle error accumulation in deep supervision. Also designs GELAN, generalizing ELAN architecture to support diverse computational blocks for flexibility and efficiency. YOLOv9 achieves state-of-the-art performance on MS COCO, outperforming existing real-time object detectors across various model sizes. GELAN demonstrates strong and stable performance with diverse computational blocks and depths, enabling flexible model design for various hardware. PGI effectively mitigates information bottleneck and improves accuracy in both lightweight and deep models, enabling better gradient utilization and accurate data-target mapping. Further exploration of reversible architectures and integration networks for PGI can potentially yield additional performance gains. The study primarily focuses on object detection; applying PGI to other computer vision tasks can further validate its effectiveness. object detection, deep learning, information bottleneck, reversible architectures, auxiliary supervision
2402.13573 Report ToDo: Token Downsampling for Efficient Generation of High-Resolution Images Ethan Smith, Nayan Saxena, Aninda Saha Attention mechanism has been crucial for image diffusion models, however, their quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose a novel training-free method ToDo that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity. This paper proposes ToDo, a training-free token downsampling method to accelerate Stable Diffusion inference by leveraging the inherent spatial redundancy in images to reduce the computational burden of attention. The quadratic computational complexity of attention in image diffusion models limits the image sizes that can be processed efficiently. Sparse attention mechanisms offer a solution but often require training-time modifications, introducing logistical overheads. ToDo downsamples key and value tokens using a Nearest-Neighbor algorithm based on spatial contiguity, reducing the token count while preserving query tokens, and eliminating the need for computationally expensive similarity calculations. ToDo achieves up to 2x speedup for common image sizes and up to 4.5x or more for high resolutions (e.g., 2048x2048) compared to standard Stable Diffusion. ToDo outperforms previous methods like ToMe in balancing inference speed and generated image fidelity, as demonstrated by lower MSE and comparable HPF values. Analysis of latent features in Stable Diffusion's U-Net reveals high redundancy among spatially adjacent tokens, supporting the principle behind ToDo. The differentiability of ToDo and its potential for efficient fine-tuning of Stable Diffusion at larger image dimensions remain unexplored. Further investigation is needed to determine the generalizability of ToDo's benefits to other attention-based generative image models. image generation, diffusion models, stable diffusion, attention mechanism, sparse attention
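Since the method is training-free, the essential change is confined to the attention call: keys and values are downsampled on their spatial grid while queries stay at full resolution. The sketch below is a PyTorch approximation of that idea; where exactly the downsampling is applied inside Stable Diffusion's U-Net, and other details, follow the paper rather than this snippet.

```python
import torch
import torch.nn.functional as F

def todo_attention(q, k, v, h, w, factor=2):
    """ToDo-style attention sketch: queries stay at full resolution while keys
    and values are nearest-neighbour downsampled on their 2D token grid.
    q, k, v: (batch, h*w, dim) tensors for an h x w latent grid."""
    b, n, d = k.shape
    assert n == h * w

    def down(tokens):
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        grid = F.interpolate(grid, scale_factor=1.0 / factor, mode="nearest")
        return grid.flatten(2).transpose(1, 2)  # (b, (h*w) / factor**2, d)

    k_ds, v_ds = down(k), down(v)
    # The full query set attends to the reduced key/value set,
    # cutting attention cost by roughly factor**2.
    return F.scaled_dot_product_attention(q, k_ds, v_ds)

# Example shapes: a 64x64 latent grid with 320-dim tokens.
# q = k = v = torch.randn(1, 64 * 64, 320); out = todo_attention(q, k, v, 64, 64)
```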
2402.13490 Report Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models Chen Wu, Fernando De la Torre Text-to-image diffusion models have achieved remarkable performance in image synthesis, while the text interface does not always provide fine-grained control over certain image factors. For instance, changing a single token in the text can have unintended effects on the image. This paper shows that a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models. The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens: the positive prompt describes the image to be synthesized, and the baseline prompt serves as a "baseline" that disentangles other factors. Contrastive Guidance is a general method whose benefits we illustrate in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors. This paper proposes a simple but effective method, Contrastive Guidance, which leverages contrastive prompts to disentangle image factors in text-to-image diffusion models, leading to fine-grained control over image generation. Text-to-image diffusion models often lack fine-grained control, as changing even a single token can lead to unintended consequences in the generated image. This method addresses this challenge by allowing for more precise manipulation of specific image factors. The method introduces a baseline prompt alongside the positive prompt, where the baseline prompt helps to isolate the intended image factor by providing a contrasting reference. The difference between the score functions of these prompts guides the denoising process, enhancing control over the desired image aspect. Contrastive Guidance shows improved disentanglement compared to classifier-free guidance, enabling more precise control over image attributes, backgrounds, and objects. The method effectively guides domain-specific diffusion models, improving realism and domain specificity while maintaining consistency with text prompts. Contrastive Guidance proves beneficial for zero-shot image editing, strengthening intended edits and improving content preservation in tasks like style transfer and object manipulation. The assumption of an adaptive temperature parameter to simplify calculations might not hold true across all domains. Further research is needed to understand the impact of different prompt pair choices on the performance and potential biases. text-to-image synthesis, diffusion models, disentanglement, contrastive learning, image editing
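One plausible way to read the guidance combination is sketched below: the usual classifier-free guidance term toward the positive prompt plus a contrastive term along the difference between the positive and baseline noise predictions, which isolates the intended factor. The weights and the exact combination are assumptions for illustration, not the paper's formula.

```python
def contrastive_guidance_eps(eps_uncond, eps_pos, eps_base, w_cfg=7.5, w_con=3.0):
    """Combine noise predictions from the unconditional, positive-prompt and
    baseline-prompt passes of a diffusion model (all same-shaped tensors).
    This is a sketch of the general idea, not the published formulation."""
    return (
        eps_uncond
        + w_cfg * (eps_pos - eps_uncond)   # standard classifier-free guidance
        + w_con * (eps_pos - eps_base)     # push along the intended factor only
    )
```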
2402.13404 Report Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control Denis Lukovnikov, Asja Fischer While text-to-image diffusion models can generate high-quality images from textual descriptions, they generally lack fine-grained control over the visual composition of the generated images. Some recent works tackle this problem by training the model to condition the generation process on additional input describing the desired image layout. Arguably the most popular among such methods, ControlNet, enables a high degree of control over the generated image using various types of conditioning inputs (e.g. segmentation maps). However, it still lacks the ability to take into account localized textual descriptions that indicate which image region is described by which phrase in the prompt. In this work, we show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions using a training-free approach that modifies the cross-attention scores during generation. We adapt and investigate several existing cross-attention control methods in the context of ControlNet and identify shortcomings that cause failure (concept bleeding) or image degradation under specific conditions. To address these shortcomings, we develop a novel cross-attention manipulation method in order to maintain image quality while improving control. Qualitative and quantitative experimental studies focusing on challenging cases are presented, demonstrating the effectiveness of the investigated general approach, and showing the improvements obtained by the proposed cross-attention control method. This paper enables ControlNet to use localized textual descriptions for the layout-to-image task through a training-free approach that modifies cross-attention scores during generation. ControlNet provides strong layout control via conditioning inputs such as segmentation maps, but it cannot take into account which image region is described by which phrase in the prompt, and existing cross-attention control methods can cause concept bleeding or image degradation under specific conditions. The authors adapt and investigate several existing cross-attention control methods in the context of ControlNet, identify their shortcomings, and develop a novel cross-attention manipulation method that maintains image quality while improving control. Qualitative and quantitative studies on challenging cases demonstrate the effectiveness of the general approach. The proposed cross-attention manipulation method improves control over localized descriptions while better preserving image quality than the adapted existing methods. layout-to-image generation, controlnet, cross-attention control, diffusion models, localized descriptions
2402.13369 Report The Uncanny Valley: A Comprehensive Analysis of Diffusion Models Karam Ghanem, Danilo Bzdok Through Diffusion Models (DMs), we have made significant advances in generating high-quality images. Our exploration of these models delves deeply into their core operational principles by systematically investigating key aspects across various DM architectures: i) noise schedules, ii) samplers, and iii) guidance. Our comprehensive examination of these models sheds light on their hidden fundamental mechanisms, revealing the concealed foundational elements that are essential for their effectiveness. Our analyses emphasize the hidden key factors that determine model performance, offering insights that contribute to the advancement of DMs. Past findings show that the configuration of noise schedules, samplers, and guidance is vital to the quality of generated images; however, models reach a stable level of quality across different configurations at a remarkably similar point, revealing that the decisive factors for optimal performance predominantly reside in the diffusion process dynamics and the structural design of the model's network, rather than the specifics of configuration details. Our comparative analysis reveals that Denoising Diffusion Probabilistic Model (DDPM)-based diffusion dynamics consistently outperform the Noise Conditioned Score Network (NCSN)-based ones, not only when evaluated in their original forms but also when continuous through Stochastic Differential Equation (SDE)-based implementations. This paper presents a comprehensive analysis of Diffusion Models (DMs), focusing on noise schedules, samplers, and guidance to understand their impact on image generation quality. DMs have revolutionized image generation but their complex dynamics are not fully understood. This work aims to clarify the key drivers of DM performance for future model development. The authors conduct systematic benchmarking of various DM architectures (DDPMs, NCSNs, SDE-based) trained on CIFAR10 and FFHQ datasets. They analyze the impact of different noise schedules, samplers, and guidance mechanisms on Inception Score (IS) and visual quality of generated images. DDPM-based diffusion dynamics consistently outperform NCSN-based ones across different configurations and datasets. The choice of noise schedule and sampler influences convergence speed, but DDPM-based schedules (cosine, sigmoid) generally excel. Classifier Guidance does not inherently enhance overall image quality and its impact is negligible compared to the diffusion process and network design. The study primarily focuses on IS and visual inspection, which might not fully capture all aspects of image quality. Future work could explore the interplay of network design and diffusion process in more depth, potentially leading to novel DM architectures. diffusion models, image generation, noise schedules, samplers, classifier guidance
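For reference, two of the noise schedules compared in such studies can be written down directly; the cosine schedule follows the standard formulation, while the sigmoid parameterization varies between papers and is shown here in one common form.

```python
import numpy as np

def cosine_alpha_bar(t, s=0.008):
    """Cosine schedule (Nichol & Dhariwal); t is normalized to [0, 1]."""
    f = lambda u: np.cos((u + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0.0)

def sigmoid_alpha_bar(t, start=-3.0, end=3.0, tau=1.0):
    """One common sigmoid-schedule parameterization; exact constants differ
    between papers, so treat this as illustrative."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    v_start, v_end = sig(start / tau), sig(end / tau)
    return (v_end - sig((t * (end - start) + start) / tau)) / (v_end - v_start)

t = np.linspace(0.0, 1.0, 6)
print(np.round(cosine_alpha_bar(t), 3))   # decays from 1 toward 0
print(np.round(sigmoid_alpha_bar(t), 3))  # decays from 1 to 0
```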
2402.13349 Report Aria Everyday Activities Dataset Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, Kiran Somasundaram, Luis Pesqueira, Mark Schwesinger, Omkar Parkhi, Qiao Gu, Renzo De Nardi, Shangyi Cheng, Steve Saarinen, Vijay Baiyya, Yuyang Zou, Richard Newcombe, Jakob Julian Engel, Xiaqing Pan, Carl Ren We present Aria Everyday Activities (AEA) Dataset, an egocentric multimodal open dataset recorded using Project Aria glasses. AEA contains 143 daily activity sequences recorded by multiple wearers in five geographically diverse indoor locations. Each of the recording contains multimodal sensor data recorded through the Project Aria glasses. In addition, AEA provides machine perception data including high frequency globally aligned 3D trajectories, scene point cloud, per-frame 3D eye gaze vector and time aligned speech transcription. In this paper, we demonstrate a few exemplar research applications enabled by this dataset, including neural scene reconstruction and prompted segmentation. AEA is an open source dataset that can be downloaded from https://www.projectaria.com/datasets/aea/. We are also providing open-source implementations and examples of how to use the dataset in Project Aria Tools https://github.com/facebookresearch/projectaria_tools. The Aria Everyday Activities (AEA) dataset is an open dataset of egocentric multimodal data captured using Project Aria glasses. It contains 143 daily activity sequences in diverse indoor locations, featuring high-frequency 6DoF trajectories, scene point clouds, 3D eye gaze vectors, and time-aligned speech transcriptions. AEA facilitates research in contextual AI and AR by providing rich, realistic, and spatially-temporally aligned data, addressing limitations of existing egocentric datasets that lack sensor modalities or precise 3D information. Multiple wearers recorded daily activities in five indoor locations using Project Aria glasses, capturing RGB video, monochrome scene videos, eyetracking videos, IMU data, spatial audio, and other sensor data. Machine Perception Services (MPS) provided precise 3D localization, eye gaze vectors, and time synchronization across devices. The dataset enables accurate 3D scene reconstruction using methods like Gaussian Splatting, leveraging the precise trajectory and point cloud data. AEA facilitates research in multimodal understanding, demonstrated through examples of eye gaze-prompted segmentation and speech-grounded segmentation using EfficientSAM and GroundingDino. AEA provides a valuable resource for studying real-world human activities with spatial-temporal context, enabling the development of personalized and context-aware AI assistants. Current reconstruction methods may not handle dynamic motions in the recordings optimally. Future work includes reconstructing the AEA dataset using NeRFstudio and exploring advanced methods for activity and scene understanding. egocentric vision, multimodal ai, 3d reconstruction, eye tracking, dataset
2402.13252 Report Improving Robustness for Joint Optimization of Camera Poses and Decomposed Low-Rank Tensorial Radiance Fields Bo-Yu Cheng, Wei-Chen Chiu, Yu-Lun Liu In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where the naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Moreover, based on the analysis of the frequency spectrum, we propose to apply convolutional Gaussian filters on 2D and 3D radiance fields for a coarse-to-fine training schedule that enables joint camera pose optimization. Leveraging the decomposition property in decomposed low-rank tensor, our method achieves an equivalent effect to brute-force 3D convolution with only incurring little computational overhead. To further improve the robustness and stability of joint optimization, we also propose techniques of smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss mask. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence for optimization. This paper proposes a method for joint refinement of camera pose and scene geometry represented by a decomposed low-rank tensor using only 2D images. Existing methods for joint pose optimization struggle with voxel-based NeRFs due to their tendency to overemphasize sharp edges, leading to sub-optimal solutions. The authors conduct a spectral analysis on a 1D signal alignment task and draw parallels to 3D joint optimization. Based on their findings, they propose a coarse-to-fine training schedule with separable component-wise convolution of Gaussian filters applied to both 2D and 3D radiance fields. Additionally, techniques like smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss masks are introduced to enhance robustness. The method achieves superior performance in novel view synthesis compared to previous approaches. It exhibits faster convergence, requiring only 50k training iterations compared to 200k iterations needed by other methods. The approach is shown to be effective and robust for both synthetic and real-world scenes. The current implementation relies on PyTorch and could potentially achieve faster speeds with custom CUDA acceleration. Future work could explore the applicability of the proposed techniques to other compressed voxel-based architectures like multi-resolution hash encoding. neural rendering, novel view synthesis, joint pose optimization, decomposed low-rank tensor, gaussian filtering
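The separable filtering trick can be illustrated on a single 1D line factor: blurring each factor of the decomposition approximates blurring the full 3D field at far lower cost, and the coarse-to-fine schedule comes from shrinking the blur sigma over training. The sketch below is a simplified illustration of that idea, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel1d(sigma, radius=None):
    radius = radius if radius is not None else max(1, int(3 * sigma))
    x = torch.arange(-radius, radius + 1, dtype=torch.float32)
    k = torch.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur_line_factor(line, sigma):
    """Blur one 1D factor of a decomposed low-rank tensor.
    line: (n_components, length)."""
    k = gaussian_kernel1d(sigma).to(line).view(1, 1, -1)
    x = line.unsqueeze(1)                         # (n_components, 1, length)
    x = F.conv1d(x, k, padding=k.shape[-1] // 2)  # same-length output
    return x.squeeze(1)

# Coarse-to-fine: start with a large sigma and anneal it over training, e.g.
# sigma_t = sigma_0 * (sigma_end / sigma_0) ** (step / max_steps).
```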
2402.13251 Report FlashTex: Fast Relightable Mesh Texturing with LightControlNet Kangle Deng, Timothy Omernick, Alexander Weiss, Deva Ramanan, Jun-Yan Zhu, Tinghui Zhou, Maneesh Agrawala Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. We introduce LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. Our algorithm is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures. This paper introduces a novel approach for rapid and automatic texturing of 3D meshes based on user-provided text prompts, enabling relighting by separating lighting from surface material. Creating realistic textures for 3D models is crucial in various industries, but manual methods are time-consuming and require expertise. Existing automatic methods are slow, prone to visual artifacts, and often bake lighting into the texture, limiting their usability. The proposed two-stage pipeline utilizes a new illumination-aware text-to-image model, LightControlNet. Stage 1 generates consistent reference views of the mesh under fixed lighting using multi-view visual prompting. Stage 2 optimizes texture quality and disentangles lighting using an improved Score Distillation Sampling (SDS) method with LightControlNet. The method generates high-quality, relightable textures significantly faster than previous approaches. Quantitative evaluations demonstrate superior performance over existing baselines in FID and KID metrics. User studies confirm preference for the method's output in realism, consistency with text prompts, and plausibility under varying lighting. Limitations include occasional baked-in lighting, imperfect material map disentanglement, and potential failure to fully adhere to complex text prompts. Future work involves addressing these limitations and exploring applications in related 3D content creation tasks. text-to-texture, 3d mesh texturing, relightable texture, diffusion models, controlnet
2402.13217 Report VideoPrism: A Foundational Visual Encoder for Video Understanding Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 30 out of 33 video understanding benchmarks. VideoPrism, a general-purpose video encoder pretrained on a large-scale dataset of video-text pairs and video-only clips, achieves state-of-the-art performance on a wide range of video understanding tasks using a single frozen model. Existing video foundation models often struggle with balancing appearance-heavy tasks and motion-centric reasoning, and building a truly foundational video model that excels across diverse tasks remains a challenge. VideoPrism is pretrained in two stages: 1) contrastive learning aligns a video encoder and a text encoder on video-text pairs, 2) masked video modeling with global-local distillation and token shuffling trains the video encoder on video-only data, leveraging knowledge from the first stage. Outperforms previous video foundation models on 30 out of 33 video understanding benchmarks, including VideoGLUE, zero-shot video-text retrieval, and CV for science tasks. Demonstrates robust generalizability, excelling on both appearance- and motion-focused tasks across diverse video sources. Shows strong scaling capabilities with both model size and data size, achieving substantial improvements with larger models and datasets. Reliance on noisy text data in the pretraining corpus might introduce potential biases and limitations. The current focus on short video clips limits the model's applicability to long video understanding. video foundation model, vision-language model, self-supervised learning, contrastive learning, masked video modeling
2402.13185 Report UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, Jiang Bian Recent advances in text-guided video editing have showcased promising results in appearance editing (e.g., stylization). However, video motion editing in the temporal dimension (e.g., from eating to waving), which distinguishes video editing from image editing, is underexplored. In this work, we present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. To realize motion editing while preserving source video content, based on the insights that temporal and spatial self-attention layers encode inter-frame and intra-frame dependency respectively, we introduce auxiliary motion-reference and reconstruction branches to produce text-guided motion and source features respectively. The obtained features are then injected into the main editing path via temporal and spatial self-attention layers. Extensive experiments demonstrate that UniEdit covers video motion editing and various appearance editing scenarios, and surpasses the state-of-the-art methods. Our code will be publicly available. Introduces UniEdit, a tuning-free framework for video motion and appearance editing utilizing a pre-trained text-to-video generator. Addresses limitations in existing video editing methods by enabling motion editing in addition to appearance editing without fine-tuning. Employs an inversion-then-generation pipeline with auxiliary branches for reconstruction and motion reference. Features from these branches are injected into the main editing path via spatial and temporal self-attention layers to achieve content preservation and motion control. Achieves superior performance compared to state-of-the-art methods in both qualitative and quantitative evaluations. Demonstrates the ability to edit various aspects of videos, including motion, style, object replacement, and background. Enables text-image-to-video generation by combining image animation techniques with UniEdit's editing capabilities. Simultaneous motion and appearance editing within a single iteration requires further exploration. Developing an automatic scheme for determining optimal hyper-parameters is an area for future work. video editing, motion editing, appearance editing, diffusion models, text-to-video generation
2402.13144 Report Neural Network Diffusion Kai Wang, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, Yang You Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters. Our approach is simple, utilizing an autoencoder and a standard latent diffusion model. The autoencoder extracts latent representations of a subset of the trained network parameters. A diffusion model is then trained to synthesize these latent parameter representations from random noise. It then generates new representations that are passed through the autoencoder's decoder, whose outputs are ready to use as new subsets of network parameters. Across various architectures and datasets, our diffusion process consistently generates models of comparable or improved performance over trained networks, with minimal additional cost. Notably, we empirically find that the generated models perform differently from the trained networks. Our results encourage further exploration of the versatile use of diffusion models. This paper introduces 'neural network diffusion (p-diff),' a novel approach using diffusion models to generate high-performing neural network parameters. This work explores the under-explored potential of diffusion models beyond visual generation, offering a new paradigm for generating effective network parameters. P-diff utilizes an autoencoder to learn latent representations of a subset of trained network parameters and employs a standard latent diffusion model to synthesize new representations from random noise. The synthesized representations are then decoded to obtain new network parameters. P-diff consistently achieves comparable or even superior performance to models trained by SGD across diverse datasets and architectures. The generation process is efficient, generating new models within seconds. Analysis reveals that the generated models exhibit distinct prediction patterns compared to the original training models, indicating genuine parameter synthesis rather than mere memorization. Current limitations include constraints in generating entire parameters of large architectures due to GPU memory. Future work will focus on addressing memory limitations, enhancing structure design efficiency, and improving performance stability. diffusion models, parameter generation, neural networks, deep learning, generative models
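A hedged sketch of the p-diff interface described above: flatten a chosen parameter subset into vectors, compress them with an autoencoder, and (not shown) train a standard latent diffusion model on the resulting latents; sampled latents are decoded and written back into a network. Module names and sizes are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

def flatten_params(module: nn.Module) -> torch.Tensor:
    """Concatenate a module's parameters into one vector (one training sample)."""
    return torch.cat([p.detach().flatten() for p in module.parameters()])

def unflatten_params(vec: torch.Tensor, module: nn.Module) -> None:
    """Write a generated parameter vector back into the module."""
    offset = 0
    with torch.no_grad():
        for p in module.parameters():
            n = p.numel()
            p.copy_(vec[offset:offset + n].view_as(p))
            offset += n

class ParamAutoencoder(nn.Module):
    """Toy autoencoder over flattened parameter vectors (illustrative sizes)."""
    def __init__(self, dim: int, latent: int = 256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 1024), nn.SiLU(), nn.Linear(1024, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 1024), nn.SiLU(), nn.Linear(1024, dim))
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

# A standard latent diffusion model would then be trained to sample new z's,
# which self.dec maps back to parameter vectors for unflatten_params().
```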
2402.13126 Report VGMShield: Mitigating Misuse of Video Generative Models Yan Pang, Yang Zhang, Tianhao Wang With the rapid advancement in video generation, people can conveniently utilize video generation models to create videos tailored to their specific desires. Nevertheless, there are also growing concerns about their potential misuse in creating and disseminating false information. In this work, we introduce VGMShield: a set of three straightforward but pioneering mitigations through the lifecycle of fake video generation. We start from fake video detection, trying to understand whether there is uniqueness in generated videos and whether we can differentiate them from real videos; then, we investigate the tracing problem, which maps a fake video back to the model that generated it. Toward these goals, we propose to leverage pre-trained models that focus on spatial-temporal dynamics as the backbone to identify inconsistencies in videos. Through experiments on seven state-of-the-art open-source models, we demonstrate that current models still cannot perfectly handle spatial-temporal relationships, and thus, we can accomplish detection and tracing with nearly perfect accuracy. Furthermore, anticipating future generative model improvements, we propose a prevention method that adds invisible perturbations to images to make the generated videos look unreal. Together with fake video detection and tracing, our multi-faceted set of solutions can effectively mitigate misuse of video generative models. This paper introduces the first defense pipeline, called VGMShield, specifically designed to address misuse issues in video generation models. The rapid advancement of video generation models raises concerns about their potential misuse in creating and spreading misinformation. VGMShield aims to mitigate these concerns by providing tools for detecting, tracing, and preventing the generation of fake videos. VGMShield leverages pre-trained video recognition models (I3D, X-CLIP, VideoMAE) to detect spatial-temporal inconsistencies in generated videos for both fake video detection and tracing the source model. Additionally, it introduces two misuse prevention methods based on adversarial examples that disrupt video generation by adding imperceptible perturbations to images. VideoMAE-based detection and tracing models achieve high accuracy (over 90%) in various realistic scenarios, demonstrating the presence of model-specific 'fingerprints' in generated videos. Analysis using Grad-CAM reveals that the VideoMAE-based model is particularly adept at identifying temporal anomalies, outperforming I3D which mainly focuses on spatial distortions. Both directed and undirected defense strategies successfully disrupt video generation by introducing imperceptible perturbations to images, effectively preventing misuse. The effectiveness of the proposed methods might be challenged as video generation models evolve to produce more realistic videos. Directed defense, while effective, requires careful selection of target images for optimal performance. video generation, misinformation detection, source tracing, adversarial defense, video forensics
2402.12974 Report Visual Style Prompting with Swapping Self-Attention Jaeseok Jeong, Junho Kim, Yunjey Choi, Gayoung Lee, Youngjung Uh In the evolving domain of text-to-image generation, diffusion models have emerged as powerful tools in content creation. Despite their remarkable capability, existing models still face challenges in achieving controlled generation with a consistent style, requiring costly fine-tuning or often inadequately transferring the visual elements due to content leakage. To address these challenges, we propose a novel approach, Visual Style Prompting, to produce a diverse range of images while maintaining specific style elements and nuances. During the denoising process, we keep the query from original features while swapping the key and value with those from reference features in the late self-attention layers. This approach allows for the visual style prompting without any fine-tuning, ensuring that generated images maintain a faithful style. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, best reflecting the style of the references and ensuring that resulting images match the text prompts most accurately. Our project page is available at https://curryjung.github.io/VisualStylePrompt/. This paper introduces Visual Style Prompting with Swapping Self-Attention, a novel method to generate images that reflect the style of a reference image while adhering to the content specified in a text prompt, all without requiring fine-tuning. Existing text-to-image generation models struggle to achieve controlled generation with consistent styles. This new approach aims to overcome limitations of existing methods that require costly fine-tuning and often suffer from content leakage. The method leverages a swapping self-attention mechanism. It maintains the queries from original image features while swapping the keys and values with those from reference image features in the late self-attention layers of a diffusion model. The approach successfully generates images reflecting the style of reference images while minimizing content leakage. It outperforms existing methods in terms of style fidelity, text prompt alignment, and diversity of generated images. The method is versatile and compatible with other techniques like ControlNet and Dreambooth-LoRA. The method is limited by the capabilities of the pre-trained diffusion model used. Future work includes exploring better inversion methods for real images and extending the approach to other domains like video. text-to-image generation, diffusion models, visual style prompting, swapping self-attention, content leakage
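A minimal, hedged sketch of the swapping self-attention step: queries come from the image being generated while keys and values are taken from the style reference's features at the same layer and denoising step (tensor shapes and the attention call are generic, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def swapped_self_attention(q_content, k_ref, v_ref, num_heads=8):
    """Swapping self-attention (hedged sketch): queries from the content path,
    keys/values from the style-reference path. Tensors: (batch, tokens, dim)."""
    b, n, d = q_content.shape
    def split(x):  # (b, n, d) -> (b, heads, n, d/heads)
        return x.view(b, -1, num_heads, d // num_heads).transpose(1, 2)
    q, k, v = split(q_content), split(k_ref), split(v_ref)
    out = F.scaled_dot_product_attention(q, k, v)  # standard attention math
    return out.transpose(1, 2).reshape(b, n, d)
```

Applying this only in the late self-attention layers, as the paper describes, is what transfers style while limiting content leakage.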
2402.12927 Report CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection Sohail Ahmed Khan, Duc-Tien Dang-Nguyen The recent advancements in Generative Adversarial Networks (GANs) and the emergence of Diffusion models have significantly streamlined the production of highly realistic and widely accessible synthetic content. As a result, there is a pressing need for effective general purpose detection mechanisms to mitigate the potential risks posed by deepfakes. In this paper, we explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection. Following previous studies in this domain, we employ only a single dataset (ProGAN) in order to adapt CLIP for deepfake detection. However, in contrast to prior research, which relies solely on the visual part of CLIP while ignoring its textual component, our analysis reveals that retaining the text part is crucial. Consequently, the simple and lightweight Prompt Tuning based adaptation strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while utilizing less than one third of the training data (200k images as compared to 720k). To assess the real-world applicability of our proposed models, we conduct a comprehensive evaluation across various scenarios. This involves rigorous testing on images sourced from 21 distinct datasets, including those generated by GAN-based, diffusion-based, and commercial tools. This paper investigates the effectiveness of adapting pre-trained vision-language models (VLMs), specifically CLIP, for universal deepfake detection by leveraging both visual and textual information. Existing deepfake detection models often struggle to generalize across different data distributions due to their focus on detecting specific artifacts. This work explores the potential of VLMs, which are trained on diverse datasets and possess strong zero-shot capabilities, to overcome this limitation. The authors adapt CLIP for deepfake detection using four transfer learning strategies: Linear Probing, Fine-tuning, Prompt Tuning (CoOp), and Adapter Network. They train the models on the ProGAN dataset and evaluate them on a comprehensive test set of 21 different image generators, including GANs, Diffusion models, and commercial tools. Adapting CLIP using both visual and textual components significantly outperforms methods relying solely on visual features. Prompt Tuning with CoOp achieves state-of-the-art performance, surpassing previous methods in both mAP and average accuracy while using less training data. CLIP-based detectors demonstrate robust performance even with limited training data and in the presence of post-processing operations. The paper focuses on single-image deepfake detection and does not explore video-based deepfakes. Further research is needed to investigate the performance of the proposed methods on emerging deepfake generation techniques. deepfake detection, transfer learning, vision-language models, clip, prompt tuning
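Of the adaptation strategies compared, linear probing is the simplest to illustrate. A hedged sketch using the OpenAI clip package with a frozen ViT-L/14 encoder and a binary real/fake head (the checkpoint choice and the head are assumptions; the paper's best-performing variant is prompt tuning, which is not shown here):

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package (assumed available)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # frozen vision backbone
model.eval()

head = nn.Linear(768, 2).to(device)  # real vs. fake; 768 = ViT-L/14 embedding dim
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One linear-probe step: CLIP features stay frozen, only the head learns."""
    with torch.no_grad():
        feats = model.encode_image(images.to(device)).float()
    logits = head(feats)
    loss = loss_fn(logits, labels.to(device))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```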
2402.12908 Report RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transferred-friendly text-to-image generation framework, which aims to leverage the respective advantages of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints and segmentation maps) to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models. Our code is available at: https://github.com/YangLing0818/RealCompo This paper proposes RealCompo, a training-free and transferred-friendly text-to-image generation framework that balances realism and compositionality by dynamically combining the strengths of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints, segmentation maps). Existing text-to-image models struggle with accurately aligning with prompts involving multiple objects or complex relationships, highlighting the need for improved compositional generation while maintaining realism. RealCompo utilizes a novel "balancer" that dynamically adjusts the influence of predicted noise from both a text-to-image model and a spatial-aware image diffusion model based on their cross-attention maps during the denoising process. RealCompo outperforms state-of-the-art text-to-image and layout-to-image models in compositional generation benchmarks (T2I-CompBench). RealCompo exhibits superior image realism and aesthetic quality compared to baselines, as evidenced by higher CLIP and aesthetic scores. RealCompo demonstrates strong generalizability and can be extended to various spatial-aware conditions and stylized image generation tasks. RealCompo's computational cost is slightly higher than single-model approaches. Future work includes exploring more computationally efficient methods and extending RealCompo to text-to-video or text-to-3D generation. text-to-image generation, compositionality, diffusion models, spatial awareness, controllable generation
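The balancer reduces, at each denoising step, to a weighted combination of the two models' noise predictions. A hedged sketch of only that combination step (in the paper the coefficients are updated dynamically from cross-attention maps; here they are plain inputs):

```python
import torch

def balanced_noise(eps_t2i, eps_spatial, coef_t2i: float, coef_spatial: float):
    """RealCompo-style noise balancing at one denoising step (hedged sketch).

    eps_*  : noise predictions from the text-to-image model and the
             spatial-aware (e.g. layout-conditioned) model, same shape.
    coef_* : balance coefficients; the paper updates these dynamically from
             cross-attention maps, which is omitted here.
    """
    w = torch.softmax(torch.tensor([coef_t2i, coef_spatial]), dim=0)  # normalize
    return w[0] * eps_t2i + w[1] * eps_spatial
```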
2402.12760 Report A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis Nailei Hei, Qianyu Guo, Zihao Wang, Yan Wang, Haofen Wang, Wenqiang Zhang Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve the desired results by manually entering prompts due to a discrepancy between novice-user-input prompts and the model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. For CFP, we construct a novel dataset for text-to-image tasks that combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. For UF-FGTG, we propose a novel framework that automatically translates user-input prompts into model-preferred prompts. Specifically, we propose a prompt refiner that continually rewrites prompts to empower users to select results that align with their unique needs. Meanwhile, we integrate image-related loss functions from the text-to-image model into the training process of text generation to generate model-preferred prompts. Additionally, we propose an adaptive feature extraction module to ensure diversity in the generated results. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics. This paper introduces CFP, a novel dataset bridging the gap between user input and model-preferred prompts for text-to-image generation, and proposes UF-FGTG, a user-friendly framework for automated prompt optimization. Novice users often struggle to craft effective prompts for text-to-image models. This work addresses this by aligning user input with model preferences and improving image generation quality. The UF-FGTG framework utilizes a prompt refiner to transform coarse-grained prompts into fine-grained ones, incorporates image-related loss functions for model-preferred prompts, and employs an adaptive feature extraction module for result diversity. UF-FGTG generates visually appealing images superior to existing language models like GPT-4. The framework consistently outperforms other methods in image quality and aesthetic assessments, demonstrating a 5% improvement. The adaptive feature extraction module effectively enhances the diversity of generated images. The study primarily focuses on Stable Diffusion, potentially limiting generalizability to other text-to-image models. Exploration of alternative adaptive feature extraction modules and prompt refinement techniques could further enhance performance. text-to-image generation, prompt engineering, dataset creation, deep learning, computer vision
2402.12741 Report MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion Sen Li, Ruochen Wang, Cho-Jui Hsieh, Minhao Cheng, Tianyi Zhou Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlapping, and attribute bindings. In this paper, we develop a training-free Multimodal-LLM agent (MuLan) to address these challenges by progressive multi-object generation with planning and feedback control, like a human painter. MuLan harnesses a large language model (LLM) to decompose a prompt to a sequence of sub-tasks, each generating only one object conditioned on previously generated objects by stable diffusion. Unlike existing LLM-grounded methods, MuLan only produces a high-level plan at the beginning while the exact size and location of each object are determined by an LLM and attention guidance upon each sub-task. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback to the image generated in each sub-task and control the diffusion model to re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task it is specialized for. We collect 200 prompts containing multi-objects with spatial relationships and attribute bindings from different benchmarks to evaluate MuLan. The results demonstrate the superiority of MuLan in generating multiple objects over baselines. The code is available on https://github.com/measure-infinity/mulan-code. This paper introduces MuLan, a training-free Multimodal-LLM Agent, designed to enhance the quality of images generated from intricate text prompts containing multiple objects, particularly by improving spatial relationships and attribute bindings, commonly challenging for existing text-to-image models. Current text-to-image models struggle to accurately represent complex prompts involving multiple objects with specific attributes and spatial relationships. MuLan addresses this limitation by utilizing the strengths of LLMs, diffusion models, and VLMs in a collaborative framework. MuLan decomposes a complex prompt into a sequence of simpler sub-prompts using an LLM planner. It then progressively generates one object per stage, guided by an LLM-generated rough mask and refined by attention guidance within a diffusion model. A VLM feedback loop ensures each stage aligns with the prompt, allowing for adjustments before proceeding. MuLan significantly outperforms baseline models (including SDXL, PixArt-α) in generating images from complex prompts containing multiple objects with specific attributes and spatial arrangements, as evaluated by both GPT-4V and human assessors. The integration of VLM feedback control is crucial, leading to substantial performance improvements compared to a version of MuLan without this component. MuLan demonstrates flexibility by effectively using various VLMs (LLaVA, GPT-4V, Gemini-Pro) without significant performance differences. The multi-stage generation process in MuLan, while allowing for fine-grained control, can be more time-consuming than single-stage generation methods. Potential errors in prompt decomposition by the LLM could cascade through the generation process. Future work could explore LLM-based prompt rewriting to minimize such errors. text-to-image generation, multimodal-llm, diffusion models, controllable generation, vlm feedback
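A hedged, pseudocode-level sketch of the progressive loop described above: the LLM plans once, each stage adds one object, and a VLM critic decides whether to regenerate. All three callables are placeholders, not the authors' API:

```python
def mulan_generate(prompt, llm_plan, generate_object, vlm_check, max_retries=3):
    """Progressive multi-object generation with feedback (hedged sketch).

    llm_plan(prompt)            -> list of per-object sub-prompts (high-level plan)
    generate_object(sub, image) -> image with one more object, conditioned on the
                                   partial image (diffusion + attention guidance)
    vlm_check(image, sub)       -> (ok: bool, feedback: str) from a VLM critic
    """
    sub_prompts = llm_plan(prompt)
    image = None
    for sub in sub_prompts:
        for _ in range(max_retries):
            candidate = generate_object(sub, image)
            ok, _feedback = vlm_check(candidate, sub)
            if ok:
                image = candidate
                break
        else:
            image = candidate  # keep the last attempt if none passed the check
    return image
```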
2402.12712 Report MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction Shitao Tang, Jiacheng Chen, Dilin Wang, Chengzhou Tang, Fuyang Zhang, Yuchen Fan, Vikas Chandra, Yasutaka Furukawa, Rakesh Ranjan This paper presents a neural architecture, MVDiffusion++, for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses. MVDiffusion++ achieves superior flexibility and scalability with two surprisingly simple ideas: 1) A "pose-free architecture" where standard self-attention among 2D latent features learns 3D consistency across an arbitrary number of conditional and generation views without explicitly using camera pose information; and 2) A "view dropout strategy" that discards a substantial number of output views during training, which reduces the training-time memory footprint and enables dense and high-resolution view synthesis at test time. We use the Objaverse for training and the Google Scanned Objects for evaluation with standard novel view synthesis and 3D reconstruction metrics, where MVDiffusion++ significantly outperforms the current state of the art. We also demonstrate a text-to-3D application example by combining MVDiffusion++ with a text-to-image generative model. The project page is at https://mvdiffusion-plusplus.github.io. Presents MVDiffusion++, a novel multi-view diffusion model for reconstructing dense, high-resolution 3D objects from single or sparse unposed images. Addresses limitations of existing methods that struggle with high-resolution outputs and rely on accurate camera pose estimation, enabling more flexible and scalable 3D object reconstruction. Introduces a pose-free architecture with self-attention among 2D latent features to learn 3D consistency across views. Employs a view dropout strategy during training to reduce memory footprint and enable high-resolution image generation. Achieves state-of-the-art performance on single-view reconstruction, outperforming SyncDreamer by 0.1552 in Vol. IoU on Google Scanned Objects dataset. Significantly improves novel view synthesis quality in sparse view settings, surpassing LEAP by 8.19 PSNR. Demonstrates successful text-to-3D applications by integrating with text-to-image generative models. Struggles with reconstructing thin object structures. May generate implausible images for occluded views. 3d reconstruction, diffusion models, multi-view image generation, pose-free, view synthesis
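The view dropout strategy is simple to state: during training only a random subset of the output views is kept, which bounds attention cost and memory, while at test time all views are generated. A hedged sketch of just the sampling step:

```python
import torch

def view_dropout(view_latents: torch.Tensor, keep: int) -> torch.Tensor:
    """View dropout (hedged sketch): keep only `keep` randomly chosen output
    views per training example so attention cost and memory stay bounded.

    view_latents: (num_views, C, H, W) latents of the candidate output views.
    """
    num_views = view_latents.shape[0]
    idx = torch.randperm(num_views)[:keep]  # random subset of output views
    return view_latents[idx]                # the denoiser is trained on these only
```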
2402.12550 Report Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization James Oldfield, Markos Georgopoulos, Grigorios G. Chrysos, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Jiankang Deng, Ioannis Patras The Mixture of Experts (MoE) paradigm provides a powerful way to decompose inscrutable dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. A major problem, however, lies in the computational cost of scaling the number of experts to achieve sufficiently fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts (MMoE) layer to address this, focusing on vision models. MMoE layers perform an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, MMoEs both (1) avoid the issues incurred through the discrete expert routing in the popular 'sparse' MoE models, yet (2) do not incur the restrictively high inference-time costs of 'soft' MoE alternatives. We present both qualitative and quantitative evidence (through visualization and counterfactual interventions respectively) that scaling MMoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class-level whilst remaining competitive with the performance of parameter-matched linear layer counterparts. Finally, we show that learned expert specialism further facilitates manual correction of demographic bias in CelebA attribute classification. Our MMoE model code is available at https://github.com/james-oldfield/MMoE. This paper introduces the Multilinear Mixture of Experts (MMoE) layer, a novel architecture for deep learning models that allows for the efficient computation and fusion of a large number of expert operations. The MMoE layer addresses the limitations of traditional MoEs (Mixture of Experts) in scaling to a large number of experts while promoting expert specialization and enabling interpretability and editability of the model. MMoE leverages tensor factorization techniques (CP, Tucker, Tensor Train, Tensor Ring) to represent the weight tensor of experts in a compressed form, enabling efficient computation with tens of thousands of experts. The model learns to specialize experts towards subtasks by fine-tuning MMoE layers on various image classification tasks. Scaling up the number of experts in MMoE leads to increased expert specialization, where individual experts learn to process specific classes or categories of images. MMoE's factorized architecture allows for manual editing of expert combinations to mitigate demographic bias in image classification, leading to improved fairness metrics. MMoE layers achieve competitive performance compared to parameter-matched linear layers when fine-tuning foundation models (CLIP, DINO) for image classification on various datasets. The evaluation of expert behavior is primarily focused on in-domain data, and further investigation is needed to assess the generalization of MMoEs under domain shift. Future work could explore the application of MMoEs to natural language processing tasks and investigate their performance in broader settings. mixture of experts, tensor factorization, interpretability, model editing, fairness
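A hedged sketch of a CP-factorized multilinear MoE layer consistent with the description above: the (experts × out × in) weight tensor is never materialized, and the forward pass contracts the input and the gate vector with the CP factors directly. The softmax gate is an assumption, not necessarily the paper's choice:

```python
import torch
import torch.nn as nn

class CPMMoE(nn.Module):
    """CP-factorized multilinear MoE layer (hedged sketch).

    W[e, o, i] = sum_r A[e, r] * B[o, r] * C[i, r], so the output
    y_o = sum_e g_e sum_i W[e, o, i] x_i can be computed entirely
    from the factors without forming W.
    """
    def __init__(self, d_in, d_out, n_experts, rank):
        super().__init__()
        self.A = nn.Parameter(torch.randn(n_experts, rank) * 0.02)  # expert factor
        self.B = nn.Parameter(torch.randn(d_out, rank) * 0.02)      # output factor
        self.C = nn.Parameter(torch.randn(d_in, rank) * 0.02)       # input factor
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x):                        # x: (batch, d_in)
        g = torch.softmax(self.gate(x), dim=-1)  # (batch, n_experts) soft weights
        u = x @ self.C                           # (batch, rank) = C^T x
        v = g @ self.A                           # (batch, rank) = A^T g
        return (u * v) @ self.B.T                # (batch, d_out) = B (u * v)
```

The per-sample cost is O(rank · (d_in + d_out + n_experts)), which is why the expert count can be scaled far beyond what a dense expert tensor would allow.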
2402.12377 Report Binary Opacity Grids: Capturing Fine Geometric Detail for Mesh-Based View Synthesis Christian Reiser, Stephan Garbin, Pratul P. Srinivasan, Dor Verbin, Richard Szeliski, Ben Mildenhall, Jonathan T. Barron, Peter Hedman, Andreas Geiger While surface-based view synthesis algorithms are appealing due to their low computational requirements, they often struggle to reproduce thin structures. In contrast, more expensive methods that model the scene's geometry as a volumetric density field (e.g. NeRF) excel at reconstructing fine geometric detail. However, density fields often represent geometry in a "fuzzy" manner, which hinders exact localization of the surface. In this work, we modify density fields to encourage them to converge towards surfaces, without compromising their ability to reconstruct thin structures. First, we employ a discrete opacity grid representation instead of a continuous density field, which allows opacity values to discontinuously transition from zero to one at the surface. Second, we anti-alias by casting multiple rays per pixel, which allows occlusion boundaries and subpixel structures to be modelled without using semi-transparent voxels. Third, we minimize the binary entropy of the opacity values, which facilitates the extraction of surface geometry by encouraging opacity values to binarize towards the end of training. Lastly, we develop a fusion-based meshing strategy followed by mesh simplification and appearance model fitting. The compact meshes produced by our model can be rendered in real-time on mobile devices and achieve significantly higher view synthesis quality compared to existing mesh-based approaches. This paper presents a novel method for reconstructing compact triangle meshes from multi-view images, capable of capturing fine geometric detail like leaves and branches for real-time view synthesis. Surface-based view synthesis, while efficient, struggles to reproduce thin structures, unlike computationally expensive volumetric methods. This work bridges this gap by enhancing surface-based methods to reconstruct fine details. The method utilizes a high-resolution opacity grid, encouraging binary opacity values (0 or 1) through an entropy loss and supersampling. This allows precise surface localization, converting the grid into a simplified, real-time renderable mesh. The approach achieves higher quality than existing mesh-based methods, especially for thin structures. The resulting meshes are compact enough for real-time rendering on mobile devices. The method outperforms BakedSDF, the previous state-of-the-art in mesh-based view synthesis, in both quality and compactness. Training-time supersampling introduces significant computational overhead. Background reconstruction can be noisy, leading to larger mesh sizes, potentially mitigated by smoothness regularization. novel view synthesis, differentiable rendering, neural radiance fields, multiview-to-3d, real-time rendering
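The binarization objective is a binary-entropy penalty on the per-sample opacities, pushing them toward 0 or 1 so a surface can be extracted cleanly. A hedged sketch:

```python
import torch

def binary_entropy_loss(alpha: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Binary-entropy penalty H(a) = -a*log(a) - (1-a)*log(1-a), averaged over
    opacity samples; minimizing it drives opacities toward 0 or 1."""
    a = alpha.clamp(eps, 1.0 - eps)  # avoid log(0)
    return (-(a * a.log() + (1 - a) * (1 - a).log())).mean()
```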
2402.12376 Report FiT: Flexible Vision Transformer for Diffusion Model Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at https://github.com/whlzy/FiT. This paper introduces FiT, a Flexible Vision Transformer tailored for diffusion models, capable of generating images at any resolution and aspect ratio. Existing diffusion models struggle to generalize across arbitrary resolutions and aspect ratios. FiT addresses this limitation by conceptualizing images as sequences of variable-length tokens, unlike traditional methods that rely on fixed-resolution grids. The paper presents a three-pronged approach: 1) a flexible training pipeline that eliminates the need for cropping by resizing high-resolution images to a maximum token limit, 2) a unique transformer architecture utilizing 2D Rotary Positional Embedding (RoPE) and Masked MHSA to handle dynamic token lengths, and 3) a training-free resolution extrapolation method inspired by techniques used in large language models. FiT significantly outperforms previous state-of-the-art models on class-conditional image generation benchmarks across various resolutions and aspect ratios. Flexible training with dynamic token lengths proves crucial for resolution generalization and surpasses the performance of fixed-resolution training. Training-free resolution extrapolation methods, specifically VisionNTK and VisionYaRN, further enhance FiT's ability to generate high-quality images at resolutions exceeding those seen during training. Limited computational resources restricted the training of the largest FiT model, potentially hindering performance at the 256x256 resolution. The generative capabilities of FiT with higher resolution training and alternative resolution extrapolation techniques requiring additional training remain unexplored. vision transformers, diffusion models, image generation, resolution extrapolation, arbitrary aspect ratio
2402.12336 Report Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (VLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of VLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the VLM is required. The code and robust models are available at https://github.com/chs20/RobustVLM The paper proposes FARE, an unsupervised adversarial fine-tuning scheme for the vision encoder of CLIP, to make vision-language models (VLMs) robust against adversarial attacks on images. Large multi-modal foundation models are vulnerable to adversarial attacks, posing significant risks such as spreading misinformation and defrauding users. Robustness is crucial for their safe deployment. FARE fine-tunes the vision encoder by minimizing the difference between its embeddings of perturbed and original images, preserving feature similarity to the original CLIP for clean inputs. FARE makes VLMs like OpenFlamingo and LLaVA robust to imperceptible targeted attacks while maintaining high performance on clean data. FARE outperforms the supervised method TeCoA in terms of both robustness and clean performance across various downstream tasks. Robust CLIP models trained with FARE exhibit lower hallucination rates and better performance in chain-of-thought tasks. The study focuses on CLIP-based VLMs and doesn't explore applicability to other architectures. The defense is focused on the vision modality, with the language side robustness left for future work. adversarial robustness, vision-language models, clip, unsupervised adversarial training, multi-modal foundation models
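A hedged sketch of the unsupervised adversarial fine-tuning objective described above: an L-infinity perturbation is found that pushes the trainable encoder's embedding away from the frozen original CLIP embedding of the clean image, and the fine-tuning step then minimizes that distance. Step size, budget, and loop count are assumptions:

```python
import torch

def fare_loss(encoder, frozen_encoder, images, eps=4/255, steps=10, alpha=1/255):
    """Unsupervised adversarial fine-tuning loss for a vision encoder (hedged).

    encoder:        trainable CLIP vision encoder
    frozen_encoder: frozen copy of the original encoder (embedding target)
    images:         clean images in [0, 1]
    """
    with torch.no_grad():
        target = frozen_encoder(images)                   # embeddings to preserve
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):                                # PGD inner maximization
        emb = encoder((images + delta).clamp(0, 1))
        dist = ((emb - target) ** 2).sum(dim=-1).mean()
        grad, = torch.autograd.grad(dist, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    emb = encoder((images + delta).clamp(0, 1))           # outer minimization
    return ((emb - target) ** 2).sum(dim=-1).mean()
```

Because the loss only matches embeddings, no labels are needed, and any downstream model built on the CLIP encoder inherits the robustness without retraining.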
2402.12259 Report Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships Sebastian Koch, Narunas Vaskevicius, Mirco Colosi, Pedro Hermosilla, Timo Ropinski Current approaches for 3D scene graph prediction rely on labeled datasets to train models for a fixed set of known object classes and relationship categories. We present Open3DSG, an alternative approach to learn 3D scene graph prediction in an open world without requiring labeled scene graph data. We co-embed the features from a 3D scene graph prediction backbone with the feature space of powerful open world 2D vision language foundation models. This enables us to predict 3D scene graphs from 3D point clouds in a zero-shot manner by querying object classes from an open vocabulary and predicting the inter-object relationships from a grounded LLM with scene graph features and queried object classes as context. Open3DSG is the first 3D point cloud method to predict not only explicit open-vocabulary object classes, but also open-set relationships that are not limited to a predefined label set, making it possible to express rare as well as specific objects and relationships in the predicted 3D scene graph. Our experiments show that Open3DSG is effective at predicting arbitrary object classes as well as their complex inter-object relationships describing spatial, supportive, semantic and comparative relationships. This paper introduces the first approach for predicting open-vocabulary 3D scene graphs from point clouds, enabling the representation of scenes with arbitrary object classes and relationships. Existing 3D scene graph prediction methods are limited to a fixed set of object and relationship labels, hindering their applicability in real-world scenarios requiring broader semantic understanding. The method distills knowledge from 2D vision-language models into a 3D graph neural network. It uses CLIP for open-vocabulary object prediction and a grounded LLM for relationship prediction based on predicted object classes and learned relationship features. The method outperforms existing methods on predicting rare object and predicate classes. It achieves comparable performance to fully supervised methods on a closed-set benchmark. Qualitative results demonstrate the capability to predict specific object classes and relationships. Predicting diverse open-vocabulary relationships remains a challenge. Systematic evaluation of open-vocabulary 3D scene graphs is an open problem. 3d scene graph, open vocabulary, zero-shot learning, vision-language models, graph neural networks
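The open-vocabulary object query amounts to cosine similarity between distilled, CLIP-aligned 3D node features and CLIP text embeddings of the queried class names. A hedged sketch using the OpenAI clip package (the prompt template and checkpoint are assumptions; the paper uses its own distilled graph features):

```python
import torch
import clip  # OpenAI CLIP package (assumed); node features are distilled into this space

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def query_objects(node_features: torch.Tensor, vocabulary: list[str]) -> list[str]:
    """Assign an open-vocabulary label to each 3D graph node by cosine similarity
    between its CLIP-aligned feature and text embeddings of the class names."""
    tokens = clip.tokenize([f"a {c} in a scene" for c in vocabulary]).to(device)
    text = model.encode_text(tokens).float()
    text = text / text.norm(dim=-1, keepdim=True)
    nodes = node_features.to(device).float()
    nodes = nodes / nodes.norm(dim=-1, keepdim=True)
    best = (nodes @ text.T).argmax(dim=-1)  # highest cosine similarity per node
    return [vocabulary[i] for i in best.tolist()]
```

Relationships are then phrased by a grounded LLM given the queried object classes and the edge features, which is why the predicate set is not fixed in advance.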
2402.12121 Report Evaluating Image Review Ability of Vision Language Models Shigeki Saito, Kazuki Hayashi, Yusuke Ide, Yusuke Sakai, Kazuma Onishi, Toma Suzuki, Seiji Gobara, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe Large-scale vision language models (LVLMs) are language models that are capable of processing images and text inputs by a single model. This paper explores the use of LVLMs to generate review texts for images. The ability of LVLMs to review images is not fully understood, highlighting the need for a methodical evaluation of their review abilities. Unlike image captions, review texts can be written from various perspectives such as image composition and exposure. This diversity of review perspectives makes it difficult to uniquely determine a single correct review for an image. To address this challenge, we introduce an evaluation method based on rank correlation analysis, in which review texts are ranked by humans and LVLMs, then, measures the correlation between these rankings. We further validate this approach by creating a benchmark dataset aimed at assessing the image review ability of recent LVLMs. Our experiments with the dataset reveal that LVLMs, particularly those with proven superiority in other evaluative contexts, excel at distinguishing between high-quality and substandard image reviews. This paper introduces a novel method for evaluating the ability of Large-scale Vision Language Models (LVLMs) to generate review texts for images, addressing the challenge of subjective review perspectives. This evaluation is crucial for understanding LVLMs' capacity to provide detailed and objective feedback on images, potentially replacing human judges in assessment contexts. The method involves ranking review texts generated by LVLMs and human annotators, then calculating the rank correlation to assess alignment. A new benchmark dataset with ranked reviews was created to validate this approach. LVLMs, especially those excelling in other evaluation tasks, show increasing ability to distinguish high-quality from substandard reviews. The proposed evaluation method, based on rank correlation, proves effective in assessing LVLMs' review generation capabilities. Newer LVLMs demonstrate better support for multiple languages compared to earlier models. The current method does not incorporate domain-specific knowledge for evaluation. The dataset, sourced from English Wikipedia, may contain inherent biases. large-scale vision language models, image review generation, evaluation method, rank correlation analysis, benchmark dataset
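The core of the evaluation is the rank correlation between the human ordering of review texts and the model's ordering. A hedged sketch with SciPy (the specific correlation coefficient reported in the paper may differ):

```python
from scipy.stats import spearmanr, kendalltau

def ranking_agreement(human_ranks, model_ranks):
    """Correlation between human and LVLM rankings of the same review texts.
    Ranks are lists of integers, one per review (1 = best)."""
    rho, _ = spearmanr(human_ranks, model_ranks)
    tau, _ = kendalltau(human_ranks, model_ranks)
    return {"spearman": rho, "kendall": tau}

# Example: perfect agreement on four reviews -> both coefficients equal 1.0
print(ranking_agreement([1, 2, 3, 4], [1, 2, 3, 4]))
```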
2402.12004 Report Direct Consistency Optimization for Compositional Text-to-Image Personalization Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency. However, they still fall short in synthesizing images of different scenarios or styles that are possible in the original pretrained models. To address this, we propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model. We devise a novel training objective for T2I diffusion models that minimally fine-tunes the pretrained model to achieve consistency. Our method, dubbed Direct Consistency Optimization, is as simple as regular diffusion loss, while significantly enhancing the compositionality of personalized T2I models. Also, our approach induces a new sampling method that controls the tradeoff between image fidelity and prompt fidelity. Lastly, we emphasize the necessity of using a comprehensive caption for reference images to further enhance the image-text alignment. We show the efficacy of the proposed method on the T2I personalization for subject, style, or both. In particular, our method results in a superior Pareto frontier to the baselines. Generated examples and code are available on our project page (https://dco-t2i.github.io/). This paper proposes Direct Consistency Optimization (DCO), a novel fine-tuning objective for Text-to-Image (T2I) diffusion models that enhances compositionality in personalized image generation. Existing T2I personalization methods, while effective in learning new concepts from few images, often suffer from reduced textual alignment and compositional generation capability due to knowledge forgetting and concept collapse. DCO casts fine-tuning as a constrained policy optimization problem. It maximizes consistency to reference images while minimizing deviation from the pretrained model. This approach preserves the compositionality of the original model while incorporating new concepts. DCO outperforms baselines like DreamBooth in subject and style personalization, demonstrating superior image-text alignment and subject fidelity. The paper introduces “reward guidance”, a sampling method that allows users to control the tradeoff between image fidelity and prompt fidelity. The authors emphasize the importance of using comprehensive captions for reference images to enhance model disentanglement and prevent concept collapse. The current implementation of DCO increases computational cost due to additional inference steps during training and sampling. Future work could explore efficient fine-tuning methods to address this. While reward guidance sampling allows control over subject fidelity and textual alignment, finding the optimal guidance scale for a given dataset or prompt requires further investigation. text-to-image synthesis, diffusion models, personalization, compositionality, fine-tuning
2402.11929 Report DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation Chong Zeng, Yue Dong, Pieter Peers, Youkang Kong, Hongzhi Wu, Xin Tong This paper presents a novel method for exerting fine-grained lighting control during text-driven diffusion-based image generation. While existing diffusion models already have the ability to generate images under any lighting condition, without additional guidance these models tend to correlate image content and lighting. Moreover, text prompts lack the necessary expressional power to describe detailed lighting setups. To provide the content creator with fine-grained control over the lighting during image generation, we augment the text-prompt with detailed lighting information in the form of radiance hints, i.e., visualizations of the scene geometry with a homogeneous canonical material under the target lighting. However, the scene geometry needed to produce the radiance hints is unknown. Our key observation is that we only need to guide the diffusion process, hence exact radiance hints are not necessary; we only need to point the diffusion model in the right direction. Based on this observation, we introduce a three stage method for controlling the lighting during image generation. In the first stage, we leverage a standard pretrained diffusion model to generate a provisional image under uncontrolled lighting. Next, in the second stage, we resynthesize and refine the foreground object in the generated image by passing the target lighting to a refined diffusion model, named DiLightNet, using radiance hints computed on a coarse shape of the foreground object inferred from the provisional image. To retain the texture details, we multiply the radiance hints with a neural encoding of the provisional synthesized image before passing it to DiLightNet. Finally, in the third stage, we resynthesize the background to be consistent with the lighting on the foreground object. We demonstrate and validate our lighting controlled diffusion model on a variety of text prompts and lighting conditions. This paper presents DiLightNet, a novel method for fine-grained lighting control during text-driven diffusion-based image generation by augmenting text prompts with radiance hints. Existing diffusion models struggle to decouple lighting from image content and text prompts lack the expressive power for detailed lighting descriptions, limiting creative control over lighting. The method involves three stages: (1) generating a provisional image with uncontrolled lighting from a text prompt, (2) resynthesizing the foreground using DiLightNet guided by radiance hints computed from a coarse depth estimate and the target lighting, and (3) inpainting a consistent background. DiLightNet successfully controls lighting in generated images, enabling diverse lighting conditions for the same text prompt. Appearance-seed allows exploring plausible material interpretations, while prompt specialization offers additional control over material properties. Ablation study validates the importance of provisional image encoding, radiance hint selection, foreground masking, and data augmentation. Material-light interactions might not perfectly align with the prompt due to limitations in text-based material control. Reliance on off-the-shelf depth and mask estimation can impact results when estimation is inaccurate. diffusion models, image generation, lighting control, radiance hints, controlnet
2402.11849 Report ComFusion: Personalized Subject Generation in Multiple Specific Scenes From Single Image Yan Hong, Jianfu Zhang Recent advancements in personalizing text-to-image (T2I) diffusion models have shown the capability to generate images based on personalized visual concepts using a limited number of user-provided examples. However, these models often struggle with maintaining high visual fidelity, particularly in manipulating scenes as defined by textual inputs. Addressing this, we introduce ComFusion, a novel approach that leverages pretrained models to generate compositions of a few user-provided subject images and predefined text scenes, effectively fusing visual-subject instances with textual-specific scenes, resulting in the generation of high-fidelity instances within diverse scenes. ComFusion integrates a class-scene prior preservation regularization, which composites the subject class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse generated images, ensuring they align effectively with both the instance image and scene texts. Consequently, ComFusion maintains a delicate balance between capturing the essence of the subject and maintaining scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization have demonstrated its qualitative and quantitative superiority. ComFusion, a novel two-stream finetuning approach for personalized text-to-image generation that balances instance fidelity and scene fidelity across diverse scenes. Existing methods struggle to maintain both instance fidelity (visual congruence with the instance image) and scene fidelity (aligning generated scenes with text prompts), especially in few-shot personalized generation. ComFusion employs a composite stream with class-scene prior loss to preserve class and scene knowledge from the pretrained model, and a fusion stream with visual-textual matching loss to fuse instance visual features with textual scene information. ComFusion outperforms baselines in quantitative metrics (CLIP-I, DINO, CLIP-T) demonstrating superior instance and scene fidelity. Human perceptual studies confirm ComFusion generates images with significantly better instance and scene fidelity compared to baselines. Ablation studies validate the contribution of both the composite and fusion streams, highlighting the importance of class-scene prior preservation and visual-textual feature fusion. ComFusion shows limitations in understanding and rendering creative scenes, material properties, and complex composite semantics. Future work will focus on addressing these limitations to enhance the model's ability to handle more complex and nuanced scene descriptions. text-to-image generation, personalized image generation, diffusion models, few-shot learning, instance fidelity, scene fidelity
2402.11846 Report UnlearnCanvas: A Stylized Image Dataset to Benchmark Machine Unlearning for Diffusion Models Yihua Zhang, Yimeng Zhang, Yuguang Yao, Jinghan Jia, Jiancheng Liu, Xiaoming Liu, Sijia Liu The rapid advancement of diffusion models (DMs) has not only transformed various real-world industries but has also introduced negative societal concerns, including the generation of harmful content, copyright disputes, and the rise of stereotypes and biases. To mitigate these issues, machine unlearning (MU) has emerged as a potential solution, demonstrating its ability to remove undesired generative capabilities of DMs in various applications. However, by examining existing MU evaluation methods, we uncover several key challenges that can result in incomplete, inaccurate, or biased evaluations for MU in DMs. To address them, we enhance the evaluation metrics for MU, including the introduction of an often-overlooked retainability measurement for DMs post-unlearning. Additionally, we introduce UnlearnCanvas, a comprehensive high-resolution stylized image dataset that facilitates us to evaluate the unlearning of artistic painting styles in conjunction with associated image objects. We show that this dataset plays a pivotal role in establishing a standardized and automated evaluation framework for MU techniques on DMs, featuring 7 quantitative metrics to address various aspects of unlearning effectiveness. Through extensive experiments, we benchmark 5 state-of-the-art MU methods, revealing novel insights into their pros and cons, and the underlying unlearning mechanisms. Furthermore, we demonstrate the potential of UnlearnCanvas to benchmark other generative modeling tasks, such as style transfer. The UnlearnCanvas dataset, benchmark, and the codes to reproduce all the results in this work can be found at https://github.com/OPTML-Group/UnlearnCanvas. This paper introduces UnlearnCanvas, a large-scale, high-resolution dataset designed to benchmark machine unlearning (MU) in diffusion models, specifically focusing on unlearning artistic styles and objects. Existing MU evaluation methods for diffusion models suffer from limitations such as limited target diversity, imprecise evaluation, and a lack of retainability assessment, hindering the development and understanding of MU techniques. The authors curate UnlearnCanvas with dual style-object supervision and high stylistic consistency. They also propose an evaluation pipeline that includes metrics for unlearning effectiveness, in-domain and cross-domain retainability, generation quality, and efficiency. Retainability metrics are crucial for a comprehensive MU assessment, revealing significant performance differences not captured by unlearning accuracy alone. Cross-domain retainability is harder to maintain than in-domain retainability, highlighting a previously overlooked challenge. No single MU method excels in all aspects, indicating room for improvement and the need for a balanced approach. The study primarily focuses on Stable Diffusion v1.5; evaluating other diffusion models is left for future work. Exploring the impact of varying dataset sizes and prompt complexities on unlearning performance is an area for further investigation. machine unlearning, diffusion models, benchmarking, style transfer, generative ai
2402.11487 Report Visual Concept-driven Image Generation with Text-to-Image Diffusion Model Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the user him/herself) using a few sample image illustrations. However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains illusive. In this work, we propose a concept-driven TTI personalization framework that addresses these core challenges. We build on existing works that learn custom tokens for user-illustrated concepts, allowing those to interact with existing text tokens in the TTI model. However, importantly, to disentangle and better learn the concepts in question, we jointly learn (latent) segmentation masks that disentangle these concepts in user-provided image illustrations. We do so by introducing an Expectation Maximization (EM)-like optimization procedure where we alternate between learning the custom tokens and estimating masks encompassing corresponding concepts in user-supplied images. We obtain these masks based on cross-attention, from within the U-Net parameterized latent diffusion model and subsequent Dense CRF optimization. We illustrate that such joint alternating refinement leads to the learning of better tokens for concepts and, as a bi-product, latent masks. We illustrate the benefits of the proposed approach qualitatively and quantitatively (through user studies) with a number of examples and use cases that can combine up to three entangled concepts. This paper proposes a concept-driven text-to-image (TTI) personalization framework that disentangles multiple concepts from a single or multiple images for generating novel compositions and interactions. Existing TTI personalization methods struggle to generate images with multiple interacting user-specified concepts, especially when entangled within a single illustration. The method uses an Expectation Maximization (EM)-like optimization to jointly learn: 1) concept-specific tokens and 2) latent binary masks for each concept. It leverages cross-attention maps within the diffusion model to generate and refine these masks. Quantitative user studies show that the proposed method significantly outperforms baselines in generating faithful and controllable images with multiple interacting concepts. The approach can effectively disentangle concepts from a single image, removing the need for user-provided masks. It demonstrates strong performance in generating complex scenarios with interactions between user-specified concepts, including both cartoon and real-world instances. The current method focuses on generating interactions between a limited number of concepts (up to three). Future work could explore extending this framework to handle a wider array of interactions and more complex compositions. text-to-image generation, personalization, diffusion models, concept disentanglement, cross-attention
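A hedged, pseudocode-level sketch of the EM-like alternation described above: optimize the concept tokens under the current masks, then re-estimate per-concept masks from cross-attention and refine them with DenseCRF. All callables are placeholders, not the authors' code:

```python
def learn_concepts(images, init_tokens, update_tokens, attn_masks, densecrf_refine,
                   n_rounds=5, steps_per_round=200):
    """EM-like joint token/mask learning (hedged sketch).

    update_tokens(tokens, images, masks) -> tokens after one masked-diffusion
                                            optimization step (M-step-like)
    attn_masks(tokens, images)           -> per-concept cross-attention heatmaps
    densecrf_refine(heatmaps, images)    -> binarized, DenseCRF-refined masks (E-step-like)
    """
    tokens = init_tokens
    masks = [None] * len(images)  # start without masks (use the whole image)
    for _ in range(n_rounds):
        for _ in range(steps_per_round):
            tokens = update_tokens(tokens, images, masks)
        heatmaps = attn_masks(tokens, images)
        masks = densecrf_refine(heatmaps, images)
    return tokens, masks
```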
2402.11303 Report FViT: A Focal Vision Transformer with Gabor Filter Yulong Shi, Mingwei Sun, Yongshuai Wang, Rui Wang, Hui Sun, Zengqiang Chen Vision transformers have achieved encouraging progress in various computer vision tasks. A common belief is that this is attributed to the competence of self-attention in modeling the global dependencies among feature tokens. Unfortunately, self-attention still faces some challenges in dense prediction tasks, such as the high computational complexity and absence of desirable inductive bias. To address these issues, we revisit the potential benefits of integrating vision transformer with Gabor filter, and propose a Learnable Gabor Filter (LGF) by using convolution. As an alternative to self-attention, we employ LGF to simulate the response of simple cells in the biological visual system to input images, prompting models to focus on discriminative feature representations of targets from various scales and orientations. Additionally, we design a Bionic Focal Vision (BFV) block based on the LGF. This block draws inspiration from neuroscience and introduces a Multi-Path Feed Forward Network (MPFFN) to emulate the working way of biological visual cortex processing information in parallel. Furthermore, we develop a unified and efficient pyramid backbone network family called Focal Vision Transformers (FViTs) by stacking BFV blocks. Experimental results show that FViTs exhibit highly competitive performance in various vision tasks. Especially in terms of computational efficiency and scalability, FViTs show significant advantages compared with other counterparts. Code is available at https://github.com/nkusyl/FViT This paper proposes Focal Vision Transformers (FViTs), a family of efficient vision backbone networks that replace self-attention with a Learnable Gabor Filter (LGF) and introduce a Multi-Path Feed Forward Network (MPFFN) inspired by neuroscience. Self-attention in vision transformers suffers from high computational complexity, lack of local sensitivity, and absence of inductive bias. FViTs aim to address these issues by providing an efficient and scalable alternative. The paper designs LGF using convolution to simulate simple cell responses in the visual system. MPFFN emulates parallel information processing in the visual cortex. These components are combined in a hierarchical pyramid backbone network. FViTs achieve competitive performance on ImageNet classification compared to CNNs and vision transformers, demonstrating a good balance between accuracy and efficiency. Experiments on COCO object detection and instance segmentation show FViTs outperform ResNet and achieve competitive results with PVT and PoolFormer. Evaluations on ADE20K semantic segmentation task further confirm the effectiveness of FViTs in dense prediction tasks. The paper primarily focuses on image-level tasks and could explore more challenging video-related tasks. Further investigation into the combination of LGF and self-attention for potential synergistic effects is warranted. vision transformer, gabor filter, image classification, object detection, semantic segmentation
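As a rough illustration of the entry above, a Learnable Gabor Filter can be realized as a depthwise convolution whose kernels are Gabor functions with learnable parameters. The per-channel parameterization, kernel size, and initial values below are illustrative assumptions rather than the authors' exact design; this is a minimal sketch, not the paper's implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGaborFilter(nn.Module):
    """Sketch of an LGF: a depthwise conv whose kernels are Gabor functions
    with one set of learnable parameters per channel (assumed parameterization)."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.channels, self.k = channels, kernel_size
        self.sigma = nn.Parameter(torch.full((channels,), 2.0))    # Gaussian envelope width
        self.theta = nn.Parameter(torch.rand(channels) * math.pi)  # orientation
        self.lambd = nn.Parameter(torch.full((channels,), 4.0))    # wavelength
        self.psi = nn.Parameter(torch.zeros(channels))             # phase offset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = torch.arange(self.k, device=x.device, dtype=x.dtype) - self.k // 2
        yy, xx = torch.meshgrid(r, r, indexing="ij")
        xx, yy = xx.unsqueeze(0), yy.unsqueeze(0)                  # (1, k, k) grids
        cos_t = torch.cos(self.theta)[:, None, None]
        sin_t = torch.sin(self.theta)[:, None, None]
        xr = xx * cos_t + yy * sin_t                               # rotate coordinates per channel
        yr = -xx * sin_t + yy * cos_t
        envelope = torch.exp(-(xr ** 2 + yr ** 2) / (2 * self.sigma[:, None, None] ** 2))
        carrier = torch.cos(2 * math.pi * xr / self.lambd[:, None, None] + self.psi[:, None, None])
        kernel = (envelope * carrier).unsqueeze(1)                 # (C, 1, k, k) depthwise kernels
        return F.conv2d(x, kernel, padding=self.k // 2, groups=self.channels)
```

Dropping such a module in place of self-attention inside each block is roughly the shape of the BFV block described above, with the MPFFN handling parallel feature mixing.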
2402.11281 Report Can Large Multimodal Models Uncover Deep Semantics Behind Images? Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui Understanding the deep semantics of images is essential in the era dominated by social media. However, current research works primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of the inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' (LMMs) capacities of visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Utilizing DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis indicates that the integration of description texts during the inference process notably enhances LMMs' ability to perceive deep semantics. Furthermore, our dataset is divided into multiple categories, and we conducted a more detailed analysis within these categories. This paper introduces DEEPEVAL, a benchmark designed to assess the capabilities of Large Multimodal Models (LMMs) in understanding the deep semantics of images. Existing research primarily focuses on superficial image descriptions, neglecting the crucial aspect of deep semantic understanding, which is vital for comprehending the deeper meaning and message conveyed in visual content. DEEPEVAL comprises a human-annotated dataset of cartoon images and three progressive subtasks: Fine-grained Description Selection, In-depth Title Matching, and Deep Semantics Understanding. The authors evaluate nine open-source LMMs and GPT-4V(ision) using these tasks. There's a significant gap between the deep semantic comprehension abilities of current LMMs and humans. Integrating description texts during inference notably improves LMMs' ability to perceive deep semantics. LMMs exhibit varying performance across different image categories, with certain categories, such as 'Satirical,' posing greater challenges. The dataset is limited in terms of image categories and currently only includes cartoons. Images with potentially controversial deep semantics are excluded to ensure annotator consensus. large multimodal models, deep semantics, image understanding, benchmarking, visual reasoning
2402.11248 Report CoLLaVO: Crayon Large Language and Vision mOdel Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from `what objects are in the image?' or `which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting. The paper introduces CoLLaVO, a new large language and vision model that leverages a novel visual prompt tuning scheme called Crayon Prompt and a learning strategy called Dual QLoRA to significantly enhance object-level image understanding and achieve state-of-the-art zero-shot performance on various vision language tasks. Current Vision Language Models (VLMs) often lack sufficient object-level image understanding, which limits their performance on complex vision language tasks. This paper aims to address this issue by improving the object recognition and understanding capabilities of VLMs. The authors propose two key techniques: 1) Crayon Prompt: Inspired by panoptic segmentation maps, this method injects object-level semantic and numbering information into image embedding features at every attention layer. 2) Dual QLoRA: This learning strategy utilizes two QLoRA modules to efficiently train the model on both crayon instructions for object-level understanding and visual instruction tuning datasets for complex VL tasks, preventing catastrophic forgetting. CoLLaVO achieves state-of-the-art zero-shot performance on various VL benchmarks, including MME, MM-Bench, MM-Bench-Chinese, and Q-Bench. The Crayon Prompt, particularly the semantic embedding component, significantly improves object-level image understanding, as demonstrated by improved scores on tasks like MME-P. Dual QLoRA effectively integrates both crayon instructions and visual instruction tuning datasets, leading to superior performance compared to using either approach alone. The performance of Crayon Prompts relies on the accuracy and object class coverage of the external panoptic segmentation model. Future work includes exploring the integration of diverse visual prompts from various sources like object classification, captioning models, and open-object detection. vision language models, object-level image understanding, visual prompt tuning, crayon prompt, dual qlora
2402.11148 Report Knowledge Distillation Based on Transformed Teacher Matching Kaixiang Zheng, En-Hui Yang As a technique to bridge logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both teacher's logits and student's logits in KD. Motivated by some recent works, in this paper, we instead drop temperature scaling on the student side, and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of probability distribution, we show that in comparison with the original KD, TTM has an inherent Rényi entropy term in its objective function, which serves as an extra regularization term. Extensive experimental results demonstrate that thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance student's capability to match teacher's power transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). It is shown, by comprehensive experiments, that although WTTM is simple, it is effective, improves upon TTM, and achieves state-of-the-art accuracy performance. Our source code is available at https://github.com/zkxufo/TTM. This paper introduces Transformed Teacher Matching (TTM), a knowledge distillation (KD) variant that removes temperature scaling from the student model, leading to improved generalization due to an inherent Rényi entropy regularization. This work provides a novel understanding of temperature scaling in KD, showing it's better to apply it only to the teacher model. This leads to improved generalization and provides a new theoretical framework for KD. The authors reinterpret temperature scaling as a probability distribution power transform. By removing temperature scaling from the student in KD, they derive TTM and show it embeds a Rényi entropy regularizer, improving generalization. They further enhance TTM with sample-adaptive weighting, resulting in Weighted TTM (WTTM). TTM consistently outperforms KD in image classification tasks on CIFAR-100 and ImageNet. WTTM further improves upon TTM by adaptively weighting the distillation loss based on sample difficulty. WTTM achieves state-of-the-art accuracy, even surpassing many complex feature-based distillation methods. The selection of the sample-adaptive weight in WTTM could be further optimized. Exploration of alternative probability distribution transforms beyond the power transform could yield additional benefits. knowledge distillation, temperature scaling, rényi entropy, regularization, image classification
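The core change in TTM, applying temperature scaling (the power transform) to the teacher only, can be sketched as below. The combination with cross-entropy on ground-truth labels and WTTM's sample-adaptive weighting coefficient are omitted and not reproduced from the paper; this is a minimal sketch assuming standard logit inputs.

```python
import torch.nn.functional as F

def ttm_loss(student_logits, teacher_logits, temperature: float = 4.0):
    # Power-transformed (temperature-scaled) teacher distribution ...
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # ... matched by the *unscaled* student distribution: no temperature on the student,
    # which the paper shows yields an inherent Rényi-entropy regularization term.
    log_q_student = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean")
```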
2402.10882 Report Universal Prompt Optimizer for Safe Text-to-Image Generation Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. However, these models are vulnerable to unsafe input to generate unsafe content like sexual, harassment and illegal-activity images. Existing studies based on image checker, model fine-tuning and embedding blocking are impractical in real-world applications. Hence, we propose the first universal prompt optimizer for safe T2I (POSI) generation in black-box scenario. We first construct a dataset consisting of toxic-clean prompt pairs by GPT-3.5 Turbo. To guide the optimizer to have the ability of converting toxic prompt to clean prompt while preserving semantic information, we design a novel reward function measuring toxicity and text alignment of generated images and train the optimizer through Proximal Policy Optimization. Experiments show that our approach can effectively reduce the likelihood of various T2I models in generating inappropriate images, with no significant impact on text alignment. It is also flexible to be combined with methods to achieve better performance. Our code is available at https://github.com/wzongyu/POSI. This paper proposes POSI, the first universal prompt optimizer for safe Text-to-Image (T2I) generation in a black-box scenario. POSI revises potentially harmful prompts to generate safe images while preserving semantic content. Existing safety measures for T2I models, like image checkers and model fine-tuning, have limitations in real-world applications. POSI offers a universal and flexible solution for enhancing the safety of black-box T2I models without requiring access to their internal structure. The methodology involves: (1) Constructing a toxic-clean prompt pairs dataset using GPT-3.5 Turbo. (2) Supervised fine-tuning (SFT) of a language model (LLaMA) on the dataset for basic prompt rewriting. (3) Designing a novel reward function that considers both the toxicity (using Q16 classifier) and text alignment (using CLIP similarity) of generated images. (4) Further training the language model using Proximal Policy Optimization (PPO) to maximize the reward and improve safe image generation. POSI effectively reduces the likelihood of generating inappropriate images across various T2I models, including SD versions and black-box models like DALL-E 3 and Midjourney. It maintains good text alignment, ensuring the generated images stay relevant to the user's original (though potentially harmful) prompt. The framework is flexible and can be combined with existing safety methods like SLD and SD-NP to further enhance their effectiveness. Balancing the trade-off between image safety and text alignment remains a challenge. Constructing datasets tailored to produce inappropriate images on specific T2I models like DALL-E 3 and Midjourney is crucial for future research and algorithm development. text-to-image generation, safe ai, prompt engineering, reinforcement learning, black-box optimization
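For illustration, the reward described above could be shaped roughly as follows. The linear combination and the weights alpha and beta are hypothetical; in POSI the toxicity and alignment scores come from the Q16 classifier and CLIP applied to the generated image, not from scalars passed in like this.

```python
def safety_reward(toxicity_prob: float, clip_similarity: float,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    # Reward rewritten prompts whose generated images are both unlikely to be
    # flagged as inappropriate and still aligned with the user's original intent.
    return beta * clip_similarity - alpha * toxicity_prob
```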
2402.10855 Report Control Color: Multimodal Diffusion-based Interactive Image Colorization Zhexin Liang, Zhaochen Li, Shangchen Zhou, Chongyi Li, Chen Change Loy Despite the existence of numerous colorization methods, several limitations still exist, such as lack of user interaction, inflexibility in local colorization, unnatural color rendering, insufficient color variation, and color overflow. To solve these issues, we introduce Control Color (CtrlColor), a multi-modal colorization method that leverages the pre-trained Stable Diffusion (SD) model, offering promising capabilities in highly controllable interactive image colorization. While several diffusion-based methods have been proposed, supporting colorization in multiple modalities remains non-trivial. In this study, we aim to tackle both unconditional and conditional image colorization (text prompts, strokes, exemplars) and address color overflow and incorrect color within a unified framework. Specifically, we present an effective way to encode user strokes to enable precise local color manipulation and employ a practical way to constrain the color distribution similar to exemplars. Apart from accepting text prompts as conditions, these designs add versatility to our approach. We also introduce a novel module based on self-attention and a content-guided deformable autoencoder to address the long-standing issues of color overflow and inaccurate coloring. Extensive comparisons show that our model outperforms state-of-the-art image colorization methods both qualitatively and quantitatively. CtrlColor, a novel multi-modal diffusion-based colorization framework is proposed, which unifies unconditional, prompt-, stroke-, and exemplar-based image colorization in a single framework. Existing colorization methods have limitations such as lack of user interaction, inflexibility in local colorization, unnatural color rendering, insufficient color variation, and color overflow. The framework leverages the pre-trained Stable Diffusion model, introduces a novel module for stroke encoding, employs a method to constrain color distribution similar to exemplars, and utilizes self-attention guidance and a content-guided deformable autoencoder to address color overflow and inaccurate coloring. CtrlColor outperforms state-of-the-art methods in terms of color richness, stability, and visual quality. The method effectively addresses color overflow and miscoloring issues. It offers highly precise and flexible control, enabling users to modify image color locally using strokes. Region coloring may not generate very colorful results for small regions in grayscale images. Exemplar-based colorization might not perfectly replicate complex color distributions from exemplars. image colorization, diffusion models, multi-modal learning, stable diffusion, interactive image editing
2402.10821 Report Training Class-Imbalanced Diffusion Model Via Overlap Optimization Divin Yan, Lu Qi, Vincent Tao Hu, Ming-Hsuan Yang, Meng Tang Diffusion models have made significant advances recently in high-quality image synthesis and related tasks. However, diffusion models trained on real-world datasets, which often follow long-tailed distributions, yield inferior fidelity for tail classes. Deep generative models, including diffusion models, are biased towards classes with abundant training images. To address the observed appearance overlap between synthesized images of rare classes and tail classes, we propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes. We show variants of our probabilistic contrastive learning method can be applied to any class conditional diffusion model. We show significant improvement in image synthesis using our loss for multiple datasets with long-tailed distribution. Extensive experimental results demonstrate that the proposed method can effectively handle imbalanced data for diffusion-based generation and classification models. Our code and datasets will be publicly available at https://github.com/yanliang3612/DiffROP. This paper proposes DiffROP, a novel framework to train class-imbalanced diffusion models by minimizing distribution overlap between head and tail classes using probabilistic contrastive learning. Diffusion models trained on real-world, long-tailed datasets often generate low-fidelity images for tail classes due to bias towards data-abundant head classes. The method introduces a probabilistic contrastive learning (PCL) loss to penalize the KL divergence between conditional image distributions of different classes, effectively minimizing it using estimated noise from image pairs. DiffROP significantly improves FID scores and other metrics on CIFAR10LT and CIFAR100LT datasets, indicating better image fidelity and diversity. The method consistently enhances performance across different class categories, particularly for tail classes, showing its robustness to dataset imbalances. Integrating DiffROP for data augmentation in long-tailed classification tasks leads to notable improvements in accuracy, precision, and recall. The study primarily focuses on image synthesis; further exploration is needed for other data modalities. Fine-tuning the classifier-free guidance strength (ω) is crucial for optimal performance and requires careful consideration. diffusion models, class imbalance, long-tailed distribution, probabilistic contrastive learning, image synthesis
2402.10739 Report PointMamba: A Simple State Space Model for Point Cloud Analysis Dingkang Liang, Xin Zhou, Xinyu Wang, Xingkui Zhu, Wei Xu, Zhikang Zou, Xiaoqing Ye, Xiang Bai Transformers have become one of the foundational architectures in point cloud analysis tasks due to their excellent global modeling ability. However, the attention mechanism has quadratic complexity and is difficult to extend to long sequence modeling due to limited computational resources and so on. Recently, state space models (SSM), a new family of deep sequence models, have presented great potential for sequence modeling in NLP tasks. In this paper, taking inspiration from the success of SSM in NLP, we propose PointMamba, a framework with global modeling and linear complexity. Specifically, by taking embedded point patches as input, we proposed a reordering strategy to enhance SSM's global modeling ability by providing a more logical geometric scanning order. The reordered point tokens are then sent to a series of Mamba blocks to causally capture the point cloud structure. Experimental results show our proposed PointMamba outperforms the transformer-based counterparts on different point cloud analysis datasets, while significantly saving about 44.3% parameters and 25% FLOPs, demonstrating the potential option for constructing foundational 3D vision models. We hope our PointMamba can provide a new perspective for point cloud analysis. The code is available at https://github.com/LMD0311/PointMamba. This paper introduces PointMamba, a novel state space model (SSM) designed for point cloud analysis, achieving global modeling capabilities with linear complexity, making it a potential cornerstone for 3D vision foundation models. Existing Transformer-based models, while effective for point cloud analysis, suffer from quadratic complexity, hindering their scalability to long sequences. PointMamba addresses this limitation by leveraging the efficiency of SSMs while maintaining global receptive fields. PointMamba utilizes a point tokenizer to generate point tokens from input point clouds. A reordering strategy then organizes these tokens based on geometric coordinates, facilitating causal structure capturing by the subsequent Mamba blocks. The model is pre-trained using an asymmetric autoencoder with a masked point reconstruction objective. PointMamba demonstrates competitive performance against Transformer-based counterparts on ModelNet40 and ShapeNetPart datasets, achieving comparable or superior accuracy with significantly reduced parameters and FLOPs. It outperforms Point-MAE in various ScanObjectNN benchmark tasks, showcasing its robustness in real-world object classification. The model exhibits superior memory efficiency for processing lengthy sequences compared to ViT-based approaches, making it suitable for large-scale point cloud analysis. The current reordering strategy, while effective, involves tripling the sequence length, which may limit the model's capacity to handle extremely long sequences. The pre-training strategy adopted from Point-MAE is not specifically tailored for the unidirectional nature of SSMs, leaving room for further optimization. point cloud analysis, state space model, mamba, global modeling, linear complexity
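The reordering step above can be pictured as scanning the embedded point patches in a geometry-aware order before feeding them to the Mamba blocks. The axis-wise scheme below, which sorts tokens by the x, y, and z coordinates of their patch centers and concatenates the three scans, is an assumption about the exact ordering, though it is consistent with the tripled sequence length noted in the limitations.

```python
import torch

def reorder_point_tokens(tokens: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """tokens: (B, N, C) embedded point patches; centers: (B, N, 3) patch centers.
    Returns (B, 3N, C): the tokens scanned along x, then y, then z."""
    scans = []
    for axis in range(3):
        order = centers[..., axis].argsort(dim=1)                       # (B, N) scan order
        gather_idx = order.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        scans.append(torch.gather(tokens, 1, gather_idx))               # tokens in that order
    return torch.cat(scans, dim=1)                                      # concatenated scans
```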
2402.10636 Report PEGASUS: Personalized Generative 3D Avatars with Composable Attributes Hyunsoo Cha, Byungjun Kim, Hanbyul Joo We present PEGASUS, a method for constructing a personalized generative 3D face avatar from monocular video sources. Our generative 3D avatar enables disentangled controls to selectively alter the facial attributes (e.g., hair or nose) while preserving the identity. Our approach consists of two stages: synthetic database generation and constructing a personalized generative avatar. We generate a synthetic video collection of the target identity with varying facial attributes, where the videos are synthesized by borrowing the attributes from monocular videos of diverse identities. Then, we build a person-specific generative 3D avatar that can modify its attributes continuously while preserving its identity. Through extensive experiments, we demonstrate that our method of generating a synthetic database and creating a 3D generative avatar is the most effective in preserving identity while achieving high realism. Subsequently, we introduce a zero-shot approach to achieve the same goal of generative modeling more efficiently by leveraging a previously constructed personalized generative model. PEGASUS is a novel method for creating personalized, generative 3D face avatars from monocular videos, allowing for disentangled control over facial attributes (e.g., hair, nose) while preserving identity. Personalized and controllable 3D avatars are important for various applications, including AR/VR and the metaverse. Existing methods often lack the ability to alter facial attributes or struggle to maintain identity. The method involves two stages: (1) generating a synthetic database of the target individual with varying facial attributes by swapping parts from other videos, and (2) training a personalized generative 3D avatar model using this database. Additionally, a zero-shot transfer approach leverages previously constructed models for efficient avatar creation. PEGASUS outperforms baseline methods in preserving identity and naturalness when transferring hairstyles. The synthetic database generation with part-swapping leads to better generative performance compared to using original videos directly. The zero-shot transfer approach efficiently creates personalized avatars without additional training, showing high identity preservation. The quality of generated avatars does not yet reach photorealistic levels and exhibits artifacts. The reliance on non-physical-based methods for synthetic database generation limits physical accuracy. 3d face avatar, generative model, personalized avatar, part swapping, zero-shot transfer
2402.10491 Report Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time. This paper presents a novel self-cascade diffusion model for rapid adaptation of pre-trained models to higher resolutions for image and video generation. Existing diffusion models face challenges in generating images of varying sizes due to single-scale training data, and adapting them to higher resolutions is computationally expensive and often results in poor composition and generation quality. The method utilizes a pivot-guided noise re-scheduling strategy to progressively synthesize higher resolution images by reusing the knowledge from a well-trained low-resolution model. It introduces lightweight, learnable upsampling modules to further improve the adaptation with minimal fine-tuning on a small amount of high-resolution data. The approach achieves a 5x training speed-up compared to full fine-tuning and requires only 0.002M additional parameters. It demonstrates state-of-the-art performance in both tuning-free and tuning settings across various scale adaptations for both image and video generation. The method efficiently adapts to higher resolutions with minimal additional inference time. The performance of the method may be limited when the scale gap is too large due to the small number of parameters in the upsampling modules. Future work will explore the trade-off between adaptation efficiency and generalization ability. diffusion models, image generation, video generation, resolution adaptation, self-cascade
2402.10401 Report ManiFPT: Defining and Analyzing Fingerprints of Generative Models Hae Jin Song, Mahyar Khayatkhoei, Wael AbdAlmageed Recent works have shown that generative models leave traces of their underlying generative process on the generated samples, broadly referred to as fingerprints of a generative model, and have studied their utility in detecting synthetic images from real ones. However, the extent to which these fingerprints can distinguish between various types of synthetic images and help identify the underlying generative process remains under-explored. In particular, the very definition of a fingerprint remains unclear, to our knowledge. To that end, in this work, we formalize the definition of artifact and fingerprint in generative models, propose an algorithm for computing them in practice, and finally study its effectiveness in distinguishing a large array of different generative models. We find that using our proposed definition can significantly improve the performance on the task of identifying the underlying generative process from samples (model attribution) compared to existing methods. Additionally, we study the structure of the fingerprints, and observe that it is very predictive of the effect of different design choices on the generative process. This work proposes a formal definition of fingerprints in generative models, based on the deviation of generated samples from the true data manifold. Identifying the source of synthetic data is crucial for various applications, including differentiating authorized from malicious personification and detecting digital copyright infringement. Existing works lack a clear definition of generative model fingerprints, hindering systematic study and comparison. The authors define an artifact as the difference between a generated sample and its closest point on the true data manifold. The fingerprint of a generative model is then defined as the set of all its artifacts. They propose an algorithm to compute these artifacts by estimating the data manifold from real samples in a chosen embedding space (RGB, Frequency, or learned spaces). The proposed fingerprint definition, when used as features for model attribution, outperforms existing methods on four different datasets. Analysis of feature spaces shows that the proposed fingerprint representations exhibit better separability compared to baselines. The clustering structure of the fingerprints reveals a strong alignment with the choice of upsampling methods and loss functions used in generative models, confirming common intuitions about model limitations. The estimation of the true data manifold relies on finite samples, which might not perfectly represent the underlying manifold. Future work includes investigating the impact of different embedding spaces and distance metrics on fingerprint quality and exploring techniques to improve manifold estimation. generative models, fingerprinting, model attribution, data manifold, deep learning
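Concretely, with the data manifold estimated by a finite set of real samples in some embedding space, each artifact is a generated sample's offset from its nearest real sample. A minimal sketch, assuming both sets are already embedded as row vectors and using Euclidean distance (the choice of embedding and metric varies in the paper):

```python
import torch

def compute_artifacts(generated: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """generated: (M, D) and real: (N, D) embeddings. Returns (M, D) artifact vectors."""
    dists = torch.cdist(generated, real)        # pairwise distances to the real samples
    nearest = dists.argmin(dim=1)               # closest point on the estimated manifold
    return generated - real[nearest]            # offsets; their collection is the fingerprint
```

These artifact vectors are then what would be fed to a downstream classifier for model attribution.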
2402.10294 Report LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, Raj Sodhi Video creation has become increasingly popular, yet the expertise and effort required for editing often pose barriers to beginners. In this paper, we explore the integration of large language models (LLMs) into the video editing workflow to reduce these barriers. Our design vision is embodied in LAVE, a novel system that provides LLM-powered agent assistance and language-augmented editing features. LAVE automatically generates language descriptions for the user's footage, serving as the foundation for enabling the LLM to process videos and assist in editing tasks. When the user provides editing objectives, the agent plans and executes relevant actions to fulfill them. Moreover, LAVE allows users to edit videos through either the agent or direct UI manipulation, providing flexibility and enabling manual refinement of agent actions. Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE's effectiveness. The results also shed light on user perceptions of the proposed LLM-assisted editing paradigm and its impact on users' creativity and sense of co-creation. Based on these findings, we propose design implications to inform the future development of agent-assisted content editing. This paper presents LAVE, a video editing tool that integrates Large Language Models (LLMs) to provide agent assistance and language-augmented editing features, aiming to lower editing barriers for beginners and enhance the editing workflow. Video creation is popular, but the complexity of editing poses challenges for beginners. LAVE addresses these challenges by leveraging LLMs' linguistic capabilities to assist users throughout the editing process, from ideation to execution. LAVE combines a language-augmented video gallery with an LLM-based plan-and-execute agent. It automatically generates textual descriptions of videos, enabling the agent to understand and manipulate them based on user instructions. User study participants successfully produced videos using LAVE and found it enjoyable and efficient. Users appreciated the novelty of LAVE's language-driven interaction and its potential to democratize video editing. The study revealed varying preferences for agent assistance, emphasizing the need for adaptive support tailored to user needs and task types. LAVE's current agent design and editing functions could be further enhanced, for example, by incorporating multi-agent systems and more fine-grained editing controls. Future work can address limitations related to LLM capabilities, such as the limited context window and occasional factual inaccuracies. video editing, llms, agents, human-ai co-creation, language augmentation
2402.10210 Report Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, Quanquan Gu Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI), especially when compared with the remarkable progress made in fine-tuning Large Language Models (LLMs). While cutting-edge diffusion models such as Stable Diffusion (SD) and SDXL rely on supervised fine-tuning, their performance inevitably plateaus after seeing a certain volume of data. Recently, reinforcement learning (RL) has been employed to fine-tune diffusion models with human preference data, but it requires at least two images ("winner" and "loser" images) for each text prompt. In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion), where the diffusion model engages in competition with its earlier versions, facilitating an iterative self-improvement process. Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment. Our experiments on the Pick-a-Pic dataset reveal that SPIN-Diffusion outperforms the existing supervised fine-tuning method in aspects of human preference alignment and visual appeal right from its first iteration. By the second iteration, it exceeds the performance of RLHF-based methods across all metrics, achieving these results with less data. This paper introduces SPIN-Diffusion, a novel self-play fine-tuning method for diffusion models that effectively utilizes datasets with only one image per text prompt. Fine-tuning diffusion models like Stable Diffusion often plateaus with limited data, and existing reinforcement learning methods require multiple images per prompt, limiting their applicability to common datasets. SPIN-Diffusion leverages a self-play mechanism where the diffusion model competes against its earlier versions, iteratively improving its performance through a decomposed training objective based on differentiating and deceiving the test function. SPIN-Diffusion outperforms supervised fine-tuning and existing Diffusion-DPO methods in human preference alignment and visual appeal. The method surpasses baselines on the Pick-a-Pic dataset, achieving superior scores in metrics like PickScore and Aesthetic score. Theoretical analysis shows SPIN-Diffusion's stationary point aligns with the target data distribution, outperforming traditional supervised fine-tuning. The paper mainly focuses on text-to-image generation, and further investigation is needed for other applications of diffusion models. Future work could explore incorporating human feedback during the fine-tuning process to further enhance performance. diffusion models, self-play fine-tuning, text-to-image generation, generative ai, stable diffusion
2402.10208 Report Recovering the Pre-Fine-Tuning Weights of Generative Models Eliahu Horwitz, Jonathan Kahana, Yedid Hoshen The dominant paradigm in generative modeling consists of two steps: i) pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained model with human values via fine-tuning. This practice is considered safe, as no current method can recover the unsafe, pre-fine-tuning model weights. In this paper, we demonstrate that this assumption is often false. Concretely, we present Spectral DeTuning, a method that can recover the weights of the pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In contrast to previous attacks that attempt to recover pre-fine-tuning capabilities, our method aims to recover the exact pre-fine-tuning weights. Our approach exploits this new vulnerability against large-scale models such as a personalized Stable Diffusion and an aligned Mistral. This paper identifies a new vulnerability in LoRA fine-tuned models, enabling the recovery of pre-fine-tuning weights using multiple models fine-tuned from the same source. This vulnerability poses significant security and safety risks, as it allows access to potentially unsafe pre-trained models even after alignment fine-tuning. The authors propose Spectral DeTuning, an iterative low-rank matrix factorization method that exploits the low-rank nature of LoRA updates to recover the original weights. Spectral DeTuning successfully recovers pre-fine-tuning weights with high precision on various models, including ViT, Stable Diffusion, and Mistral-7B. The method effectively reverses alignment training in Mistral-7B, restoring pre-fine-tuning generation capabilities. Stable Diffusion LoRAs obtained from online marketplaces are also vulnerable, demonstrating the real-world applicability of this attack. Spectral DeTuning requires several LoRA models with the same rank to be effective. The paper primarily focuses on LoRA and does not address other fine-tuning techniques. model security, lora, fine-tuning, weight recovery, alignment attack
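One way to picture the attack is as separating a shared base matrix from per-model low-rank (LoRA) residuals across several fine-tunes of the same layer. The alternating scheme below is a hedged sketch under that framing; the initialization, scheduling, and convergence details of the authors' Spectral DeTuning are not reproduced here.

```python
import torch

def recover_base_weights(fine_tuned_mats, rank: int, iters: int = 100) -> torch.Tensor:
    """fine_tuned_mats: list of (out, in) matrices fine-tuned from the same base with
    rank-`rank` LoRA updates. Returns an estimate of the shared pre-fine-tuning matrix."""
    W = torch.stack(fine_tuned_mats)                       # (M, out, in)
    base = W.mean(dim=0)                                   # crude initial guess
    for _ in range(iters):
        low_rank = []
        for Wi in W:                                       # project each residual to rank r
            U, S, Vh = torch.linalg.svd(Wi - base, full_matrices=False)
            low_rank.append(U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank])
        base = (W - torch.stack(low_rank)).mean(dim=0)     # re-estimate the shared base
    return base
```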
2402.10193 Report BitDelta: Your Fine-Tune May Only Be Worth One Bit James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings. BitDelta quantizes the weight delta between fine-tuned and pre-trained LLMs down to 1 bit without hurting performance. Storing and serving numerous fine-tuned LLMs is expensive. BitDelta addresses this by compressing the fine-tuning information (the delta) significantly. BitDelta first quantizes the delta to 1-bit by taking the sign. It then calibrates per-matrix scaling factors via distillation on a small dataset. Quantizing the delta to 1-bit leads to over 10x compression. BitDelta achieves comparable performance to the original fine-tuned models across various tasks, model families (Llama-2, Mistral), and sizes (7B-70B). Preliminary results with a custom Triton kernel show that BitDelta can lead to a 2x speedup in multi-tenant serving latency. The current implementation of the efficient inference kernel is not fully optimized. Potential alignment degradation due to the lossy compression of fine-tuning information needs further investigation. model compression, quantization, large language models, multi-tenant serving, parameter-efficient fine-tuning
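The compression step described above keeps only the sign of the weight delta plus one scale per matrix. In the sketch below the scale is initialized to the mean absolute delta (a natural choice for a pure sign code, offered here as an assumption); the distillation-based calibration of these per-matrix scales is not shown.

```python
import torch

def bitdelta_compress(w_finetuned: torch.Tensor, w_base: torch.Tensor):
    delta = w_finetuned - w_base
    sign = torch.sign(delta)              # 1 bit of direction per weight
    scale = delta.abs().mean()            # per-matrix scale; the paper calibrates it by distillation
    return sign, scale

def bitdelta_reconstruct(w_base: torch.Tensor, sign: torch.Tensor, scale: torch.Tensor):
    # Serve many fine-tunes from one high-precision base plus cheap 1-bit deltas.
    return w_base + scale * sign
```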
2402.10093 Report MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations Benedikt Alkin, Lukas Miklautz, Sepp Hochreiter, Johannes Brandstetter We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning boost for pre-trained MIM models. The motivation behind MIM-Refiner is rooted in the insight that optimal representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner leverages multiple contrastive heads that are connected to diverse intermediate layers. In each head, a modified nearest neighbor objective helps to construct respective semantic clusters. The refinement process is short but effective. Within a few epochs, we refine the features of MIM models from subpar to state-of-the-art, off-the-shelf features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K, achieves new state-of-the-art results in linear probing (84.7%) and low-shot classification among models that are pre-trained on ImageNet-1K. In ImageNet-1K 1-shot classification, MIM-Refiner sets a new state-of-the-art of 64.2%, outperforming larger models that were trained on up to 2000x more data such as DINOv2-g, OpenCLIP-G and MAWS-6.5B. Project page: https://ml-jku.github.io/MIM-Refiner Introduces MIM-Refiner, a method using contrastive learning to boost pre-trained Masked Image Modeling (MIM) models by leveraging representations in intermediate layers. MIM models often have subpar representations in later encoder blocks due to their lightweight decoders, limiting downstream task performance. MIM-Refiner attaches multiple contrastive heads to intermediate encoder blocks, including those with peak representation quality. It employs Nearest Neighbor Alignment (NNA), aligning each sample with its nearest neighbor while repelling others, to enforce semantic clusters. Refined MIM models achieve state-of-the-art linear probing (84.7%) and low-shot classification on ImageNet-1K among models pre-trained on the same dataset. MIM-Refiner advances ImageNet-1K 1-shot classification to 64.2%, surpassing larger models trained on significantly more data. Significantly improved clustering performance, as measured by metrics like ACC and NMI, indicating better-defined semantic clusters. Reliance on batch normalization in ID heads limits scalability to distributed setups. Exploration of alternative solutions to the lightweight decoder issue, such as larger decoders or different training schemes. self-supervised learning, masked image modeling, contrastive learning, instance discrimination, vision transformer
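A hedged sketch of a nearest-neighbor contrastive head of the kind attached to intermediate blocks: each embedding is attracted to its nearest neighbor in a memory queue and repelled from the remaining entries. Queue maintenance, the placement of multiple heads, and the exact NNA objective from the paper are not reproduced here.

```python
import torch
import torch.nn.functional as F

def nn_alignment_loss(z: torch.Tensor, queue: torch.Tensor, temperature: float = 0.2):
    """z: (B, D) embeddings from one intermediate head; queue: (K, D) stored embeddings."""
    z = F.normalize(z, dim=-1)
    queue = F.normalize(queue, dim=-1)
    logits = z @ queue.t() / temperature        # similarity to every queue entry
    positives = logits.argmax(dim=-1)           # nearest neighbour serves as the positive
    return F.cross_entropy(logits, positives)   # attract the NN, repel the rest
```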
2402.09966 Report Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation Junjie Shentu, Matthew Watson, Noura Al Moubayed Subject-driven text-to-image diffusion models empower users to tailor the model to new concepts absent in the pre-training dataset using a few sample images. However, prevalent subject-driven models primarily rely on single-concept input images, facing challenges in specifying the target concept when dealing with multi-concept input images. To this end, we introduce a textual localized text-to-image model (Textual Localization) to handle multi-concept input images. During fine-tuning, our method incorporates a novel cross-attention guidance to decompose multiple concepts, establishing distinct connections between the visual representation of the target concept and the identifier token in the text prompt. Experimental results reveal that our method outperforms or performs comparably to the baseline models in terms of image fidelity and image-text alignment on multi-concept input images. In comparison to Custom Diffusion, our method with hard guidance achieves CLIP-I scores that are 7.04%, 8.13% higher and CLIP-T scores that are 2.22%, 5.85% higher in single-concept and multi-concept generation, respectively. Notably, our method generates cross-attention maps consistent with the target concept in the generated images, a capability absent in existing models. This paper introduces Textual Localization, a subject-driven text-to-image model designed to handle multi-concept input images for generating customized images containing new concepts. Existing subject-driven text-to-image models struggle to specify target concepts within multi-concept images, often generating all concepts present in the input. Textual Localization incorporates cross-attention guidance during fine-tuning to decompose multi-concept images, establishing distinct connections between the visual representation of the target concept and its identifier token in the text prompt. Two guidance strategies are explored: hard and soft guidance. The method outperforms or performs comparably to baseline models in terms of image fidelity and image-text alignment in both single-concept and multi-concept generation. Hard guidance proves particularly effective for multi-concept generation, achieving superior image fidelity and accurately outlining target concepts in cross-attention maps. Optimizing specific parameters (Wk and Wv matrices in cross-attention layers) is identified as crucial for balancing visual representation learning and semantic knowledge preservation. The model exhibits limitations in capturing intricate details of target concepts. Future work will focus on enhancing detail capture, potentially by using more powerful feature extractors, and improving multi-concept generation success rates via guiding techniques during inference. text-to-image generation, subject-driven generation, diffusion models, cross-attention guidance, multi-concept images
2402.09872 Report Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community Arman Isajanyan, Artur Shatveryan, David Kocharyan, Zhangyang Wang, Humphrey Shi Social reward as a form of community recognition provides a strong source of motivation for users of online platforms to engage and contribute with content. The recent progress of text-conditioned image synthesis has ushered in a collaborative era where AI empowers users to craft original visual artworks seeking community validation. Nevertheless, assessing these models in the context of collective community preference introduces distinct challenges. Existing evaluation methods predominantly center on limited size user studies guided by image quality and prompt alignment. This work pioneers a paradigm shift, unveiling Social Reward - an innovative reward modeling framework that leverages implicit feedback from social network users engaged in creative editing of generated images. We embark on an extensive journey of dataset curation and refinement, drawing from Picsart: an online visual creation and editing platform, yielding a first million-user-scale dataset of implicit human preferences for user-generated visual art named Picsart Image-Social. Our analysis exposes the shortcomings of current metrics in modeling community creative preference of text-to-image models' outputs, compelling us to introduce a novel predictive model explicitly tailored to address these limitations. Rigorous quantitative experiments and user study show that our Social Reward model aligns better with social popularity than existing metrics. Furthermore, we utilize Social Reward to fine-tune text-to-image models, yielding images that are more favored by not only Social Reward, but also other established metrics. These findings highlight the relevance and effectiveness of Social Reward in assessing community appreciation for AI-generated artworks, establishing a closer alignment with users' creative goals: creating popular visual art. Codes can be accessed at https://github.com/Picsart-AI-Research/Social-Reward This work introduces "Social Reward", a novel reward modeling framework for text-to-image synthesis that leverages implicit feedback from social network users engaged in creative editing of generated images. Assessing text-to-image models in the context of collective community preference is crucial, especially for creative editing, but existing methods are limited by small user studies focused on image quality and prompt alignment. The authors curate a million-user-scale dataset of implicit human preferences from Picsart, a visual creation platform, and develop a predictive model that leverages collective implicit feedback from users who employ generated images for creative purposes. Existing metrics fall short in capturing community-level creative preference for text-to-image model outputs. The Social Reward model outperforms existing metrics in predicting social popularity of generated images for creative editing. Fine-tuning text-to-image models with Social Reward improves alignment with both Social Reward and other established metrics. Social Reward is currently focused on creative editing and might not generalize to other domains. Further research is needed to investigate the impact of specific editing actions on Social Reward. text-to-image synthesis, reward modeling, social network popularity, creative editing, human preference learning
2402.09812 Report DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization Jisu Nam, Heesu Kim, DongJae Lee, Siyoon Jin, Seungryong Kim, Seunggyu Chang The objective of text-to-image (T2I) personalization is to customize a diffusion model to a user-provided reference concept, generating diverse images of the concept aligned with the target prompts. Conventional methods representing the reference concepts using unique text embeddings often fail to accurately mimic the appearance of the reference. To address this, one solution may be explicitly conditioning the reference images into the target denoising process, known as key-value replacement. However, prior works are constrained to local editing since they disrupt the structure path of the pre-trained T2I model. To overcome this, we propose a novel plug-in method, called DreamMatcher, which reformulates T2I personalization as semantic matching. Specifically, DreamMatcher replaces the target values with reference values aligned by semantic matching, while leaving the structure path unchanged to preserve the versatile capability of pre-trained T2I models for generating diverse structures. We also introduce a semantic-consistent masking strategy to isolate the personalized concept from irrelevant regions introduced by the target prompts. Compatible with existing T2I models, DreamMatcher shows significant improvements in complex scenarios. Intensive analyses demonstrate the effectiveness of our approach. DreamMatcher is a novel plug-in method for text-to-image (T2I) personalization that enhances subject appearance by transferring reference appearance while preserving diverse structures guided by target prompts. Existing T2I personalization methods often fail to accurately mimic the appearance of subjects, especially in complex non-rigid scenarios, due to the limited expressivity of text embeddings or disruptions to the target structure path. DreamMatcher leverages semantic matching within a reference-target dual-branch framework. It utilizes appearance matching self-attention (AMA) to align reference appearance with the target structure while maintaining the pre-trained structure path. It also introduces semantic matching guidance to enhance fine-grained subject details and a semantic-consistent masking strategy to isolate the personalized concept from irrelevant regions. DreamMatcher significantly improves subject fidelity compared to existing T2I personalization methods, including Textual Inversion, DreamBooth, and CustomDiffusion, while effectively preserving prompt fidelity. DreamMatcher outperforms previous tuning-free plug-in methods, such as FreeU and MagicFusion, and even a learnable method, ViCo, in both quantitative metrics and user studies. Ablation studies confirm the effectiveness of each component, highlighting the importance of semantic matching, consistent masking, and matching guidance for achieving high-fidelity personalization. DreamMatcher may not handle stylization prompts that are not present in the reference images. The personalization quality can be affected by the selection of reference images, with reference images containing richer visual attributes leading to better results. text-to-image personalization, diffusion models, semantic matching, appearance transfer, plug-in methods
2402.09712 Report Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement Tao Yang, Cuiling Lan, Yan Lu, Nanning zheng Disentangled representation learning strives to extract the intrinsic factors within observed data. Factorizing these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias to facilitate the learning of disentangled representations. We propose to encode an image to a set of concept tokens and treat them as the condition of the latent diffusion for image reconstruction, where cross-attention over the concept tokens is used to bridge the interaction between the encoder and diffusion. Without any additional regularization, this framework achieves superior disentanglement performance on the benchmark datasets, surpassing all previous methods with intricate designs. We have conducted comprehensive ablation studies and visualization analysis, shedding light on the functioning of this model. This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs. We anticipate that our findings will inspire more investigation on exploring diffusion for disentangled representation learning towards more sophisticated data analysis and understanding. This paper, for the first time, shows that diffusion models with cross-attention can serve as a strong inductive bias for learning disentangled representations. Disentangled representation learning is crucial for enhancing interpretability, generalizability, and controllability of machine learning models but remains a challenging task. The paper proposes EncDiff, a simple framework where an image encoder transforms an image into concept tokens, which condition a latent diffusion model with cross-attention for image reconstruction. EncDiff achieves state-of-the-art disentanglement performance on benchmark datasets, outperforming previous methods with complex designs. Ablation studies confirm that both diffusion modeling and cross-attention interaction are crucial for the disentanglement capability. Visualization analysis provides insights into the alignment between learned concept tokens and spatial features, verifying the disentanglement effectiveness. While effective on simple datasets, achieving satisfactory disentanglement on complex data remains a challenge. Although faster than some diffusion-based methods, the generation speed is still slower compared to VAE-based and GAN-based methods, requiring more efficient sampling strategies in the future. disentangled representation learning, diffusion models, cross-attention, inductive bias, unsupervised learning
2402.09368 Report Magic-Me: Identity-Specific Video Customized Diffusion Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, Jiashi Feng Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In the field of text-to-image generation (T2I), subject-driven creation has achieved great progress with the identity controlled via reference images. However, its extension to video generation is not well explored. In this work, we propose a simple yet effective subject identity controllable video generation framework, termed Video Custom Diffusion (VCD). With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation at the initialization stage for stable video outputs. To achieve this, we propose three novel components that are essential for high-quality identity preservation and stable video generation: 1) a noise initialization method with 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID module based on extended Textual Inversion trained with the cropped identity to disentangle the ID information from the background 3) Face VCD and Tiled VCD modules to reinforce faces and upscale the video to higher resolution while preserving the identity's features. We conducted extensive experiments to verify that VCD is able to generate stable videos with better ID over the baselines. Besides, with the transferability of the encoded identity in the ID module, VCD is also working well with personalized text-to-image models available publicly. The codes are available at https://github.com/Zhen-Dong/Magic-Me. This paper proposes Video Custom Diffusion (VCD), a novel framework for generating high-quality, identity-specific videos. Creating videos with specific identities is challenging due to the difficulty of maintaining identity features and motion consistency across frames. VCD uses a three-stage approach: 1) T2V VCD generates initial low-resolution videos, 2) Face VCD enhances facial features, and 3) Tiled VCD upscales the video while preserving identity. A 3D Gaussian Noise Prior ensures stable motion, and an extended Textual Inversion-based ID module preserves identity. VCD generates videos with better identity preservation and text alignment compared to baselines. The 3D Gaussian Noise Prior significantly improves temporal consistency in generated videos. The ID module effectively disentangles identity information while aligning with user prompts. VCD faces challenges in generating videos with multiple interacting identities. The framework is currently limited to generating short videos due to the motion module's capacity. video generation, diffusion models, identity customization, text-to-video, motion consistency
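For the Magic-Me entry above, the 3D Gaussian Noise Prior amounts to drawing temporally correlated initial noise instead of independent per-frame noise. The sketch below is one way to realize that idea under an assumed AR(1)-style covariance (correlation decaying with frame distance); the paper's exact covariance construction may differ.

```python
import torch

def correlated_video_noise(n_frames, c, h, w, rho=0.9):
    """Returns [n_frames, c, h, w] noise; covariance between frames i and j is rho**|i-j|."""
    idx = torch.arange(n_frames)
    dist = (idx[:, None] - idx[None, :]).abs().float()
    cov = rho ** dist                                  # [F, F] Toeplitz correlation matrix
    L = torch.linalg.cholesky(cov)                     # frame-mixing matrix
    white = torch.randn(n_frames, c * h * w)
    return (L @ white).view(n_frames, c, h, w)         # each frame keeps unit marginal variance

noise = correlated_video_noise(16, 4, 64, 64)          # e.g. latent-space noise for a 16-frame clip
```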
2402.09240 Report Switch EMA: A Free Lunch for Better Flatness and Sharpness Siyuan Li, Zicheng Liu, Juanxi Tian, Ge Wang, Zedong Wang, Weiyang Jin, Di Wu, Cheng Tan, Tao Lin, Yang Liu, Baigui Sun, Stan Z. Li Exponential Moving Average (EMA) is a widely used weight averaging (WA) regularization to learn flat optima for better generalizations without extra cost in deep neural network (DNN) optimization. Despite achieving better flatness, existing WA methods might fall into worse final performances or require extra test-time computations. This work unveils the full potential of EMA with a single line of modification, i.e., switching the EMA parameters to the original model after each epoch, dubbed as Switch EMA (SEMA). From both theoretical and empirical aspects, we demonstrate that SEMA can help DNNs to reach generalization optima that better trade-off between flatness and sharpness. To verify the effectiveness of SEMA, we conduct comparison experiments with discriminative, generative, and regression tasks on vision and language datasets, including image classification, self-supervised learning, object detection and segmentation, image generation, video prediction, attribute regression, and language modeling. Comprehensive results with popular optimizers and networks show that SEMA is a free lunch for DNN training by improving performances and boosting convergence speeds. This paper introduces Switch Exponential Moving Average (SEMA), a novel weight averaging method that enhances deep neural network optimization by dynamically switching between a fast model and a slow, exponentially averaged model during training. The proposed SEMA method aims to overcome limitations of existing weight averaging techniques by combining the fast convergence of EMA with the ability to explore sharper, potentially better, local minima, thus improving generalization capabilities of DNNs. SEMA leverages the EMA algorithm but, crucially, switches the model parameters to the EMA-averaged weights after each training epoch, allowing the optimizer to further explore the loss landscape from this new starting point. SEMA consistently outperforms baseline models and other weight averaging techniques, including EMA and SWA, across a diverse range of tasks, such as image classification, self-supervised learning, object detection, and language modeling. SEMA demonstrates faster convergence speeds compared to traditional training setups and EMA. The paper provides theoretical analysis demonstrating SEMA's ability to reduce low-frequency oscillations and maintain a gradient descent property, contributing to its stability and fast convergence. The paper primarily focuses on empirical validation of SEMA, leaving further theoretical exploration of its properties and behavior in different optimization landscapes for future work. While the one-epoch switching interval proves effective across various tasks, exploring task-specific optimal intervals might further enhance SEMA's performance. deep neural networks, optimization, regularization, weight averaging, exponential moving average
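For the Switch EMA entry above, the method really is a one-line change to a standard EMA setup. The sketch below assumes an ordinary PyTorch training loop (model, optimizer, loader are placeholders); the only SEMA-specific part is copying the EMA weights back into the live model at the end of each epoch.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p, alpha=1 - decay)

def train_with_sema(model, loader, optimizer, loss_fn, epochs=10, decay=0.999):
    ema_model = copy.deepcopy(model)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
            ema_update(ema_model, model, decay)         # usual EMA tracking every step
        # the "switch": restart optimization from the averaged weights once per epoch
        model.load_state_dict(ema_model.state_dict())
    return model
```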
2402.09052 Report L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects Yutaro Yamada, Khyathi Chandu, Yuchen Lin, Jack Hessel, Ilker Yildirim, Yejin Choi Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet, these models are not robust in precisely reasoning about physical and spatial configurations of objects, especially when instructed with unconventional, thereby out-of-distribution descriptions, such as "a chair with five legs". In this paper, we propose a language agent with chain-of-3D-thoughts (L3GO), an inference-time approach that can reason about part-based 3D mesh generation of unconventional objects that current data-driven diffusion models struggle with. More concretely, we use large language models as agents to compose a desired object via trial-and-error within the 3D simulation environment. To facilitate our investigation, we develop a new benchmark, Unconventionally Feasible Objects (UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender where language agents can build and compose atomic building blocks via API calls. Human and automatic GPT-4V evaluations show that our approach surpasses the standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our approach outperforms other state-of-the-art text-to-2D image and text-to-3D models based on human evaluation. This paper introduces L3GO, an inference-time approach that uses language agents with chain-of-3D-thoughts to generate unconventional objects in a 3D environment. Existing diffusion-based image generation models struggle to accurately generate objects from unconventional descriptions requiring precise 3D spatial understanding. L3GO leverages the reasoning capabilities of LLMs to address this. L3GO decomposes object creation into iterative part-by-part generation, utilizing LLMs as agents for part specification, spatial reasoning, coordinate calculation, action execution, and critique within SimpleBlenv, a custom wrapper environment built on top of Blender. L3GO outperforms baseline LLM agents (GPT-4, ReAct-B, Reflexion-B) and achieves higher accuracy in generating 3D meshes on ShapeNet based on human and GPT-4V evaluation. Human evaluation on a new benchmark 'Unconventionally Feasible Objects (UFO)' shows L3GO surpasses state-of-the-art text-to-2D image (DALL-E 3, SDXL) and text-to-3D models (Shap-E). Ablation studies show the importance of the spatial critic and program-based coordinate calculation modules in L3GO. The quality of LLM-generated 3D meshes is not yet on par with human-designed meshes or those from diffusion-based methods. The object generation process can be time-consuming, particularly for complex objects, highlighting the need for efficiency improvements. 3d object generation, language agents, chain-of-thought, blender, unconventional objects
2402.08960 Report Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision Zhaoqing Wang, Xiaobo Xia, Ziye Chen, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu Contemporary cutting-edge open-vocabulary segmentation approaches commonly rely on image-mask-text triplets, yet this restricted annotation is labour-intensive and encounters scalability hurdles in complex real-world scenarios. Although some methods are proposed to reduce the annotation cost with only text supervision, the incompleteness of supervision severely limits the versatility and performance. In this paper, we liberate the strict correspondence between masks and texts by using independent image-mask and image-text pairs, which can be easily collected respectively. With this unpaired mask-text supervision, we propose a new weakly-supervised open-vocabulary segmentation framework (Uni-OVSeg) that leverages confident pairs of mask predictions and entities in text descriptions. Using the independent image-mask and image-text pairs, we predict a set of binary masks and associate them with entities by resorting to the CLIP embedding space. However, the inherent noise in the correspondence between masks and entities poses a significant challenge when obtaining reliable pairs. In light of this, we advocate using the large vision-language model (LVLM) to refine text descriptions and devise a multi-scale ensemble to stablise the matching between masks and entities. Compared to text-only weakly-supervised methods, our Uni-OVSeg achieves substantial improvements of 15.5% mIoU on the ADE20K datasets, and even surpasses fully-supervised methods on the challenging PASCAL Context-459 dataset. This paper proposes Uni-OVSeg, a novel weakly-supervised open-vocabulary segmentation framework that utilizes unpaired image-mask and image-text pairs for training, significantly reducing the annotation cost associated with traditional image-mask-text triplets. Open-vocabulary segmentation, crucial for segmenting and categorizing objects from an extensive vocabulary, often relies on expensive and difficult-to-obtain image-mask-text triplets. Uni-OVSeg addresses this challenge by leveraging more readily available unpaired data sources. Uni-OVSeg consists of a mask generation branch (using a visual prompt encoder, pixel decoder, and mask decoder) to predict binary masks from images. Concurrently, it employs a large vision-language model (LLaVa) for text refinement and a ChatGPT-based parser for entity extraction from image captions. A mask-text bipartite matching aligns predicted masks with text entities, aided by a multi-scale feature adapter and ensemble strategy for robust correspondence. Uni-OVSeg achieves substantial improvements over weakly-supervised methods, with a 15.5% mIoU gain on ADE20K. It even surpasses fully-supervised methods on the challenging PASCAL Context-459 dataset, demonstrating its strong open-vocabulary capability. In promptable segmentation tasks, Uni-OVSeg consistently outperforms SAM, showcasing its efficacy in interactive segmentation with point and box prompts. The inherent noise in unpaired mask-text correspondence presents a challenge, although mitigated by the multi-scale ensemble strategy. The use of multiple granularity masks in the image-mask training data impacts performance on panoptic segmentation tasks requiring specific instance differentiation. open-vocabulary segmentation, weakly-supervised learning, vision-language models, promptable segmentation, multi-scale ensemble
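For the Uni-OVSeg entry above, the mask-text bipartite matching step can be illustrated independently of the rest of the pipeline. The sketch below assumes pre-computed, L2-normalized CLIP embeddings for predicted masks and parsed entities (the embedding extraction, thresholds, and multi-scale ensembling are not shown); it only demonstrates scoring, Hungarian matching, and filtering of low-confidence pairs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_masks_to_entities(mask_emb, text_emb, min_sim=0.25):
    """mask_emb: [M, D], text_emb: [E, D], both L2-normalized CLIP embeddings."""
    sim = mask_emb @ text_emb.T                       # [M, E] cosine similarities
    rows, cols = linear_sum_assignment(-sim)          # maximize total matched similarity
    keep = sim[rows, cols] >= min_sim                 # drop noisy, low-confidence pairs
    return list(zip(rows[keep], cols[keep]))          # (mask index, entity index) pseudo-pairs

# e.g. with random stand-ins for the embeddings
m = np.random.randn(5, 512); m /= np.linalg.norm(m, axis=1, keepdims=True)
t = np.random.randn(3, 512); t /= np.linalg.norm(t, axis=1, keepdims=True)
pairs = match_masks_to_entities(m, t)
```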
2402.08919 Report Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding Alessandro Achille, Greg Ver Steeg, Tian Yu Liu, Matthew Trager, Carson Klingenberg, Stefano Soatto Quantifying the degree of similarity between images is a key copyright issue for image-based machine learning. In legal doctrine however, determining the degree of similarity between works requires subjective analysis, and fact-finders (judges and juries) can demonstrate considerable variability in these subjective judgement calls. Images that are structurally similar can be deemed dissimilar, whereas images of completely different scenes can be deemed similar enough to support a claim of copying. We seek to define and compute a notion of "conceptual similarity" among images that captures high-level relations even among images that do not share repeated elements or visually similar components. The idea is to use a base multi-modal model to generate "explanations" (captions) of visual data at increasing levels of complexity. Then, similarity can be measured by the length of the caption needed to discriminate between the two images: Two highly dissimilar images can be discriminated early in their description, whereas conceptually dissimilar ones will need more detail to be distinguished. We operationalize this definition and show that it correlates with subjective (averaged human evaluation) assessment, and beats existing baselines on both image-to-image and text-to-text similarity benchmarks. Beyond just providing a number, our method also offers interpretability by pointing to the specific level of granularity of the description where the source data are differentiated. This paper introduces CC:DAE, a novel method for measuring "conceptual similarity" between data samples like images and text, focusing on shared high-level concepts rather than just pixel-level visual similarities. Defining objective similarity is crucial for copyright in machine learning, but existing methods struggle to capture human-like understanding of conceptual relationships. CC:DAE generates textual descriptions of increasing complexity for each sample using a pre-trained language model. It then quantifies similarity based on how well descriptions of one sample fit the other at varying complexity levels. A small distance at high complexity signifies high conceptual similarity. CC:DAE outperforms existing zero-shot methods on text similarity benchmarks, aligning better with human judgments. It surpasses CLIP on image similarity tasks, demonstrating its ability to capture conceptual relations beyond visual features. The method generalizes to cross-modal comparisons, effectively measuring similarity between text and images. The current implementation relies solely on text descriptions, limiting its ability to capture visual arrangement similarities. Future work could explore incorporating visual features into the description space for a more comprehensive approach. conceptual similarity, copyright, multi-modal learning, language models, image similarity
2402.08875 Report Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos Yang Qian, Yinan Sun, Ali Kargarandehkordi, Onur Cezmi Mutlu, Saimourya Surabhi, Pingyi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington The increasing variety and quantity of tagged multimedia content on platforms such as TikTok provides an opportunity to advance computer vision modeling. We have curated a distinctive dataset of 283,582 unique video clips categorized under 386 hashtags relating to modern human actions. We release this dataset as a valuable resource for building domain-specific foundation models for human movement modeling tasks such as action recognition. To validate this dataset, which we name TikTokActions, we perform two sets of experiments. First, we pretrain the state-of-the-art VideoMAEv2 with a ViT-base backbone on TikTokActions subset, and then fine-tune and evaluate on popular datasets such as UCF101 and the HMDB51. We find that the performance of the model pre-trained using our Tik-Tok dataset is comparable to models trained on larger action recognition datasets (95.3% on UCF101 and 53.24% on HMDB51). Furthermore, our investigation into the relationship between pre-training dataset size and fine-tuning performance reveals that beyond a certain threshold, the incremental benefit of larger training sets diminishes. This work introduces a useful TikTok video dataset that is available for public use and provides insights into the marginal benefit of increasing pre-training dataset sizes for video-based foundation models. This paper investigates the use of a large, unlabeled dataset of TikTok videos for pre-training a foundation model (VideoMAEv2) for human action recognition. This is important because it leverages the diverse and dynamic nature of TikTok videos to improve action recognition in real-world scenarios and challenges the assumption that larger datasets are always better for pre-training. The authors curated a dataset of over 280,000 TikTok videos, pre-trained VideoMAEv2 on this dataset, and fine-tuned it on established benchmarks (UCF101, HMDB51, Kinetics-400, Something-Something V2). The fine-tuned model achieves competitive results on these benchmarks (95.3% on UCF101 and 53.24% on HMDB51), demonstrating the effectiveness of using TikTok videos for pre-training. The study found that while increasing the pre-training dataset size initially improves performance, the benefits diminish with further increases. This suggests that a well-curated, smaller dataset can sometimes be more effective than a larger, more general one. The study acknowledges the ethical considerations of using online video data, particularly regarding privacy and informed consent. Future work could explore the use of weekly self-supervised learning to further improve the model's adaptability to dynamic content. action recognition, foundation models, self-supervised learning, tiktok, video understanding
2402.08714 Report PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However, in the vision domain, existing RL-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. In this paper, we propose Proximal Reward Difference Prediction (PRDP), enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective that has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments, we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore, through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a diverse set of complex, unseen prompts whereas RL-based methods completely fail. This paper introduces PRDP, the first black-box reward finetuning method for diffusion models that remains stable even when trained on large-scale datasets (100K+ prompts). Existing reinforcement learning (RL) based methods for finetuning diffusion models with rewards struggle to scale to large datasets due to instability during training, limiting their ability to generalize to complex and unseen prompts. PRDP addresses instability by: 1. Converting the RLHF objective into a supervised regression objective called Reward Difference Prediction (RDP), where the model predicts the reward difference between generated image pairs. 2. Employing proximal updates and online optimization to further enhance training stability and generation quality. PRDP achieves comparable reward maximization to established RL-based methods in small-scale training. PRDP demonstrates superior stability in large-scale training where RL-based methods fail. PRDP generates higher quality images and generalizes better to unseen prompts after large-scale training. The per-prompt reward normalization, crucial for DDPO's stability, is ineffective in large-scale settings due to limited prompt occurrences. Future work could explore techniques to make reward normalization more effective in large-scale scenarios. diffusion models, reward finetuning, text-to-image synthesis, reinforcement learning, stable training
2402.08682 Report IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos Most text-to-3D generators build upon off-the-shelf text-to-image models trained on billions of images. They use variants of Score Distillation Sampling (SDS), which is slow, somewhat unstable, and prone to artifacts. A mitigation is to fine-tune the 2D generator to be multi-view aware, which can help distillation or can be combined with reconstruction networks to output 3D objects directly. In this paper, we further explore the design space of text-to-3D models. We significantly improve multi-view generation by considering video instead of image generators. Combined with a 3D reconstruction algorithm which, by using Gaussian splatting, can optimize a robust image-based loss, we directly produce high-quality 3D outputs from the generated views. Our new method, IM-3D, reduces the number of evaluations of the 2D generator network 10-100x, resulting in a much more efficient pipeline, better quality, fewer geometric inconsistencies, and higher yield of usable 3D assets. Introduces IM-3D, a text/image-to-3D generation approach that leverages iterative multiview diffusion and reconstruction using a video generator network and direct 3D fitting with Gaussian splatting, eliminating the need for Score Distillation Sampling (SDS) and reconstruction networks. Addresses limitations of SDS-based methods (slow, unstable, artifact-prone) and direct reconstruction methods (limited quality) by improving multi-view generation quality and efficiency. 1. Fine-tune a text-to-video generator (Emu Video) on synthetic 3D data to generate consistent multi-view sequences. 2. Directly fit a 3D Gaussian splatting model to the generated views using robust image-based losses (LPIPS, MS-SSIM). 3. Iteratively refine the 3D model by feeding back rendered views to the video generator. Significantly reduces the number of 2D generator evaluations compared to SDS (10-100x faster). Achieves state-of-the-art text/image-to-3D generation quality, outperforming existing methods in faithfulness and visual quality. Enables fast and robust 3D reconstruction without requiring training of large reconstruction networks. Struggles with highly dynamic subjects, sometimes generating spurious animations. Relies on a synthetic dataset for training the multi-view video generator. text-to-3d, image-to-3d, video generation, gaussian splatting, multi-view consistency
2402.08680 Report Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance Linxi Zhao, Yihe Deng, Weitong Zhang, Quanquan Gu The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existing objects in the images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training/fine-tuning or API access to advanced LLMs to correct the model's output post-generation. In this paper, we tackle this challenge by introducing a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE), which is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process. Specifically, MARINE enriches the visual context of LVLMs by integrating existing open-source vision models, and employs classifier-free guidance to incorporate the additional object grounding features to improve the precision of LVLMs' generations. Through comprehensive evaluations across $6$ popular LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it not only reduces hallucinations but also improves the detailedness of LVLMs' generations, as assessed by GPT-4V. This paper introduces MARINE, a training-free and API-free framework that mitigates object hallucinations in Large Vision-Language Models (LVLMs) during text generation by integrating object grounding features. Object hallucination, a critical issue in LVLMs where non-existing objects are described, compromises the accuracy and reliability of these models, especially in safety-critical applications. MARINE enriches the visual context of LVLMs by incorporating object grounding features from a pre-trained object detection model (DETR) and employs classifier-free guidance to control text generation, placing more importance on the enriched visual features. MARINE significantly reduces object hallucinations as measured by CHAIR and POPE metrics, outperforming existing methods. The framework enhances the detailedness of LVLMs' generations, as assessed by GPT-4V. MARINE strikes a balance between reducing hallucinations, maintaining computational efficiency, and preserving LLM originality. While the paper demonstrates MARINE with DETR, exploring other advanced vision encoders could further enhance its performance. Further evaluation of MARINE on a wider range of benchmarks would be beneficial. large vision-language models, object hallucination, classifier-free guidance, object grounding, multi-modal generation
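For the MARINE entry above, the guided decoding step is the part that generalizes: at each generation step, logits computed with the extra object-grounding context are contrasted against logits computed without it. The sketch below is a generic classifier-free guidance step for text decoding; the exact conditioning interface, guidance scale, and greedy decoding are assumptions, not MARINE's implementation details.

```python
import torch

@torch.no_grad()
def guided_next_token(logits_with_grounding, logits_without, guidance_scale=1.5):
    """Both inputs: [batch, vocab] next-token logits; returns greedy token ids."""
    guided = logits_without + guidance_scale * (logits_with_grounding - logits_without)
    return guided.argmax(dim=-1)

# Inside a generation loop one would run the LVLM twice per step, once with the grounding
# features included in the visual context and once without, then pick the guided token.
```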
2402.08678 Report Graph Mamba: Towards Learning on Graphs with State Space Models Ali Behrouz, Farnoosh Hashemi Graph Neural Networks (GNNs) have shown promising potential in graph representation learning. The majority of GNNs define a local message-passing mechanism, propagating information over the graph by stacking multiple layers. These methods, however, are known to suffer from two major limitations: over-squashing and poor capturing of long-range dependencies. Recently, Graph Transformers (GTs) emerged as a powerful alternative to Message-Passing Neural Networks (MPNNs). GTs, however, have quadratic computational cost, lack inductive biases on graph structures, and rely on complex Positional/Structural Encodings (SE/PE). In this paper, we show that while Transformers, complex message-passing, and SE/PE are sufficient for good performance in practice, neither is necessary. Motivated by the recent success of State Space Models (SSMs), such as Mamba, we present Graph Mamba Networks (GMNs), a general framework for a new class of GNNs based on selective SSMs. We discuss and categorize the new challenges when adapting SSMs to graph-structured data, and present four required and one optional steps to design GMNs, where we choose (1) Neighborhood Tokenization, (2) Token Ordering, (3) Architecture of Bidirectional Selective SSM Encoder, (4) Local Encoding, and dispensable (5) PE and SE. We further provide theoretical justification for the power of GMNs. Experiments demonstrate that despite much less computational cost, GMNs attain an outstanding performance in long-range, small-scale, large-scale, and heterophilic benchmark datasets. Presents Graph Mamba Networks (GMNs), a novel graph learning framework based on selective State Space Models (SSMs) like Mamba, to address limitations of Graph Neural Networks (GNNs) and Graph Transformers (GTs) in capturing long-range dependencies and scalability. GNNs struggle with long-range dependencies and GTs have high computational cost. GMNs offer an efficient and effective alternative. Introduces a 5-step recipe: (1) Tokenization: mapping the graph into a sequence of node/subgraph tokens. (2) Optional PE/SE: incorporating positional/structural encodings. (3) Local Encoding: encoding local structures around each node. (4) Token Ordering: ordering the sequence of tokens. (5) (Stack of) Bidirectional Mamba: scanning and selectively incorporating relevant nodes/subgraphs into hidden states. GMNs outperform baselines on benchmarks for long-range, small-scale, large-scale, and heterophilic graph datasets. A variant of GMNs without complex components like Transformers, message-passing, and PE/SE achieves competitive performance, challenging their perceived necessity. GMNs demonstrate superior memory efficiency compared to GTs, particularly on large graphs. The search space of hyperparameters is not fully explored, relying on a subspace for preliminary results. Future work can investigate the integration of more sophisticated token ordering techniques. graph neural networks, graph transformers, state space models, mamba, long-range dependencies
2402.08657 Report PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek, Yuki M. Asano Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons. This paper introduces PIN (Positional Insert), a lightweight learnable spatial prompt, to unlock zero-shot object localisation abilities in frozen caption-based Vision Language Models (VLMs). Existing VLMs, primarily trained on image-caption pairs, struggle with object localisation due to a lack of explicit spatial grounding in their training data. This work aims to address this limitation and enhance VLMs' spatial understanding. PIN is a spatial prompt added to the vision encoder's output, trained on synthetic data with a next-token prediction task to generate bounding box coordinates. This eliminates the need for supervised detection data or architectural changes to the VLM. PIN significantly improves object localisation in OpenFlamingo and BLIP-2 VLMs, outperforming baselines like in-context learning and other PEFT methods. The method generalizes well to diverse images, including paintings, cartoons, and photos from COCO, PVOC, and LVIS datasets. PIN shows promising zero-shot grounding capabilities on RefCOCO, achieving decent performance without using any annotated training data for this dataset. The model struggles with tight bounding box generation, especially for small objects, due to the low input resolution and simplistic training. Localising multiple instances of the same object remains a challenge. vision-language models, object localisation, zero-shot learning, spatial prompt, synthetic data
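For the PIN entry above, the insert itself is just a learnable tensor added to the frozen vision encoder's patch tokens before they reach the frozen language model. The sketch below assumes token counts and dimensions; the training comments describe the next-token objective on synthetic coordinate strings at a high level rather than the authors' exact setup.

```python
import torch
import torch.nn as nn

class PositionalInsert(nn.Module):
    def __init__(self, n_patches=256, dim=1024, scale=0.02):
        super().__init__()
        self.pin = nn.Parameter(scale * torch.randn(1, n_patches, dim))  # the only trainable part

    def forward(self, patch_tokens):          # patch_tokens: [B, n_patches, dim] from the frozen encoder
        return patch_tokens + self.pin        # the same input-agnostic offset is injected for every image

# Training sketch: only `pin` receives gradients, e.g.
#   pin_module = PositionalInsert()
#   optimizer = torch.optim.AdamW(pin_module.parameters(), lr=1e-3)
#   loss = next-token cross-entropy on bounding-box targets (e.g. "<obj> x1 y1 x2 y2") from synthetic data
```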
2402.08654 Report Learning Continuous 3D Words for Text-to-Image Generation Ta-Ying Cheng, Matheus Gadelha, Thibault Groueix, Matthew Fisher, Radomir Mech, Andrew Markham, Niki Trigoni Current controls over diffusion models (e.g., through text or ControlNet) for image generation fall short in recognizing abstract, continuous attributes like illumination direction or non-rigid shape change. In this paper, we present an approach for allowing users of text-to-image models to have fine-grained control of several attributes in an image. We do this by engineering special sets of input tokens that can be transformed in a continuous manner -- we call them Continuous 3D Words. These attributes can, for example, be represented as sliders and applied jointly with text prompts for fine-grained control over image generation. Given only a single mesh and a rendering engine, we show that our approach can be adopted to provide continuous user control over several 3D-aware attributes, including time-of-day illumination, bird wing orientation, dollyzoom effect, and object poses. Our method is capable of conditioning image creation with multiple Continuous 3D Words and text descriptions simultaneously while adding no overhead to the generative process. Project Page: https://ttchengab.github.io/continuous_3d_words Introduces 'Continuous 3D Words', special tokens for text-to-image models enabling fine-grained control over continuous 3D attributes like illumination, non-rigid shape change, orientation, and camera parameters. Current text-based and ControlNet image generation methods struggle to recognize and manipulate abstract, continuous 3D attributes. This work aims to bridge this gap by integrating the precision of 3D control with the accessibility of text-to-image models. Trains a continuous vocabulary using a two-stage fine-tuning approach. First, Dreambooth learns the object identity from a single 3D mesh. Second, an MLP maps continuous attribute values to token embeddings, disentangling attributes from object identity. ControlNet augmentation with depth/lineart conditions enhances background and texture diversity. Quantitative user studies demonstrate superior performance of 'Continuous 3D Words' over ControlNet baselines in controlling various attributes. The method generalizes well, enabling attribute control on objects semantically similar to the training mesh. Enables simultaneous control of multiple attributes, enhancing the expressiveness of text-to-image generation. User study reveals a preference for condition accuracy over physical plausibility in some cases. Current model faces challenges with style transfer from text prompts and occasional overfitting to training mesh attributes. text-to-image generation, continuous control, 3d attributes, diffusion models, controlnet
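For the Continuous 3D Words entry above, the core mechanism is a small MLP that maps a continuous attribute value to a token embedding appended to the prompt's text embeddings. The sketch below assumes embedding dimensions and a single scalar attribute; it is an illustration of the interface, not the paper's trained module.

```python
import torch
import torch.nn as nn

class Continuous3DWord(nn.Module):
    def __init__(self, token_dim=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, token_dim))

    def forward(self, value, prompt_embeds):
        """value: [B, 1] normalized attribute (e.g. illumination azimuth in [0, 1]);
        prompt_embeds: [B, T, token_dim] text-encoder output."""
        word = self.mlp(value).unsqueeze(1)                 # [B, 1, token_dim] continuous "word"
        return torch.cat([prompt_embeds, word], dim=1)      # conditioning grows by one slidable token
```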
2402.08622 Report NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs Michael Fischer, Zhengqin Li, Thu Nguyen-Phuoc, Aljaz Bozic, Zhao Dong, Carl Marshall, Tobias Ritschel A Neural Radiance Field (NeRF) encodes the specific relation of 3D geometry and appearance of a scene. We here ask the question whether we can transfer the appearance from a source NeRF onto a target 3D geometry in a semantically meaningful way, such that the resulting new NeRF retains the target geometry but has an appearance that is an analogy to the source NeRF. To this end, we generalize classic image analogies from 2D images to NeRFs. We leverage correspondence transfer along semantic affinity that is driven by semantic features from large, pre-trained 2D image models to achieve multi-view consistent appearance transfer. Our method allows exploring the mix-and-match product space of 3D geometry and appearance. We show that our method outperforms traditional stylization-based methods and that a large majority of users prefer our method over several typical baselines. Introduces "NeRF analogies", a method for transferring visual appearance between NeRFs based on semantic affinity derived from ViT features. Addresses limitations of existing NeRF editing techniques by enabling combined, multi-view consistent, and semantically meaningful appearance transfer onto arbitrary 3D geometry. Leverages DiNO-ViT features to establish dense correspondences between source and target NeRF renderings, then trains a new NeRF to combine the target geometry with the transferred source appearance. Outperforms traditional stylization and image-analogy methods in transferring appearance while preserving semantic consistency. Demonstrates superior multi-view consistency compared to 2D-based approaches, resulting in fewer artifacts and floaters. Exhibits strong performance in user studies, with participants significantly preferring the method's output for its quality and semantic coherence. Reliance on accurate feature correspondences limits applicability to objects with rotational ambiguities or complex textures. Inability to transfer texture due to the point-based appearance transfer approach. nerf, appearance transfer, semantic editing, vision transformer, 3d deep learning
2402.08601 Report Latent Inversion with Timestep-aware Sampling for Training-free Non-rigid Editing Yunji Jung, Seokju Lee, Tair Djanibekov, Hyunjung Shim, Jong Chul Ye Text-guided non-rigid editing involves complex edits for input images, such as changing motion or compositions within their surroundings. Since it requires manipulating the input structure, existing methods often struggle with preserving object identity and background, particularly when combined with Stable Diffusion. In this work, we propose a training-free approach for non-rigid editing with Stable Diffusion, aimed at improving the identity preservation quality without compromising editability. Our approach comprises three stages: text optimization, latent inversion, and timestep-aware text injection sampling. Inspired by the recent success of Imagic, we employ their text optimization for smooth editing. Then, we introduce latent inversion to preserve the input image's identity without additional model fine-tuning. To fully utilize the input reconstruction ability of latent inversion, we suggest timestep-aware text inject sampling. This effectively retains the structure of the input image by injecting the source text prompt in early sampling steps and then transitioning to the target prompt in subsequent sampling steps. This strategic approach seamlessly harmonizes with text optimization, facilitating complex non-rigid edits to the input without losing the original identity. We demonstrate the effectiveness of our method in terms of identity preservation, editability, and aesthetic quality through extensive experiments. This paper proposes a training-free method for text-guided non-rigid image editing with Stable Diffusion, improving identity preservation without compromising editability. Non-rigid editing with existing methods often struggle with preserving object identity and background, especially in Stable Diffusion, limiting practical applications. The method utilizes text optimization for smooth editing, latent inversion for identity preservation, and timestep-aware text injection sampling for balancing identity and editability. Outperforms baselines in qualitative comparisons, demonstrating superior identity preservation and edit fidelity. Quantitative evaluation shows higher CLIP and Aesthetic scores, indicating better alignment with target text and improved aesthetics. Ablation studies confirm the effectiveness of each component, particularly latent inversion and timestep-aware sampling. Limitations exist in handling compositions with multiple objects and preserving high-frequency details. Future work includes exploring faster inversion methods and improving compositional editing capabilities. image editing, non-rigid editing, stable diffusion, latent inversion, text optimization
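For the entry above, timestep-aware text injection is easiest to see as a prompt switch inside the sampling loop: early, structure-defining steps use the source prompt, later steps use the (optimized) target prompt. The sketch below assumes a diffusers-style UNet/scheduler interface and an arbitrary switch ratio; the inversion and text-optimization stages are not shown.

```python
import torch

@torch.no_grad()
def sample_with_text_injection(unet, scheduler, latents, src_embeds, tgt_embeds, switch_ratio=0.3):
    timesteps = scheduler.timesteps
    n_switch = int(switch_ratio * len(timesteps))           # early steps keep the source prompt
    for i, t in enumerate(timesteps):
        cond = src_embeds if i < n_switch else tgt_embeds   # inject target prompt only later
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```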
2402.08577 Report Test-Time Backdoor Attacks on Multimodal Large Language Models Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin Backdoor attacks are commonly executed by contaminating training data, such that a trigger can activate predetermined harmful effects during the test phase. In this work, we present AnyDoor, a test-time backdoor attack against multimodal large language models (MLLMs), which involves injecting the backdoor into the textual modality using adversarial test images (sharing the same universal perturbation), without requiring access to or modification of the training data. AnyDoor employs similar techniques used in universal adversarial attacks, but distinguishes itself by its ability to decouple the timing of setup and activation of harmful effects. In our experiments, we validate the effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, as well as provide comprehensive ablation studies. Notably, because the backdoor is injected by a universal perturbation, AnyDoor can dynamically change its backdoor trigger prompts/harmful effects, exposing a new challenge for defending against backdoor attacks. Our project page is available at https://sail-sg.github.io/AnyDoor/. The paper introduces "AnyDoor," a novel test-time backdoor attack against multimodal large language models (MLLMs) that injects backdoors during the test phase by leveraging adversarial test images, eliminating the need for training data manipulation. This work exposes a significant security vulnerability in MLLMs, demonstrating that their multimodal capabilities can be exploited for malicious purposes even without access to training data. AnyDoor employs techniques similar to universal adversarial attacks, generating a universal perturbation applied to input images that triggers harmful effects when a specific textual prompt is provided to the MLLM. AnyDoor successfully attacks popular MLLMs like LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2 across various datasets. The attack remains effective with variations in trigger prompts and harmful outputs, posing challenges for defense mechanisms. The authors demonstrate AnyDoor's robustness under common corruptions and its applicability in dynamic video scenarios. The current work mainly focuses on vision-language MLLMs. Investigating other modalities like audio/speech is left for future work. While the physical demonstrations are currently conceptual, future research should explore robust defense mechanisms to mitigate potential real-world threats. multimodal large language models, test-time backdoor attacks, adversarial attacks, universal perturbations, model security
2402.08265 Report A Dense Reward View on Aligning Text-to-Image Diffusion with Preference Shentao Yang, Tianqi Chen, Mingyuan Zhou Aligning text-to-image diffusion model (T2I) with preference has been gaining increasing research attention. While prior works exist on directly optimizing T2I by preference data, these methods are developed under the bandit assumption of a latent reward on the entire diffusion reverse chain, while ignoring the sequential nature of the generation process. This may harm the efficacy and efficiency of preference alignment. In this paper, we take on a finer dense reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into DPO-style explicit-reward-free objectives, to break the temporal symmetry therein and suit the T2I generation hierarchy. In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further investigations are conducted to illustrate the insight of our approach. This paper introduces a novel method for aligning text-to-image diffusion models with preference by adopting a dense reward perspective and incorporating temporal discounting. The traditional trajectory-level reward assumption used in DPO-style methods for text-to-image diffusion models leads to a large decision space and the sparse reward problem, hampering training effectiveness and efficiency. This paper addresses this issue by considering a finer, dense reward structure. The authors derive a tractable alignment objective by assuming a latent reward function for each step of the diffusion reverse chain and introducing a temporal discount factor. This approach breaks the temporal symmetry in DPO-style losses and emphasizes the initial steps of the generation process, which are crucial for establishing image outlines and high-level attributes. The resulting objective is a lower bound of a Bradley-Terry preference model, leading to a tractable loss for training the model in an explicit-reward-free manner. The method achieves competitive quantitative and qualitative results on single and multiple prompt generation tasks, surpassing strong baselines in terms of preference-generating metrics (ImageReward and HPSv2) and unseen Aesthetic scores. Further investigation reveals that the method effectively generates desired image shapes earlier in the reverse chain, supporting the hypothesis that emphasizing initial steps leads to improved final image quality. Ablation studies demonstrate the impact of the temporal discount factor and the robustness of the method to the choice of the KL coefficient. The iterative data collection and model training procedure inherent to the off-policy learning routine introduces additional complexity and costs compared to purely offline methods. Storing the entire generation reverse chains, as opposed to only the final images, increases CPU memory and storage requirements. text-to-image diffusion model, preference alignment, dense reward, direct preference optimization (dpo), sequential generation
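For the dense-reward entry above, the flavor of the objective can be conveyed with a DPO-style preference loss over denoising steps whose per-step contributions are temporally discounted, so early, outline-setting steps of the reverse chain carry more weight. The sketch below is my own simplification of that idea, not the paper's exact objective; input shapes, the discount schedule, and beta are assumptions.

```python
import torch
import torch.nn.functional as F

def discounted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=1.0, gamma=0.95):
    """Each input: [B, T] per-step log-probs of the preferred (w) / dispreferred (l)
    reverse-chain transitions under the trained and reference models."""
    T = logp_w.shape[1]
    weights = gamma ** torch.arange(T, dtype=logp_w.dtype)        # step 0 (most noisy) weighted highest
    adv_w = ((logp_w - ref_logp_w) * weights).sum(dim=1)          # discounted implicit reward, preferred
    adv_l = ((logp_l - ref_logp_l) * weights).sum(dim=1)          # discounted implicit reward, dispreferred
    return -F.logsigmoid(beta * (adv_w - adv_l)).mean()
```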
2402.08018 Report Nearest Neighbour Score Estimators for Diffusion Generative Models Matthew Niedoba, Dylan Green, Saeid Naderiparizi, Vasileios Lioutas, Jonathan Wilder Lavington, Xiaoxuan Liang, Yunpeng Liu, Ke Zhang, Setareh Dabiri, Adam Ścibior, Berend Zwartsenberg, Frank Wood Score function estimation is the cornerstone of both training and sampling from diffusion generative models. Despite this fact, the most commonly used estimators are either biased neural network approximations or high variance Monte Carlo estimators based on the conditional score. We introduce a novel nearest neighbour score function estimator which utilizes multiple samples from the training set to dramatically decrease estimator variance. We leverage our low variance estimator in two compelling applications. Training consistency models with our estimator, we report a significant increase in both convergence speed and sample quality. In diffusion models, we show that our estimator can replace a learned network for probability-flow ODE integration, opening promising new avenues of future research. This paper introduces a novel nearest neighbour score function estimator for diffusion generative models, leveraging multiple training samples to reduce variance. Score function estimation is crucial for training and sampling in diffusion models, but existing methods suffer from bias (neural networks) or high variance (Monte Carlo). The method uses self-normalized importance sampling with a proposal distribution based on k-nearest neighbors in the training set, exploiting the Gaussian nature of diffusion processes. The proposed estimator exhibits near-zero variance and bias, outperforming existing estimators and even a near-SoTA diffusion model on CIFAR-10. Using the estimator in consistency models leads to faster convergence and better sample quality compared to single-sample baselines. The estimator enables general probability flow ODE traversal and highlights the role of network bias in diffusion model generalization. The paper primarily focuses on the EDM diffusion process, with generalization to other processes requiring further investigation. While the l2 distance used for nearest neighbour search is computationally efficient, exploring alternative metric spaces might further improve performance. diffusion models, score function estimation, nearest neighbours, importance sampling, generative models
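For the nearest-neighbour score estimator entry above, the key quantity is the score of the Gaussian-smoothed data distribution, which is a softmax-weighted average of directions toward training samples; restricting the sum to the k nearest neighbours is what makes it cheap. The sketch below is a brute-force numpy illustration under those assumptions (no proposal-distribution machinery or batching from the paper).

```python
import numpy as np

def knn_score_estimate(x_t, train, sigma, k=64):
    """x_t: [D] noisy sample, train: [N, D] training set, sigma: noise level.
    Estimates the score of p_t(x) = (1/N) * sum_i N(x; x_i, sigma^2 I)."""
    d2 = ((train - x_t) ** 2).sum(axis=1)                 # squared distances to all training points
    nn_idx = np.argpartition(d2, k)[:k]                   # k nearest neighbours as the support set
    logw = -d2[nn_idx] / (2 * sigma ** 2)                 # Gaussian log-likelihoods (up to a constant)
    w = np.exp(logw - logw.max()); w /= w.sum()           # self-normalized weights
    mean = (w[:, None] * train[nn_idx]).sum(axis=0)       # posterior-mean estimate of the clean sample
    return (mean - x_t) / sigma ** 2                      # score of the smoothed mixture at x_t
```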
2402.07562 Report Discovering Universal Semantic Triggers for Text-to-Image Synthesis Shengfang Zhai, Weilong Wang, Jiajun Li, Yinpeng Dong, Hang Su, Qingni Shen Recently text-to-image models have gained widespread attention in the community due to their controllable and high-quality generation ability. However, the robustness of such models and their potential ethical issues have not been fully explored. In this paper, we introduce Universal Semantic Trigger, a meaningless token sequence that can be added at any location within the input text yet can induce generated images towards a preset semantic target. To thoroughly investigate it, we propose Semantic Gradient-based Search (SGS) framework. SGS automatically discovers the potential universal semantic triggers based on the given semantic targets. Furthermore, we design evaluation metrics to comprehensively evaluate semantic shift of images caused by these triggers. And our empirical analyses reveal that the mainstream open-source text-to-image models are vulnerable to our triggers, which could pose significant ethical threats. Our work contributes to a further understanding of text-to-image synthesis and helps users to automatically audit their models before deployment. This paper introduces 'Universal Semantic Triggers,' meaningless token sequences that can be inserted into text prompts for text-to-image models, causing the generated images to exhibit specific, pre-determined semantic features. This is important because it reveals a vulnerability in text-to-image models that could be exploited to generate harmful or sensitive content, bypassing existing safety measures like text filters. The authors propose a 'Semantic Gradient-based Search (SGS)' framework. SGS uses a gradient-based approach to automatically discover these trigger sequences by minimizing the distance in the text encoder's embedding space between trigger-inserted text and text explicitly describing the target semantic. Experiments demonstrate the effectiveness of these triggers across various text-to-image models (Stable Diffusion versions, Latent Diffusion) and even online platforms like Midjourney. The triggers exhibit a degree of position insensitivity, remaining effective even when inserted at different locations within the text prompt. The authors demonstrate the potential for increased harm through 'ensemble triggers,' where multiple trigger sequences are combined to imbue images with multiple semantic features simultaneously. While the paper demonstrates the existence and potential dangers of these triggers, it doesn't offer concrete mitigation strategies. The evaluation of 'harmful' or 'sensitive' content relies heavily on user studies and subjective judgment, which can be inherently biased. text-to-image synthesis, adversarial attacks, semantic triggers, ethical ai, clip
2402.07384 Report Exploring Perceptual Limitation of Multimodal Large Language Models Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun Multimodal Large Language Models (MLLMs) have recently shown remarkable perceptual capability in answering visual questions, however, little is known about the limits of their perception. In particular, while prior works have provided anecdotal evidence of MLLMs' sensitivity to object size, this phenomenon and its underlying causes have not been explored comprehensively. In this work, we quantitatively study the perception of small visual objects in several state-of-the-art MLLMs and reveal a pervasive limitation in answering questions about small objects in images. Next, we identify four independent factors that can contribute to this limitation -- object quality, size, distractors, and location -- and conduct controlled intervention studies to measure the effect of each factor on MLLMs' perception. In particular, we find that lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions. More surprisingly, we find that the location of the object in the image and the presence of visual distractors can also significantly reduce MLLMs' question answering accuracy. Our study provides a better understanding of the perceptual limitation of MLLMs and contributes new evaluation protocols for analyzing the perception of future MLLMs. To facilitate further investigations, we release our code and data. This paper reveals a perceptual limitation in Multimodal Large Language Models (MLLMs) when perceiving small objects and investigates the impact of object quality, size, distractors, and location on this limitation. This work provides a deeper understanding of the perceptual limitations of MLLMs, which is crucial for both practical applications and future model development. It also introduces a new evaluation protocol for analyzing the perception of future MLLMs. The authors conduct controlled experiments on five open-source MLLMs using synthetic images of digital texts with varying quality, size, distractor presence, and location. The evaluation focuses on text-reading ability using Gestalt Pattern Matching. Object quality (sampling rate) significantly impacts performance up to a threshold, beyond which performance stabilizes. This threshold aligns with human perception. Smaller object size, even with sufficient quality, reduces performance in most MLLMs. Models trained on datasets with smaller objects show less sensitivity to size variations. The presence of visual distractors and the object's location within the image significantly affect the performance of MLLMs. The study primarily uses synthetic digital texts for evaluation, potentially limiting the generalizability of findings to other visual tasks. Further investigation is needed to understand the specific mechanisms within MLLMs that contribute to the observed limitations, particularly the impact of object location. multimodal large language models, perception, small objects, visual question answering, robustness analysis
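For the perceptual-limitation entry above, the controlled intervention study hinges on synthetic probe images where object size and location vary independently. The sketch below is a hypothetical, minimal way to build such probes with Pillow (the text string, canvas size, and font handling are my assumptions, not the paper's exact generation script).

```python
from PIL import Image, ImageDraw, ImageFont

def make_probe(text="3712", canvas_size=336, target_height=24, position=(40, 200)):
    """Render `text` at a controlled pixel height and location on a white canvas."""
    glyphs = Image.new("L", (256, 64), 0)                               # black scratch canvas
    ImageDraw.Draw(glyphs).text((4, 4), text, fill=255, font=ImageFont.load_default())
    glyphs = glyphs.crop(glyphs.getbbox())                              # tight box around the rendered text
    scale = target_height / glyphs.height
    glyphs = glyphs.resize((max(1, int(glyphs.width * scale)), target_height))
    canvas = Image.new("RGB", (canvas_size, canvas_size), "white")
    canvas.paste((0, 0, 0), position, mask=glyphs)                      # stamp black text via the glyph mask
    return canvas

# e.g. vary `target_height` for the size study and `position` for the location study
img = make_probe(target_height=12, position=(250, 30))
```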
2402.07370 Report SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder Jaeseong Lee, Junha Hyung, Sohyun Jeong, Jaegul Choo Face swapping has gained significant attention for its varied applications. The majority of previous face swapping approaches have relied on the seesaw game training scheme, which often leads to the instability of the model training and results in undesired samples with blended identities due to the target identity leakage problem. This paper introduces the Shape Agnostic Masked AutoEncoder (SAMAE) training scheme, a novel self-supervised approach designed to enhance face swapping model training. Our training scheme addresses the limitations of traditional training methods by circumventing the conventional seesaw game and introducing clear ground truth through its self-reconstruction training regime. It effectively mitigates identity leakage by masking facial regions of the input images and utilizing learned disentangled identity and non-identity features. Additionally, we tackle the shape misalignment problem with new techniques including perforation confusion and random mesh scaling, and establishes a new state-of-the-art, surpassing other baseline methods, preserving both identity and non-identity attributes, without sacrificing on either aspect. This paper proposes Shape Agnostic Masked AutoEncoder (SAMAE), a novel self-supervised training scheme for face swapping that mitigates identity leakage and enhances training stability. Existing face swapping methods rely on an unstable seesaw game training scheme, leading to identity blending and the need for extensive hyperparameter tuning. SAMAE uses self-reconstruction with face-masked images, disentangled identity and non-identity features, and introduces perforation confusion and random mesh scaling to improve cross-identity swapping. SAMAE outperforms state-of-the-art methods in identity preservation, attribute fidelity, and overall image realism. Perforation confusion and random mesh scaling are crucial for handling shape misalignment and volume discrepancies between source and target faces. Disentangling skin color from identity embeddings improves the realism of the swapped faces. The model's performance is limited by the accuracy of the 3DMM estimator, particularly for exaggerated expressions. Future work could explore incorporating stronger generative priors like StyleGAN or diffusion models. face swapping, self-supervised learning, identity leakage, 3d morphable model, generative adversarial networks
2402.07207 Report GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang We present GALA3D, generative 3D GAussians with LAyout-guided control, for effective compositional text-to-3D generation. We first utilize large language models (LLMs) to generate the initial layout and introduce a layout-guided 3D Gaussian representation for 3D content generation with adaptive geometric constraints. We then propose an object-scene compositional optimization mechanism with conditioned diffusion to collaboratively generate realistic 3D scenes with consistent geometry, texture, scale, and accurate interactions among multiple objects while simultaneously adjusting the coarse layout priors extracted from the LLMs to align with the generated scene. Experiments show that GALA3D is a user-friendly, end-to-end framework for state-of-the-art scene-level 3D content generation and controllable editing while ensuring the high fidelity of object-level entities within the scene. Source codes and models will be available at https://gala3d.github.io/. This paper introduces GALA3D, a novel layout-guided generative Gaussian splatting framework for generating complex 3D scenes from text descriptions. Existing text-to-3D methods struggle to generate complex scenes with multiple objects and their interactions. GALA3D addresses this by leveraging layout priors and compositional optimization for enhanced control and fidelity. GALA3D first uses LLMs to interpret text into coarse layouts. Then, it introduces a layout-guided Gaussian representation and utilizes adaptive geometry control to optimize the shape and distribution of Gaussians. A compositional optimization strategy with diffusion priors is employed to generate the final 3D scene, while a layout refinement module iteratively improves the LLM-generated layouts. GALA3D outperforms existing NeRF-based, voxel-based, and 3DGS-based methods in text-to-3D scene generation, achieving higher CLIP scores and better visual quality. User studies confirm that GALA3D generates higher-quality 3D scenes with better geometry, text alignment, and consistency compared to other SOTA approaches. GALA3D supports interactive editing of generated scenes through textual conversations, enabling users to easily modify object placement, add/remove objects, and adjust styles. The reliance on LLMs for layout interpretation can introduce errors due to the LLMs' limited 3D scene understanding. Further research can explore incorporating more detailed semantic information and object relationships into the layout representation for enhanced control. text-to-3d generation, generative gaussian splatting, layout-guided generation, compositional 3d generation, large language models
2402.07181 Report 3D Gaussian as a New Vision Era: A Survey Ben Fei, Jingyi Xu, Rui Zhang, Qingyuan Zhou, Weidong Yang, Ying He 3D Gaussian Splatting (3D-GS) has emerged as a significant advancement in the field of Computer Graphics, offering explicit scene representation and novel view synthesis without the reliance on neural networks, such as Neural Radiance Fields (NeRF). This technique has found diverse applications in areas such as robotics, urban mapping, autonomous navigation, and virtual reality/augmented reality, to name just a few. Given the growing popularity and expanding research in 3D Gaussian Splatting, this paper presents a comprehensive survey of relevant papers from the past year. We organize the survey into taxonomies based on characteristics and applications, providing an introduction to the theoretical underpinnings of 3D Gaussian Splatting. Our goal through this survey is to acquaint new researchers with 3D Gaussian Splatting, serve as a valuable reference for seminal works in the field, and inspire future research directions, as discussed in our concluding section. This paper presents a comprehensive survey of 3D Gaussian Splatting (3D-GS) research from the past year, categorizing advancements and applications to guide new researchers and inspire future research directions. 3D-GS has emerged as a powerful technique in computer graphics for efficiently rendering complex scenes, offering explicit scene representation and novel view synthesis without relying on neural networks like NeRF. The paper organizes research into taxonomies based on characteristics (efficiency, realism, cost, physics) and applications (reconstruction, manipulation, generation, perception, and virtual humans). Various methods have been proposed to compress 3D Gaussian representations, improve rendering realism by addressing aliasing and incorporating physics-based rendering, and reduce the number of images needed for novel view synthesis. 3D-GS has shown promise in tasks like mesh reconstruction, text-guided scene manipulation, single/multi-view 3D generation, semantic object detection, dynamic scene tracking, and virtual human avatar creation. Researchers are actively exploring real-time rendering of dynamic scenes, incorporating accurate physics simulations, and expanding 3D-GS capabilities by integrating with large foundation models. Current 3D-GS methods face challenges in handling floating elements, balancing rendering and reconstruction quality, and achieving realistic generation with accurate textures and geometry. Future work could focus on addressing these challenges, improving performance in few-shot scenarios, and exploring applications in areas like robotics and autonomous vehicles. 3d gaussian splatting, 3d-gs, computer graphics, novel view synthesis, 3d scene reconstruction
2402.06149 Report HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting Zhenglin Zhou, Fan Ma, Hehe Fan, Yi Yang Creating digital avatars from textual prompts has long been a desirable yet challenging task. Despite the promising outcomes obtained through 2D diffusion priors in recent works, current methods face challenges in achieving high-quality and animated avatars effectively. In this paper, we present HeadStudio, a novel framework that utilizes 3D Gaussian splatting to generate realistic and animated avatars from text prompts. Our method drives 3D Gaussians semantically to create a flexible and achievable appearance through the intermediate FLAME representation. Specifically, we incorporate the FLAME into both 3D representation and score distillation: 1) FLAME-based 3D Gaussian splatting, driving 3D Gaussian points by rigging each point to a FLAME mesh. 2) FLAME-based score distillation sampling, utilizing FLAME-based fine-grained control signal to guide score distillation from the text prompt. Extensive experiments demonstrate the efficacy of HeadStudio in generating animatable avatars from textual prompts, exhibiting visually appealing appearances. The avatars are capable of rendering high-quality real-time (≥ 40 fps) novel views at a resolution of 1024. They can be smoothly controlled by real-world speech and video. We hope that HeadStudio can advance digital avatar creation and that the present method can widely be applied across various domains. HeadStudio, a novel framework leveraging 3D Gaussian splatting to generate realistic and animatable head avatars from text prompts. Current text-based avatar generation methods struggle to effectively combine high-fidelity appearance with smooth animation. HeadStudio incorporates FLAME, a statistical head model, to semantically align 3D Gaussian points and guide score distillation from text prompts using: 1) FLAME-based 3D Gaussian Splatting (F-3DGS) for deformation, and 2) FLAME-based Score Distillation Sampling (F-SDS) for knowledge distillation. Generates high-fidelity head avatars surpassing state-of-the-art methods in visual quality. Achieves effective semantic alignment for smooth and accurate animation of facial expressions. Enables real-time rendering at ≥ 40 fps, suitable for augmented and virtual reality applications. Limited to head avatar generation, further research is needed for full-body avatars. Relies on pre-trained diffusion models, inheriting potential biases and limitations. text-to-3d, avatar generation, 3d gaussian splatting, score distillation, flame
2402.06117 Report Spatially-Attentive Patch-Hierarchical Network with Adaptive Sampling for Motion Deblurring Maitreya Suin, Kuldeep Purohit, A. N. Rajagopalan This paper tackles the problem of motion deblurring of dynamic scenes. Although end-to-end fully convolutional designs have recently advanced the state-of-the-art in non-uniform motion deblurring, their performance-complexity trade-off is still sub-optimal. Most existing approaches achieve a large receptive field by increasing the number of generic convolution layers and kernel size. In this work, we propose a pixel adaptive and feature attentive design for handling large blur variations across different spatial locations and process each test image adaptively. We design a content-aware global-local filtering module that significantly improves performance by considering not only global dependencies but also by dynamically exploiting neighboring pixel information. We further introduce a pixel-adaptive non-uniform sampling strategy that implicitly discovers the difficult-to-restore regions present in the image and, in turn, performs fine-grained refinement in a progressive manner. Extensive qualitative and quantitative comparisons with prior art on deblurring benchmarks demonstrate that our approach performs favorably against the state-of-the-art deblurring algorithms. This paper proposes a Spatially-Attentive Patch-Hierarchical Network with Adaptive Sampling for more efficient and effective motion deblurring of dynamic scenes. Existing end-to-end fully convolutional designs for motion deblurring have sub-optimal performance-complexity trade-offs, struggling to handle large blur variations efficiently. This paper addresses this by introducing spatially adaptive and content-aware mechanisms within a CNN. The paper utilizes a multi-patch hierarchical network with content-aware processing modules that combine global attention and adaptive local filters. It introduces non-uniform pixel-adaptive sampling to prioritize the processing of heavily blurred regions and incorporates progressive image restoration with ground truth supervision at each stage. The approach achieves state-of-the-art deblurring performance on benchmark datasets like GoPro, HIDE, and RealBlur. It offers a better accuracy-speed trade-off than methods relying solely on increasing network depth or filter size. The adaptive sampling strategy is shown to significantly improve performance by efficiently distributing computation based on blur severity. The current implementation relies on custom operations that are less optimized in standard deep learning libraries, leading to slower runtime than some purely convolutional networks despite lower theoretical complexity (GFLOPs). Future work includes exploring the application of the proposed adaptive sampling strategy to other image restoration tasks. image deblurring, spatially adaptive, attention mechanism, adaptive sampling, deep learning
2402.05947 Report Separable Multi-Concept Erasure from Diffusion Models Mengnan Zhao, Lihe Zhang, Tianhang Zheng, Yuqiu Kong, Baocai Yin Large-scale diffusion models, known for their impressive image generation capabilities, have raised concerns among researchers regarding social impacts, such as the imitation of copyrighted artistic styles. In response, existing approaches turn to machine unlearning techniques to eliminate unsafe concepts from pre-trained models. However, these methods compromise the generative performance and neglect the coupling among multi-concept erasures, as well as the concept restoration problem. To address these issues, we propose a Separable Multi-concept Eraser (SepME), which mainly includes two parts: the generation of concept-irrelevant representations and the weight decoupling. The former aims to avoid unlearning substantial information that is irrelevant to forgotten concepts. The latter separates optimizable model weights, making each weight increment correspond to a specific concept erasure without affecting generative performance on other concepts. Specifically, the weight increment for erasing a specified concept is formulated as a linear combination of solutions calculated based on other known undesirable concepts. Extensive experiments indicate the efficacy of our approach in eliminating concepts, preserving model performance, and offering flexibility in the erasure or recovery of various concepts. This paper introduces SepME, a novel machine unlearning technique for diffusion models that enables the flexible erasure and recovery of multiple concepts while preserving overall model performance. Existing methods for removing unsafe or undesirable concepts from pre-trained diffusion models often lead to performance degradation and struggle with multi-concept erasure and restoration. SepME consists of two key components: G-CiRs generates concept-irrelevant representations to preserve model performance, and WD decouples weight increments for individual concept erasure, allowing for flexible concept manipulation. SepME effectively removes targeted concepts while maintaining high generation quality for other concepts. It outperforms baseline methods in terms of both concept erasure and overall model performance. SepME enables flexible manipulation of concepts, including simultaneous multi-concept erasure, iterative concept erasure, and concept restoration. The cosine function as an alternative to the correlation term in SepME did not yield satisfactory results. Future work will focus on exploring different architectures and optimization strategies for further improving SepME's efficiency and effectiveness. machine unlearning, diffusion models, concept erasure, concept restoration, stable diffusion
2402.05937 Report InstaGen: Enhancing Object Detection by Training on Synthetic Dataset Chengjian Feng, Yujie Zhong, Zequn Jie, Weidi Xie, Lin Ma In this paper, we present a novel paradigm to enhance the ability of object detector, e.g., expanding categories or improving detection performance, by training on synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained, generative diffusion model, to augment it with the ability of localising instances in the generated images. The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model, using supervision from an off-the-shelf object detector, and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that, this enhanced version of diffusion model, termed as InstaGen, can serve as a data synthesizer, to enhance object detectors by training on its generated samples, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios. Project page with code: https://fcjian.github.io/InstaGen. This paper presents InstaGen, a novel framework that enhances object detection by training on synthetic datasets generated from diffusion models. InstaGen incorporates an instance-level grounding head into a pre-trained diffusion model, enabling the generation of photo-realistic images with bounding boxes for object instances. Building large-scale object detection datasets is labor-intensive and time-consuming. InstaGen offers a solution by synthesizing high-quality, annotated images, facilitating object detection model development and enhancing their capabilities. The methodology involves (1) fine-tuning a pre-trained Stable Diffusion Model (SDM) on an existing object detection dataset to create an image synthesizer, and (2) training an instance-level grounding head. The grounding head aligns text embeddings of category names with regional visual features from the image synthesizer to predict bounding boxes for objects in synthetic images. InstaGen outperforms state-of-the-art CLIP-based open-vocabulary object detection methods, achieving a +4.5 AP improvement on the COCO benchmark. The synthetic datasets generated by InstaGen are particularly beneficial in data-sparse scenarios, showing significant performance improvement (+1.2 to +5.2 AP) when real training data is limited. InstaGen effectively generalizes to unseen datasets, achieving superior performance in cross-dataset object detection when transferring from COCO to Object365 and LVIS. The synthetic datasets generated by InstaGen may lack the complexity and contextual diversity of real-world scenes, limiting the robustness of trained object detectors. Current diffusion-based generative models, including those used in InstaGen, face challenges in representing and generating images for rare object categories, leading to potential class imbalance during training. object detection, synthetic dataset, diffusion model, open-vocabulary detection, data-sparse detection
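A minimal sketch of the region-text alignment at the core of the grounding head described above: regional visual features are scored against category-name text embeddings by cosine similarity, and such scores can then be supervised by an off-the-shelf detector or self-training. The feature sources, dimensions, and temperature are illustrative assumptions, not InstaGen's actual implementation.

```python
import torch
import torch.nn.functional as F


def region_text_logits(region_feats: torch.Tensor,
                       text_embeds: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    # region_feats: (B, R, D) regional features from the generative backbone
    # text_embeds:  (K, D) embeddings of K category names
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    return torch.einsum("brd,kd->brk", r, t) / temperature


logits = region_text_logits(torch.randn(2, 50, 512), torch.randn(80, 512))
print(logits.shape)  # torch.Size([2, 50, 80]) -> per-region class scores
```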
2402.05892 Report Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data Shufan Li, Harkanwar Singh, Aditya Grover In recent years, Transformers have become the de-facto architecture for sequence modeling on text and a variety of multi-dimensional data, such as images and video. However, the use of self-attention layers in a Transformer incurs prohibitive compute and memory complexity that scales quadratically w.r.t. the sequence length. A recent architecture, Mamba, based on state space models has been shown to achieve comparable performance for modeling text sequences, while scaling linearly with the sequence length. In this work, we present Mamba-ND, a generalized design extending the Mamba architecture to arbitrary multi-dimensional data. Our design alternatively unravels the input data across different dimensions following row-major orderings. We provide a systematic comparison of Mamba-ND with several other alternatives, based on prior multi-dimensional extensions such as Bi-directional LSTMs and S4ND. Empirically, we show that Mamba-ND demonstrates performance competitive with the state-of-the-art on a variety of multi-dimensional benchmarks, including ImageNet-1K classification, HMDB-51 action recognition, and ERA5 weather forecasting. The paper introduces Mamba-ND, a simple yet effective method for extending state space models (specifically, the Mamba architecture) to multi-dimensional data like images and videos. Transformers, while dominant in sequence modeling, have quadratic complexity, making them hard to scale. Mamba, based on state space models, offers linear complexity and competitive performance but lacked multi-dimensional extension, which Mamba-ND addresses. Mamba-ND leverages the efficient 1D Mamba layers and achieves multi-dimensionality by simply alternating the sequence ordering (e.g., height, width, time) across layers. Mamba-ND outperforms Transformers (ViT, Swin) on ImageNet-1K classification, HMDB-51 and UCF-101 action recognition, ERA5 weather forecasting, and BTCV 3D segmentation, often with fewer parameters. Extensive ablations show the alternating-directional design surpasses more complex layer arrangements and scan factorization techniques. Alternating ordering leads to better effective receptive fields compared to uni-directional or bi-directional baselines. The vast design space of possible orderings is not fully explored, with only row-major variations tested. While offering linear complexity, the current implementation of scan factorization leads to memory and runtime overhead, needing future optimization. state space models, multi-dimensional modeling, vision transformers, sequence modeling, mamba
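A minimal sketch of the alternating-ordering idea behind Mamba-ND on a 2D input of shape (B, H, W, C): each layer flattens the grid along a different row-major ordering (and direction) before running a 1D sequence layer. An nn.GRU stands in for the actual Mamba block purely as a placeholder, so this shows only the re-ordering scheme, not the authors' implementation or their exact ordering schedule.

```python
import torch
import torch.nn as nn


class AlternatingOrderBlock(nn.Module):
    def __init__(self, dim: int, layer_idx: int):
        super().__init__()
        self.layer_idx = layer_idx
        self.seq_layer = nn.GRU(dim, dim, batch_first=True)  # stand-in for a Mamba block

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, H, W, C)
        B, H, W, C = x.shape
        if self.layer_idx % 2 == 0:                       # H-major ordering
            seq = x.reshape(B, H * W, C)
        else:                                             # W-major ordering
            seq = x.transpose(1, 2).reshape(B, H * W, C)
        if (self.layer_idx // 2) % 2 == 1:                # reverse direction every other pair
            seq = torch.flip(seq, dims=[1])
        out, _ = self.seq_layer(seq)
        if (self.layer_idx // 2) % 2 == 1:
            out = torch.flip(out, dims=[1])
        if self.layer_idx % 2 == 0:
            return out.reshape(B, H, W, C)
        return out.reshape(B, W, H, C).transpose(1, 2)


x = torch.randn(2, 8, 8, 32)
for i in range(4):                     # four layers cycle through the orderings
    x = AlternatingOrderBlock(32, i)(x)
print(x.shape)                         # torch.Size([2, 8, 8, 32])
```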
2402.05889 Report CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion Shoubin Yu, Jaehong Yoon, Mohit Bansal Despite impressive advancements in multimodal compositional reasoning approaches, they are still limited in their flexibility and efficiency by processing fixed modality inputs while updating a lot of model parameters. This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio) from given videos without extra human annotation by leveraging existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. It projects diverse modality features to the LLM token embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities. We validate our method on video-3D, video-audio, and video-language reasoning tasks and achieve better/equivalent performance against strong multimodal LLMs, including BLIP-2, 3D-LLM, and SeViLA while using 96% fewer trainable parameters. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations. This paper proposes CREMA, an efficient and modular modality-fusion framework for video reasoning that can integrate diverse modalities (e.g., video, audio, 3D point cloud) using parameter-efficient adapters and a novel self-gated fusion module (CREMA-Espresso). Current Multimodal Large Language Models (MLLMs) are computationally expensive and lack flexibility when adapting to new modalities, especially for video reasoning tasks that can benefit from diverse sensory inputs. CREMA leverages a frozen pre-trained vision-language model and introduces lightweight Modality-specific Multi-Query Adapters (MMQAs) with LoRA, learnable queries, and linear projections for each modality. CREMA-Espresso further fuses multimodal queries efficiently using self-gated attention. CREMA outperforms modality-specific baselines on SQA3D, MUSIC-AVQA, and NeXT-QA, showing improvements of +3.3%, +1.9%, and +0.9% respectively with significantly fewer parameters (2-4% of baselines). It also achieves better or comparable performance than general-purpose MLLMs like BLIP-2 and 3D-LLM in both fine-tuning and zero-shot settings. Analysis demonstrates the efficiency of the self-gated fusion module, the impact of additional modalities on answering hard questions, and provides qualitative visualizations of model reasoning. The reliance on pre-trained vision-language models may introduce potential biases present in the training data. Future work includes exploring the impact of varying LoRA ranks, optimizing the MMQA pre-training process, and evaluating on more diverse video reasoning benchmarks. multimodal learning, video reasoning, large language models, modality fusion, parameter efficiency
2402.05803 Report AvatarMMC: 3D Head Avatar Generation and Editing with Multi-Modal Conditioning Wamiq Reyaz Para, Abdelrahman Eldesokey, Zhenyu Li, Pradyumna Reddy, Jiankang Deng, Peter Wonka We introduce an approach for 3D head avatar generation and editing with multi-modal conditioning based on a 3D Generative Adversarial Network (GAN) and a Latent Diffusion Model (LDM). 3D GANs can generate high-quality head avatars given a single or no condition. However, it is challenging to generate samples that adhere to multiple conditions of different modalities. On the other hand, LDMs excel at learning complex conditional distributions. To this end, we propose to exploit the conditioning capabilities of LDMs to enable multi-modal control over the latent space of a pre-trained 3D GAN. Our method can generate and edit 3D head avatars given a mixture of control signals such as RGB input, segmentation masks, and global attributes. This provides better control over the generation and editing of synthetic avatars both globally and locally. Experiments show that our proposed approach outperforms a solely GAN-based approach both qualitatively and quantitatively on generation and editing tasks. To the best of our knowledge, our approach is the first to introduce multi-modal conditioning to 3D avatar generation and editing. Project page: avatarmmc-sig24.github.io. This paper proposes AvatarMMC, a novel framework for 3D head avatar generation and editing with multi-modal conditioning, using a 1D Latent Diffusion Model (LDM) to control the latent space of a pre-trained 3D GAN (Next3D). Existing methods for 3D avatar generation often struggle to incorporate multiple conditions simultaneously, limiting their controllability. AvatarMMC addresses this by enabling multi-modal control over avatar generation and editing, combining the quality of 3D GANs with the controllability of diffusion models. The method utilizes a pre-trained Next3D GAN for avatar generation and a 1D LDM to learn the mapping between multi-modal conditions (RGB input, segmentation masks, and attributes) and the GAN's latent space. Different encoders embed these conditions into a common space, and cross-attention layers in the LDM incorporate the conditions during the denoising process. AvatarMMC generates high-quality, diverse avatars adhering to various multi-modal conditions (e.g., RGB images, segmentation masks, and attributes). It enables high-fidelity avatar editing while preserving identity compared to a GAN-based baseline. The method is lightweight and fast for training and sampling, as it doesn't require retraining the GAN. The method inherits biases present in the training data and methods of the pre-trained 3D GAN. Future work could explore incorporating more conditioning modalities (e.g., sketches, landmarks) and joint control over animation. 3d avatar generation, multi-modal conditioning, latent diffusion models, generative adversarial networks, avatar editing
2402.05608 Report Scalable Diffusion Models with State Space Backbone Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang This paper presents a new exploration into a category of diffusion models built upon state space architecture. We endeavor to train diffusion models for image data, wherein the traditional U-Net backbone is supplanted by a state space backbone, functioning on raw patches or latent space. Given its notable efficacy in accommodating long-range dependencies, Diffusion State Space Models (DiS) are distinguished by treating all inputs including time, condition, and noisy image patches as tokens. Our assessment of DiS encompasses both unconditional and class-conditional image generation scenarios, revealing that DiS exhibits comparable, if not superior, performance to CNN-based or Transformer-based U-Net architectures of commensurate size. Furthermore, we analyze the scalability of DiS, gauged by the forward pass complexity quantified in Gflops. DiS models with higher Gflops, achieved through augmentation of depth/width or augmentation of input tokens, consistently demonstrate lower FID. In addition to demonstrating commendable scalability characteristics, DiS-H/2 models in latent space achieve performance levels akin to prior diffusion models on class-conditional ImageNet benchmarks at the resolution of 256×256 and 512×512, while significantly reducing the computational burden. The code and models are available at: https://github.com/feizc/DiS. This paper introduces DiS, a novel diffusion model architecture employing a state space backbone instead of the traditional U-Net structure for image generation. DiS aims to leverage the state space model's strength in handling long-range dependencies for efficient and scalable image generation, potentially surpassing CNN-based and Transformer-based U-Net models. DiS treats all inputs (time, condition, noisy image patches) as tokens processed by a bidirectional Mamba architecture, incorporating skip connections and a linear decoder for noise prediction. DiS achieves comparable performance to U-Net and Transformer-based models on CIFAR10 and CelebA 64x64 with fewer parameters. Scaling DiS by increasing depth/width consistently improves FID scores on ImageNet 256x256. DiS-H/2 achieves state-of-the-art FID on ImageNet 256x256 and outperforms ADM-G on ImageNet 512x512 in latent space. Model's performance hasn't fully converged, suggesting potential for further improvement. Exploration of larger models and token counts is left for future work. diffusion models, state space models, image generation, scalability, mamba architecture
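A minimal sketch of casting time, class condition, and noisy-image patches as one token sequence, as DiS does; the patchify scheme, embedding sizes, and the simple linear time embedding are illustrative assumptions, and the sequence backbone itself (a bidirectional Mamba stack in the paper) is omitted.

```python
import torch
import torch.nn as nn


class TokenizedDiffusionInput(nn.Module):
    def __init__(self, patch=4, channels=3, dim=256, num_classes=10):
        super().__init__()
        self.patchify = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.class_embed = nn.Embedding(num_classes, dim)

    def forward(self, x_noisy, t, y):
        # x_noisy: (B, C, H, W), t: (B,) diffusion timestep, y: (B,) class label
        patches = self.patchify(x_noisy).flatten(2).transpose(1, 2)    # (B, N, dim)
        t_tok = self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)  # (B, 1, dim)
        y_tok = self.class_embed(y).unsqueeze(1)                       # (B, 1, dim)
        return torch.cat([t_tok, y_tok, patches], dim=1)               # (B, N + 2, dim)


tok = TokenizedDiffusionInput()
seq = tok(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2,)), torch.randint(0, 10, (2,)))
print(seq.shape)  # torch.Size([2, 66, 256])
```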
2402.05472 Report Question Aware Vision Transformer for Multimodal Reasoning Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman Vision-Language (VL) models have gained significant research focus, enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder, a Large Language Model (LLM), and a projection module that aligns visual features with the LLM's representation space. Despite their success, a critical limitation persists: the vision encoding process remains decoupled from user queries, often in the form of image-related questions. Consequently, the resulting visual features may not be optimally attuned to the query-specific elements of the image. To address this, we introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning, which embeds question awareness directly within the vision encoder. This integration results in dynamic visual features focusing on relevant image aspects to the posed question. QA-ViT is model-agnostic and can be incorporated efficiently into any VL architecture. Extensive experiments demonstrate the effectiveness of applying our method to various multimodal architectures, leading to consistent improvement across diverse tasks and showcasing its potential for enhancing visual and scene-text understanding. This paper introduces QA-ViT, a question-aware vision transformer approach for multimodal reasoning. QA-ViT embeds question awareness directly into the vision encoder, resulting in dynamic visual features focused on relevant image aspects. Existing Vision-Language models suffer from a decoupling of vision encoding and user queries. This leads to suboptimal visual features that may not be attuned to the specific elements of an image relevant to the query. QA-ViT uses a two-stage process: 1) a question encoding module processes the textual prompt into representations, 2) a question fusing module integrates the representations into the vision model via the self-attention mechanism. This approach allows the model to extract text-aware visual features. QA-ViT leads to substantial and consistent improvements across diverse VL tasks and architectures, including ViT+T5, BLIP2, InstructBLIP, and LLaVA-1.5. QA-ViT shows significant benefits in scenarios requiring reasoning over nuanced, low-level image details, which are often overlooked by standard vision encoders. The method exhibits consistent performance gains across various LLM scales, demonstrating its compatibility with different model sizes. While QA-ViT demonstrates strong performance in natural image domains, its effectiveness is limited in dense-text scenarios like document understanding. Future work could explore designated pretraining techniques tailored for QA-ViT to further enhance its capabilities. vision-language models, multimodal reasoning, question answering, image captioning, vision transformer
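A minimal sketch of fusing question tokens into a ViT self-attention block, in the spirit of QA-ViT: projected text tokens are appended to the visual sequence so visual tokens can attend to the question, and only the visual positions are kept afterwards. The dimensions and the simple concatenation scheme are illustrative assumptions, not the paper's exact fusion design.

```python
import torch
import torch.nn as nn


class QuestionFusedAttention(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=512, heads=12):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, vis_dim); txt_tokens: (B, Nt, txt_dim)
        fused = torch.cat([vis_tokens, self.txt_proj(txt_tokens)], dim=1)
        out, _ = self.attn(fused, fused, fused)
        return out[:, : vis_tokens.shape[1]]   # keep only the visual positions


layer = QuestionFusedAttention()
vis = torch.randn(2, 197, 768)   # e.g. ViT-B/16 tokens for a 224x224 image
txt = torch.randn(2, 12, 512)    # encoded question tokens
print(layer(vis, txt).shape)     # torch.Size([2, 197, 768])
```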
2402.05408 Report MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis Dewei Zhou, You Li, Fan Ma, Xiaoting Zhang, Yi Yang We present a Multi-Instance Generation (MIG) task, simultaneously generating multiple instances with diverse controls in one image. Given a set of predefined coordinates and their corresponding descriptions, the task is to ensure that generated instances are accurately at the designated locations and that all instances' attributes adhere to their corresponding description. This broadens the scope of current research on Single-instance generation, elevating it to a more versatile and practical dimension. Inspired by the idea of divide and conquer, we introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task. Initially, we break down the MIG task into several subtasks, each involving the shading of a single instance. To ensure precise shading for each instance, we introduce an instance enhancement attention mechanism. Lastly, we aggregate all the shaded instances to provide the necessary information for accurately generating multiple instances in stable diffusion (SD). To evaluate how well generation models perform on the MIG task, we provide a COCO-MIG benchmark along with an evaluation pipeline. Extensive experiments were conducted on the proposed COCO-MIG benchmark, as well as on various commonly used benchmarks. The evaluation results illustrate the exceptional control capabilities of our model in terms of quantity, position, attribute, and interaction. Code and demos will be released at https://migcproject.github.io/. This paper presents Multi-Instance Generation (MIG), a task focused on generating multiple instances with diverse user controls within a single image, along with a novel approach named Multi-Instance Generation Controller (MIGC) to address this task. MIG tackles limitations of single-instance generation, offering more versatile and practical applications in image synthesis by enabling control over quantity, position, attributes, and interaction of multiple instances. MIGC leverages a divide and conquer strategy: dividing the task into single-instance shading subtasks, conquering them using an Enhancement Attention Layer, and combining the results via Layout Attention and a Shading Aggregation Controller. On the COCO-MIG benchmark, MIGC significantly improved Instance Success Rate from 32.39% to 58.43%. On the COCO benchmark, MIGC demonstrated notable improvements in Average Precision (AP), increasing it from 40.68/68.26/42.85 to 54.69/84.17/61.71. On DrawBench, MIGC achieved advancements across position, attribute, and count control, especially raising the attribute success rate from 48.20% to 97.50%. MIGC relies on the single-instance generation capabilities of the pre-trained stable diffusion model. If stable diffusion struggles to generate a specific instance, MIGC will also face difficulties. While MIGC exhibits strong control over instance attributes and positions, further research is needed to enhance the control of interactive relationships between instances. multi-instance generation, text-to-image synthesis, layout control, stable diffusion, attention mechanisms
2402.05382 Report Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts Zhili Liu, Kai Chen, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, James T. Kwok Masked Autoencoder~(MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training. However, when the various downstream tasks have data distributions different from the pre-training data, the semantically irrelevant pre-training information might result in negative transfer, impeding MAE's scalability. To address this issue, we propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE), which can be trained once but provides customized pre-training models for diverse downstream tasks. Different from the mixture of experts (MoE), our MoCE trains each expert only with semantically relevant images by using cluster-conditional gates. Thus, each downstream task can be allocated to its customized model pre-trained with data most similar to the downstream data. Experiments on a collection of 11 downstream tasks show that MoCE outperforms the vanilla MAE by 2.45\% on average. It also obtains new state-of-the-art self-supervised learning results on detection and segmentation. This paper proposes MoCE (Mixture of Cluster-conditional Experts), a novel MAE-based pre-training paradigm that addresses the negative transfer problem in MAE by providing customized pre-trained models for diverse downstream tasks. MAE, while effective for model pre-training, can suffer from negative transfer when applied to downstream tasks with data distributions different from the pre-training data. This limits MAE’s scalability and transferability. MoCE first clusters the dataset using a pre-trained MAE. Then, it trains a multi-expert architecture where each expert focuses on images from specific clusters with similar semantics, guided by cluster-conditional gates. For deployment, MoCE selects the most relevant expert for each downstream task based on its data distribution. MoCE outperforms vanilla MAE by 2.45% on average across 11 downstream tasks. MoCE achieves state-of-the-art self-supervised learning results on detection and segmentation. MoCE demonstrates superior performance compared to TokenMoE and SDR, showcasing the effectiveness of its cluster-conditional expert routing. The number of experts and clusters might be a bottleneck for further performance improvement. Exploration on larger datasets and more diverse downstream tasks is needed. self-supervised learning, masked autoencoder, mixture of experts, negative transfer, task-customized pre-training
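A minimal sketch of cluster-conditional routing in the spirit of MoCE: each image is assigned to a cluster (offline, e.g. k-means on MAE features) and a cluster-to-expert mapping decides which expert processes it. The fixed random mapping, the MLP experts, and the feature dimension are illustrative assumptions; the paper learns its cluster-conditional gates rather than fixing them.

```python
import torch
import torch.nn as nn


class ClusterConditionalExperts(nn.Module):
    def __init__(self, dim=256, num_experts=4, num_clusters=16, seed=0):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        g = torch.Generator().manual_seed(seed)
        # Fixed cluster-to-expert assignment (a stand-in for learned gates).
        self.register_buffer(
            "cluster_to_expert",
            torch.randint(0, num_experts, (num_clusters,), generator=g),
        )

    def forward(self, feats, cluster_ids):
        # feats: (B, dim); cluster_ids: (B,) from offline clustering
        out = torch.zeros_like(feats)
        expert_ids = self.cluster_to_expert[cluster_ids]
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                out[mask] = expert(feats[mask])   # each expert sees only "its" clusters
        return out


moe = ClusterConditionalExperts()
print(moe(torch.randn(8, 256), torch.randint(0, 16, (8,))).shape)  # torch.Size([8, 256])
```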
2402.05375 Report Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two contributions, which we refer to as soft-weighted regularization and inference-time text embedding optimization. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively on extensive experiments, validating its effectiveness. Furthermore, our method generalizes to both pixel-space diffusion models (i.e., DeepFloyd-IF) and latent-space diffusion models (i.e., Stable Diffusion). This paper introduces a novel method for suppressing the generation of undesired content (negative targets) in text-to-image diffusion models by manipulating text embeddings, enabling more precise control over image generation. Current text-to-image models struggle to effectively omit content explicitly requested to be excluded in the prompt, limiting precise image generation control. The method utilizes two steps: 1) **Soft-weighted regularization**: Applying SVD to a negative target embedding matrix, then regularizing singular values to suppress negative target information in the [EOT] embeddings. 2) **Inference-time text embedding optimization**: Optimizing the whole text embeddings with two losses - negative target prompt suppression (weakens negative target attention) and positive target prompt preservation (strengthens desired content attention). The proposed method effectively suppresses negative target generation without needing to fine-tune the image generator or collect paired images. Quantitative and qualitative evaluations on various datasets demonstrate superior performance compared to existing baselines, achieving the best CLIPScore and DetScore and comparable IFID. The method proves versatile, applicable to both pixel-space and latent-space diffusion models, and adaptable for tasks like image restoration and content strengthening. The current implementation requires around 30 seconds for inference-time optimization, limiting its practicality in real-time applications. The method relies on concise prompts primarily describing objects, struggling with lengthy and abstract descriptions. text-to-image generation, diffusion models, negative content suppression, text embedding manipulation, image editing
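A minimal sketch of suppressing information in a text-embedding matrix via SVD, in the spirit of the soft-weighted regularization described above: the stacked negative-target and [EOT] embeddings are decomposed and their leading singular values are attenuated before reconstruction. The attenuation schedule used here is a placeholder assumption, not the paper's exact weighting formula.

```python
import torch


def soft_weighted_suppress(embed_matrix: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    """embed_matrix: (num_tokens, dim) stack of negative-target + [EOT] embeddings."""
    U, S, Vh = torch.linalg.svd(embed_matrix, full_matrices=False)
    # Attenuate large singular values, which carry most of the shared
    # (negative-target) content; small ones are left nearly untouched.
    weights = torch.exp(-strength * S / S.max())
    return U @ torch.diag(S * weights) @ Vh


emb = torch.randn(77, 768)        # e.g. CLIP text embeddings including [EOT] tokens
print(soft_weighted_suppress(emb).shape)   # torch.Size([77, 768])
```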
2402.05235 Report SPAD : Spatially Aware Multiview Diffusers Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, Aliaksandr Siarohin We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g. MVDream) leads to content copying between views. Therefore, we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency, we utilize Plucker coordinates derived from camera rays and inject them as positional encoding. This enables SPAD to reason over spatial proximity in 3D well. In contrast to recent works that can only generate views at fixed azimuth and elevation, SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue. See more details at our webpage: https://yashkant.github.io/spad This paper introduces SPAD, a novel framework that leverages pre-trained text-to-image diffusion models to generate consistent multi-view images from text prompts or single images. Generating high-quality 3D content is crucial for various applications. SPAD addresses limitations in existing methods by incorporating 3D understanding into 2D diffusion models, enabling consistent multi-view generation with precise camera control. The authors extend a pre-trained 2D diffusion model with cross-view interactions using Epipolar Attention and Plücker Ray Embeddings. Epipolar Attention restricts attention to epipolar lines, enhancing 3D consistency. Plücker Embeddings provide positional encoding based on camera rays, preventing object flipping artifacts. SPAD achieves state-of-the-art results in novel view synthesis on unseen objects from Objaverse and Google Scanned Objects datasets. The method demonstrates better camera control and generates consistent multi-view images from diverse viewpoints. SPAD effectively prevents the multi-face Janus issue in text-to-3D generation using multi-view Score Distillation Sampling. Limitations: The method currently relies on a two-view training setup and could benefit from exploring monocular depth estimation for improved correspondences. Future Work: Extending SPAD to generate dynamic 4D assets and multi-object scenes, as well as leveraging larger diffusion models like SDXL for enhanced performance. multi-view generation, text-to-3d synthesis, diffusion models, epipolar geometry, plücker coordinates
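A minimal sketch of the Plücker ray embedding used as positional encoding: each pixel ray is described by its normalized direction d and moment m = o × d, giving a 6D code that does not depend on where along the ray the origin is sampled. The camera conventions and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def plucker_embedding(origins: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """origins, directions: (..., 3) per-pixel camera rays -> (..., 6) Plücker codes."""
    d = F.normalize(directions, dim=-1)
    m = torch.cross(origins, d, dim=-1)   # moment of the ray about the world origin
    return torch.cat([d, m], dim=-1)


rays_o = torch.zeros(4, 64, 64, 3)        # e.g. a camera placed at the world origin
rays_d = F.normalize(torch.randn(4, 64, 64, 3), dim=-1)
print(plucker_embedding(rays_o, rays_d).shape)   # torch.Size([4, 64, 64, 6])
```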
2402.05195 Report λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang Despite the recent advances in personalized text-to-image (P-T2I) generative models, it remains challenging to perform finetuning-free multi-subject-driven T2I in a resource-efficient manner. Predominantly, contemporary approaches, involving the training of Hypernetworks and Multimodal Large Language Models (MLLMs), require heavy computing resources that range from 600 to 12300 GPU hours of training. These subject-driven T2I methods hinge on Latent Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention layers. While LDMs offer distinct advantages, P-T2I methods' reliance on the latent space of these diffusion models significantly escalates resource demands, leading to inconsistent results and necessitating numerous iterations for a single desired image. In this paper, we present λ-ECLIPSE, an alternative prior-training strategy that works in the latent space of a pre-trained CLIP model without relying on the diffusion UNet models. λ-ECLIPSE leverages the image-text interleaved pre-training for fast and effective multi-subject-driven P-T2I. Through extensive experiments, we establish that λ-ECLIPSE surpasses existing baselines in composition alignment while preserving concept alignment performance, even with significantly lower resource utilization. λ-ECLIPSE performs multi-subject driven P-T2I with just 34M parameters and is trained on a mere 74 GPU hours. Additionally, λ-ECLIPSE demonstrates the unique ability to perform multi-concept interpolations. λ-ECLIPSE is a resource-efficient, diffusion-independent prior learning strategy for enabling fast multi-subject customization in personalized text-to-image generation. Existing personalized text-to-image generation methods, especially those involving multi-subject customization, are computationally expensive, requiring significant GPU hours and large models. λ-ECLIPSE addresses this resource efficiency issue. λ-ECLIPSE leverages a contrastive text-to-image strategy within the latent space of a pre-trained CLIP model, eliminating the dependence on diffusion models during training. It employs an image-text interleaved pre-training approach, substituting subject-specific text embeddings with corresponding image embeddings, and incorporates Canny edge maps for enhanced control over image generation. λ-ECLIPSE achieves competitive performance in composition alignment while maintaining concept alignment, even with significantly lower resource utilization (34M parameters and 74 GPU hours). It outperforms baseline methods on the Multibench dataset for multi-subject generation, particularly in text-composition alignment. The method effectively incorporates Canny edge maps as conditional guidance, balancing subject details with edge map adherence, unlike other methods that overemphasize edge maps. CLIP's limitations in capturing hierarchical representations can lead to less-than-ideal results, particularly for complex subjects. While significantly more efficient, there is still a performance gap between λ-ECLIPSE and fine-tuning-based methods, suggesting room for improvement potentially through larger datasets and models. personalized text-to-image generation, multi-subject customization, resource-efficient, diffusion-free, clip latent space
2402.05054 Report LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, Ziwei Liu 3D content creation has achieved significant progress in terms of both quality and speed. Although current feed-forward models can produce 3D objects in seconds, their resolution is constrained by the intensive computation required during training. In this paper, we introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images. Our key insights are two-fold: 1) 3D Representation: We propose multi-view Gaussian features as an efficient yet powerful representation, which can then be fused together for differentiable rendering. 2) 3D Backbone: We present an asymmetric U-Net as a high-throughput backbone operating on multi-view images, which can be produced from text or single-view image input by leveraging multi-view diffusion models. Extensive experiments demonstrate the high fidelity and efficiency of our approach. Notably, we maintain the fast speed to generate 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation. This paper introduces LGM, a novel framework that generates high-resolution 3D models from text prompts or single-view images using multi-view Gaussian features and an asymmetric U-Net backbone. Current 3D generation methods are either slow (optimization-based) or limited in resolution (feed-forward). LGM aims to achieve both high fidelity and speed in 3D content creation. LGM utilizes an asymmetric U-Net to predict and fuse 3D Gaussian features from multi-view images. It leverages existing multi-view diffusion models for image/text-to-multi-view image generation and employs data augmentation for robust training. A mesh extraction algorithm converts 3D Gaussians to polygonal meshes. LGM generates high-quality 3D Gaussians and meshes, outperforming previous methods in both image-to-3D and text-to-3D tasks. The method maintains fast generation speed (around 5 seconds) while significantly increasing training resolution (up to 512). LGM demonstrates good diversity in generating various plausible 3D objects from a single input. The quality of LGM's output depends on the accuracy and resolution of the multi-view images generated by diffusion models. Current multi-view diffusion models struggle with high elevation angles and may produce inconsistent 3D information, affecting the final 3D model quality. 3d generation, gaussian splatting, high resolution, multi-view diffusion models, u-net
2402.05008 Report EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss Zhuoyang Zhang, Han Cai, Song Han We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For the training, we begin with the knowledge distillation from the SAM-ViT-H image encoder to EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers 48.9x measured TensorRT speedup on A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit. Presents EfficientViT-SAM, an accelerated version of Segment Anything Model (SAM) using EfficientViT for improved efficiency in image segmentation. Addresses the high computational cost of SAM, making it more practical for time-sensitive applications while maintaining performance. Replaces SAM's image encoder with EfficientViT. The model is trained in two phases: knowledge distillation from SAM-ViT-H to EfficientViT and end-to-end training on the SA-1B dataset. EfficientViT-SAM achieves a 17x to 69x speedup compared to SAM. It outperforms other accelerated SAM models in terms of both efficiency and accuracy on zero-shot segmentation benchmarks. EfficientViT-SAM demonstrates strong performance in point-prompted, box-prompted, and segment-everything segmentation modes. The model's performance might be further enhanced by exploring advanced knowledge distillation techniques. Future work can investigate the application of EfficientViT-SAM in real-world scenarios such as video segmentation. image segmentation, segment anything model, efficientvit, zero-shot learning, model acceleration
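A minimal sketch of the first training phase described above: feature-level knowledge distillation from a frozen teacher image encoder (SAM-ViT-H) to a lighter student (EfficientViT). Both encoders are placeholder modules here, so only the distillation objective is illustrated, not the actual architectures or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Conv2d(3, 256, kernel_size=16, stride=16)   # placeholder for SAM-ViT-H encoder
student = nn.Conv2d(3, 256, kernel_size=16, stride=16)   # placeholder for EfficientViT
teacher.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)


def distill_step(images: torch.Tensor) -> float:
    with torch.no_grad():
        target = teacher(images)          # teacher image embeddings
    pred = student(images)
    loss = F.mse_loss(pred, target)       # match the teacher's feature map
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


print(distill_step(torch.randn(2, 3, 224, 224)))
```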
2402.04930 Report Blue noise for diffusion models Xingchang Huang, Corentin Salaün, Cristina Vasconcelos, Christian Theobalt, Cengiz Öztireli, Gurprit Singh Most of the existing diffusion models use Gaussian noise for training and sampling across all time steps, which may not optimally account for the frequency contents reconstructed by the denoising network. Despite the diverse applications of correlated noise in computer graphics, its potential for improving the training process has been underexplored. In this paper, we introduce a novel and general class of diffusion models taking correlated noise within and across images into account. More specifically, we propose a time-varying noise model to incorporate correlated noise into the training process, as well as a method for fast generation of correlated noise mask. Our model is built upon deterministic diffusion models and utilizes blue noise to help improve the generation quality compared to using Gaussian white (random) noise only. Further, our framework allows introducing correlation across images within a single mini-batch to improve gradient flow. We perform both qualitative and quantitative evaluations on a variety of datasets using our method, achieving improvements on different tasks over existing deterministic diffusion models in terms of FID metric. This paper introduces a novel diffusion model framework that leverages correlated noise, particularly blue noise, to enhance the quality of generated images. Most diffusion models rely solely on Gaussian noise, which may not be optimal for capturing the frequency content during the denoising process. Correlated noise, with its frequency-specific properties, offers a potential solution for improving generation quality. The authors propose a time-varying noise model that interpolates between Gaussian noise and blue noise throughout the diffusion process. They also introduce a method for fast generation of correlated noise masks using padding, ensuring efficient training. The model is evaluated on various image generation tasks using deterministic diffusion models like IADB. The proposed method consistently outperforms existing deterministic models, such as IADB and DDIM, on several datasets in terms of FID scores, particularly for resolutions of 64x64. Visual comparisons highlight the superior quality of generated images, particularly in detail-rich areas like hair, eyes, and mouths. Analysis reveals that incorporating blue noise from the middle or later stages of the diffusion process, when low-frequency components are established, yields the best results. The optimal parameters for the noise scheduler currently depend on the image resolution, requiring further investigation for a more general approach. Extending the model to higher resolutions presents computational challenges for generating correlated noise masks, demanding more efficient solutions. blue noise, diffusion models, generative modeling, image generation, time-varying noise
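A minimal sketch of a time-varying blend between Gaussian (white) noise and a blue-noise-like mask, loosely matching the observation above that correlated noise helps mostly from the middle or later stages of the schedule. The blue noise is crudely approximated here by high-pass weighting white noise in the Fourier domain, and the linear gamma(t) schedule is an illustrative choice, not the paper's precomputed masks or scheduler.

```python
import torch


def blue_like_noise(shape):
    white = torch.randn(shape)
    freq = torch.fft.fft2(white)
    fy = torch.fft.fftfreq(shape[-2]).reshape(-1, 1)
    fx = torch.fft.fftfreq(shape[-1]).reshape(1, -1)
    radius = torch.sqrt(fx ** 2 + fy ** 2)              # emphasize high frequencies
    blue = torch.fft.ifft2(freq * radius).real
    return (blue - blue.mean()) / (blue.std() + 1e-8)   # renormalize to unit variance


def time_varying_noise(shape, t: float, t_switch: float = 0.5):
    """t in [0, 1]: pure white noise early in the schedule, increasingly blue later."""
    gamma = max(0.0, (t - t_switch) / (1.0 - t_switch))
    return (1.0 - gamma) * torch.randn(shape) + gamma * blue_like_noise(shape)


print(time_varying_noise((1, 3, 64, 64), t=0.8).shape)  # torch.Size([1, 3, 64, 64])
```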
2402.04648 Report OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding Guibiao Liao, Kaichen Zhou, Zhenyu Bao, Kanglin Liu, Qing Li The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to noisy and view-inconsistent semantics provided by CLIP. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask proposals derived from SAM to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the 3D consistent semantics generated from the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments validate our OV-NeRF outperforms current state-of-the-art methods, achieving a significant improvement of 20.31% and 18.42% in mIoU metric on Replica and Scannet, respectively. Furthermore, our approach exhibits consistent superior results across various CLIP configurations, further verifying its robustness. This paper introduces OV-NeRF, a novel approach for accurate open-vocabulary 3D semantic understanding of Neural Radiance Fields (NeRFs) leveraging the capabilities of pre-trained vision and language foundation models, such as CLIP and Segment Anything (SAM). Existing methods for extracting semantics from CLIP for NeRF semantic field learning face challenges due to the noisy and view-inconsistent nature of CLIP-derived semantics, hindering accurate 3D semantic understanding. OV-NeRF tackles these limitations through two key strategies: 1) **Region Semantic Ranking (RSR) regularization**: employs region proposals from SAM to rectify noisy semantics in each training view, improving the accuracy of single-view relevancy maps. 2) **Cross-view Self-enhancement (CSE)**: addresses view inconsistency by leveraging the 3D consistency of NeRFs, utilizing rendered outputs from the trained semantic field to refine and enhance the consistency of semantic maps across multiple views. OV-NeRF significantly outperforms state-of-the-art methods, achieving a remarkable improvement of 20.31% and 18.42% in mIoU on Replica and Scannet datasets, respectively. The method exhibits consistent superior performance across various CLIP configurations, indicating its robustness and generalizability. Ablation studies confirm the effectiveness of both RSR and CSE strategies in enhancing the accuracy and view consistency of semantic understanding in NeRFs. The reliance on pre-computed CLIP features and SAM proposals could introduce limitations in scenarios with significant domain shifts. 
Future work could explore extending OV-NeRF to handle dynamic scenes and incorporate temporal consistency. neural radiance fields, 3d semantic segmentation, open-vocabulary learning, vision and language models, clip
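A minimal sketch of the Region Semantic Ranking idea from the OV-NeRF entry above, under simplifying assumptions: every SAM region is assigned the class with the highest region-averaged CLIP relevancy, which rectifies pixel-level noise. The ranking details, thresholds, and training losses of the actual method are not reproduced.

```python
import torch

def region_semantic_rectify(relevancy, region_masks):
    """Simplified RSR-style rectification: for each SAM region, rank classes by the
    region-averaged relevancy and assign the top class to every pixel inside it.
    relevancy: (K, H, W) noisy per-class relevancy maps; region_masks: (R, H, W) bool."""
    labels = relevancy.argmax(dim=0)                         # fallback: per-pixel argmax
    for mask in region_masks:
        if mask.any():
            region_scores = relevancy[:, mask].mean(dim=1)   # (K,) averaged inside region
            labels[mask] = region_scores.argmax()
    return labels                                            # (H, W) rectified semantic map

# Toy usage with random relevancy maps and three rectangular "SAM" regions.
rel = torch.rand(5, 64, 64)
masks = torch.zeros(3, 64, 64, dtype=torch.bool)
masks[0, :32, :32] = True; masks[1, 32:, :] = True; masks[2, :32, 32:] = True
seg = region_semantic_rectify(rel, masks)
```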
2402.04630 Report LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, Shijian Lu Inspired by the outstanding zero-shot capability of vision language models (VLMs) in image classification tasks, open-vocabulary object detection has attracted increasing interest by distilling the broad VLM knowledge into detector training. However, most existing open-vocabulary detectors learn by aligning region embeddings with categorical labels (e.g., bicycle) only, disregarding the capability of VLMs on aligning visual embeddings with fine-grained text description of object parts (e.g., pedals and bells). This paper presents DVDet, a Descriptor-Enhanced Open Vocabulary Detector that introduces conditional context prompts and hierarchical textual descriptors that enable precise region-text alignment as well as open-vocabulary detection training in general. Specifically, the conditional context prompt transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training. In addition, we introduce large language models as an interactive and implicit knowledge repository which enables iterative mining and refining visually oriented textual descriptors for precise region-text alignment. Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins. This paper introduces DVDet, a novel open-vocabulary object detection method that leverages fine-grained textual descriptors to improve region-text alignment. Existing open-vocabulary detectors underutilize the knowledge in VLMs by focusing solely on category-level alignment and neglecting the fine-grained descriptor-level alignment where VLMs excel. DVDet utilizes a Conditional Context regional Prompt (CCP) to transform region embeddings into image-like representations for improved integration with existing detectors. It also employs a hierarchical descriptor generation mechanism that iteratively interacts with LLMs to refine fine-grained descriptors for precise region-text alignment. DVDet consistently outperforms state-of-the-art open-vocabulary detectors on COCO and LVIS benchmarks. The iterative interaction with LLMs for descriptor generation proves superior to using LLMs as a static knowledge base. DVDet demonstrates strong generalization ability, showing improvements when transferred to PASCAL VOC and LVIS datasets even without re-training. The descriptor generation process relies on the performance of LLMs, which can be a bottleneck. The method primarily focuses on improving classification accuracy, and future work could explore incorporating fine-grained descriptors into the localization branch. open-vocabulary object detection, vision language models, large language models, prompt learning, fine-grained descriptors
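A hedged sketch of the descriptor-level scoring that DVDet's idea rests on: a region embedding is compared not only with the category name but also with fine-grained part descriptors, and the similarities are averaged. The prompt template, the descriptor list, and the stand-in text encoder are illustrative assumptions, not the paper's CCP or hierarchical descriptor-mining pipeline.

```python
import torch
import torch.nn.functional as F

def descriptor_score(region_emb, text_encoder, category, descriptors):
    """Score one region against a category by averaging similarities to the
    category name and its fine-grained descriptors (e.g. "bicycle" + "pedals")."""
    prompts = [category] + [f"{category}, which has {d}" for d in descriptors]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (P, D)
    region_emb = F.normalize(region_emb, dim=-1)            # (D,)
    return (text_emb @ region_emb).mean()

# Toy usage with a stand-in text encoder (a real system would use a VLM such as CLIP).
def fake_text_encoder(prompts, dim=512):
    torch.manual_seed(len(" ".join(prompts)))
    return torch.randn(len(prompts), dim)

region = torch.randn(512)
score = descriptor_score(region, fake_text_encoder, "bicycle",
                         descriptors=["pedals", "a bell", "two wheels"])
```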
2402.04625 Report Noise Map Guidance: Inversion with Spatial Context for Real Image Editing Hansam Cho, Jonghyun Lee, Seoung Bum Kim, Tae-Hyun Oh, Yonghyun Jeong Text-guided diffusion models have become a popular tool in image synthesis, known for producing high-quality and diverse images. However, their application to editing real images often encounters hurdles primarily due to the text condition deteriorating the reconstruction quality and subsequently affecting editing fidelity. Null-text Inversion (NTI) has made strides in this area, but it fails to capture spatial context and requires computationally intensive per-timestep optimization. Addressing these challenges, we present Noise Map Guidance (NMG), an inversion method rich in a spatial context, tailored for real-image editing. Significantly, NMG achieves this without necessitating optimization, yet preserves the editing quality. Our empirical investigations highlight NMG's adaptability across various editing techniques and its robustness to variants of DDIM inversions. This paper introduces Noise Map Guidance (NMG), an inversion method for real-image editing with text-guided diffusion models that preserves spatial context without requiring optimization. Existing text-guided diffusion models struggle to edit real images due to the deterioration of reconstruction quality stemming from text conditions. While Null-text Inversion (NTI) addresses this, it requires computationally intensive per-timestep optimization and can fail to capture spatial context. NMG leverages latent variables from DDIM inversion, referred to as 'noise maps,' which inherently capture spatial context. By conditioning the reverse process on noise maps and reformulating it using energy guidance, NMG guides the reconstruction path to align with the DDIM inversion trajectory. NMG effectively preserves spatial context during real-image editing, surpassing DDIM, NTI, NPI, and ProxNPI in qualitative and quantitative evaluations. NMG shows consistent robustness across variations of DDIM inversion, as demonstrated by its integration with pix2pix-zero for image-to-image translation. Evaluations using CLIPScore, TIFA, and a user study confirm that NMG achieves high editing quality, aligning with human perception of image fidelity. NMG faces challenges integrating with methods that deviate from the inversion-based editing paradigm, such as SGC-Net for relationship change tasks. NMG's reliance on text for image editing limits its ability to perform precise spatial changes, such as removing specific individuals or adding objects at exact locations. image editing, diffusion models, spatial context, inversion, noise map guidance
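Because NMG conditions on the latents recorded during DDIM inversion, a short sketch of deterministic DDIM inversion that stores these per-timestep "noise maps" is given below. The noise predictor, schedule, and step count are placeholders, and NMG's energy-guidance reformulation of the reverse process is not shown.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps):
    """Deterministic DDIM inversion: map a clean latent x0 back toward noise,
    recording the intermediate latents ("noise maps") that carry spatial context.
    `eps_model(x, t)` is any noise predictor."""
    T = len(alphas_cumprod)
    timesteps = torch.linspace(0, T - 1, num_steps).long()
    x, noise_maps = x0, [x0]
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
        noise_maps.append(x)
    return noise_maps   # one latent per timestep

# Toy usage with a dummy noise predictor and a linear-beta schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
maps = ddim_invert(torch.randn(1, 4, 64, 64),
                   eps_model=lambda x, t: torch.zeros_like(x),
                   alphas_cumprod=alphas_cumprod, num_steps=50)
```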
2402.04618 Report Multi-Scale Semantic Segmentation with Modified MBConv Blocks Xi Chen, Yang Cai, Yuan Wu, Bo Xiong, Taesung Park Recently, MBConv blocks, initially designed for efficiency in resource-limited settings and later adapted for cutting-edge image classification performances, have demonstrated significant potential in image classification tasks. Despite their success, their application in semantic segmentation has remained relatively unexplored. This paper introduces a novel adaptation of MBConv blocks specifically tailored for semantic segmentation. Our modification stems from the insight that semantic segmentation requires the extraction of more detailed spatial information than image classification. We argue that to effectively perform multi-scale semantic segmentation, each branch of a U-Net architecture, regardless of its resolution, should possess equivalent segmentation capabilities. By implementing these changes, our approach achieves impressive mean Intersection over Union (IoU) scores of 84.5% and 84.0% on the Cityscapes test and validation datasets, respectively, demonstrating the efficacy of our proposed modifications in enhancing semantic segmentation performance. This paper proposes a novel adaptation of MBConv blocks, incorporating modifications in multi-scale segmentation and block structure, to enhance their efficacy in semantic segmentation tasks. Existing MBConv blocks, despite their success in image classification, remain largely unexplored for semantic segmentation, which requires detailed spatial information extraction, unlike classification. The study modifies the U-Net architecture by maintaining uniform feature maps and architectural blocks across all scales. It also replaces 1x1 convolutions within MBConv blocks with 3x3 convolutions to capture more spatial context. Achieved mean Intersection over Union (IoU) scores of 84.5% and 84.0% on Cityscapes test and validation datasets, respectively, outperforming existing methods. Demonstrated that maintaining consistent learning power across scales improves segmentation accuracy. Showed that replacing 1x1 with 3x3 convolutions in MBConv blocks enhances spatial detail capture, despite increasing memory and processing demands. The modification to MBConv blocks increases memory usage by 10% and processing time by 30%. The switch to 3x3 convolutions increases the number of parameters significantly. semantic segmentation, mbconv blocks, u-net, multi-scale segmentation, spatial context
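A minimal PyTorch sketch of the described modification, replacing the usual 1x1 expansion and projection convolutions of an MBConv-style block with 3x3 convolutions; the squeeze-and-excitation stage and the exact normalization and activation choices of the paper are omitted or assumed.

```python
import torch
import torch.nn as nn

class MBConv3x3(nn.Module):
    """MBConv-style inverted bottleneck where the usual 1x1 expansion and
    projection convs are replaced by 3x3 convs to capture more spatial context.
    (SE block and the paper's exact norm/activation choices are omitted.)"""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),          # depthwise conv
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # residual connection

x = torch.randn(2, 64, 32, 32)
y = MBConv3x3(64)(x)               # same shape: (2, 64, 32, 32)
```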
2402.04563 Report Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention Saebom Leem, Hyunseok Seo Vision Transformer(ViT) is one of the most widely used models in the computer vision field with its great performance on various tasks. In order to fully utilize the ViT-based architecture in various applications, proper visualization methods with a decent localization performance are necessary, but these methods employed in CNN-based models are still not available in ViT due to its unique structure. In this work, we propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision. Our method selectively aggregates the gradients directly propagated from the classification output to each self-attention, collecting the contribution of image features extracted from each location of the input image. These gradients are additionally guided by the normalized self-attention scores, which are the pairwise patch correlation scores. They are used to supplement the gradients on the patch-level context information efficiently detected by the self-attention mechanism. This approach of our method provides elaborate high-level semantic explanations with great localization performance only with the class labels. As a result, our method outperforms the previous leading explainability methods of ViT in the weakly-supervised localization task and presents great capability in capturing the full instances of the target class object. Meanwhile, our method provides a visualization that faithfully explains the model, which is demonstrated in the perturbation comparison test. This paper presents an attention-guided gradient analysis method for Vision Transformer (ViT) to enhance weakly-supervised localization performance. Proper visualization methods with good localization ability are crucial for utilizing ViT models in various applications, and existing methods often fall short due to ViT's unique architecture. The method aggregates gradients from the classification output to each self-attention block, guided by self-attention scores normalized with sigmoid. This approach combines high-level semantic information from gradients with patch correlation information from self-attention. The method outperforms previous ViT visualization techniques in weakly-supervised object detection on ImageNet, PASCAL VOC, and CUB200 datasets. It effectively mitigates peak intensities that hinder accurate localization in other methods. The method excels at capturing full object areas, including multiple instances of the target class. The method exhibits a slight trade-off between precision and recall compared to some existing methods. Future work can explore incorporating information from non-target classes for further localization improvement. vision transformer, explainable ai, weakly-supervised localization, class activation map, self-attention
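A simplified sketch of the aggregation this entry describes: per-block class-score gradients are guided by sigmoid-normalized self-attention and accumulated into a patch-level map. The exact normalization and aggregation rule of the paper may differ; shapes assume a standard ViT with a CLS token.

```python
import torch

def attention_guided_cam(attn_maps, attn_grads, num_patches):
    """Aggregate gradient-weighted, sigmoid-normalized self-attention into a patch map.
    attn_maps / attn_grads: lists of (heads, N, N) tensors, one per transformer block,
    where N = 1 (CLS) + num_patches.  This is a simplified aggregation rule."""
    cam = torch.zeros(num_patches)
    for attn, grad in zip(attn_maps, attn_grads):
        guided = torch.sigmoid(attn) * torch.relu(grad)   # guide gradients by attention
        # attention drawn by the CLS query from each patch, averaged over heads
        cam = cam + guided.mean(dim=0)[0, 1:]
    side = int(num_patches ** 0.5)
    cam = torch.relu(cam).reshape(side, side)
    return cam / cam.max().clamp_min(1e-8)

# Toy usage: 12 blocks, 12 heads, 14x14 patches (a ViT-B/16 on 224x224 inputs).
maps = [torch.rand(12, 197, 197) for _ in range(12)]
grads = [torch.randn(12, 197, 197) for _ in range(12)]
heatmap = attention_guided_cam(maps, grads, num_patches=196)   # (14, 14)
```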
2402.04504 Report Text2Street: Controllable Text-to-image Generation for Street Views Jinming Su, Songen Gu, Yiting Duan, Xingyue Chen, Junfeng Luo Text-to-image generation has made remarkable progress with the emergence of diffusion models. However, it is still a difficult task to generate images for street views based on text, mainly because the road topology of street scenes is complex, the traffic status is diverse and the weather condition is various, which makes conventional text-to-image models difficult to deal with. To address these challenges, we propose a novel controllable text-to-image framework, named Text2Street. In the framework, we first introduce the lane-aware road topology generator, which achieves text-to-map generation with the accurate road structure and lane lines armed with the counting adapter, realizing the controllable road topology generation. Then, the position-based object layout generator is proposed to obtain text-to-layout generation through an object-level bounding box diffusion strategy, realizing the controllable traffic object layout generation. Finally, the multiple control image generator is designed to integrate the road topology, object layout and weather description to realize controllable street-view image generation. Extensive experiments show that the proposed approach achieves controllable street-view text-to-image generation and validates the effectiveness of the Text2Street framework for street views. Proposes Text2Street, a controllable text-to-image generation framework for street views that controls road topology, traffic status, and weather conditions using text descriptions. Street-view image generation is valuable for autonomous driving perception and map construction, but existing methods struggle with complex road topology, diverse traffic status, and various weather conditions. Utilizes three main components: (1) Lane-aware road topology generator (LRTG) creates a local semantic map with lane lines conforming to traffic regulations. (2) Position-based object layout generator (POLG) generates traffic object layout based on text descriptions of object quantity, adhering to traffic rules. (3) Multiple control image generator (MCIG) integrates road topology, object layout, and weather descriptions to produce the final street-view image. Outperforms state-of-the-art methods in both image fidelity and attribute-level accuracy on nuScenes dataset. Demonstrates superior controllability in generating images with varying road structures, lane lines, traffic objects, and weather conditions. Generated images improve the performance of downstream tasks like object detection. Relies on fixed camera parameters for image projection, limiting viewpoint diversity. Further exploration of using generated images for other autonomous driving tasks is needed. text-to-image generation, street view synthesis, controllable image generation, autonomous driving, diffusion models
2402.04492 Report ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation Jirayu Burapacheep, Ishan Gaur, Agam Bhatia, Tristan Thrush This paper introduces the ColorSwap dataset, designed to assess and improve the proficiency of multimodal models in matching objects with their colors. The dataset is comprised of 2,000 unique image-caption pairs, grouped into 1,000 examples. Each example includes a caption-image pair, along with a "color-swapped" pair. We follow the Winoground schema: the two captions in an example have the same words, but the color words have been rearranged to modify different objects. The dataset was created through a novel blend of automated caption and image generation with humans in the loop. We evaluate image-text matching (ITM) and visual language models (VLMs) and find that even the latest ones are still not robust at this task. GPT-4V and LLaVA score 72% and 42% on our main VLM metric, although they may improve with more advanced prompting techniques. On the main ITM metric, contrastive models such as CLIP and SigLIP perform close to chance (at 12% and 30%, respectively), although the non-contrastive BLIP ITM model is stronger (87%). We also find that finetuning on fewer than 2,000 examples yields significant performance gains on this out-of-distribution word-order understanding task. The dataset is here: https://github.com/Top34051/colorswap. The paper introduces ColorSwap, a dataset of 2,000 image-caption pairs designed to assess the ability of multimodal models to match objects with their colors, focusing on understanding word order in captions. This is important because despite advancements in multimodal models, they still struggle with compositional understanding, particularly in tasks involving word order, which is crucial for tasks like AI-generated art. The dataset was created using a combination of automated caption and image generation (using GPT-4, Claude-2, Stable Diffusion, Midjourney, and DALL-E 3) and human review for quality control and caption refinement. Even the latest models like GPT-4V make significant errors on the ColorSwap dataset, highlighting their limitations in color composition understanding. Contrastive models (CLIP, SigLIP) struggle significantly compared to non-contrastive models (BLIP) on this task. Fine-tuning on the ColorSwap dataset significantly improves the performance of CLIP and BLIP, demonstrating their capacity to learn word order understanding from a small, focused dataset. The dataset focuses on color-object associations, which is a simplification of the broader word order understanding problem. The study primarily focuses on evaluating existing models, and future work could explore novel architectures or training methods specifically designed to address the limitations highlighted by ColorSwap. multimodal models, word order understanding, compositional reasoning, image-text matching, dataset
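Since ColorSwap follows the Winoground schema, the usual text/image/group scores can be sketched directly; `score(c, i)` is a placeholder for any image-text matcher (CLIP similarity, a BLIP ITM head, or a prompted VLM), and the toy matcher below only illustrates the bookkeeping.

```python
def colorswap_scores(score, c0, i0, c1, i1):
    """Winoground-style metrics for one example: (c0, i0) is the original
    caption-image pair and (c1, i1) the color-swapped pair.  `score(c, i)`
    is any image-text matching score."""
    text_ok = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
    image_ok = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}

# Toy matcher: counts color-object phrases that appear in the caption.
def toy_score(caption, image_tags):
    return sum(1 for tag in image_tags if tag in caption)

ex = colorswap_scores(
    toy_score,
    c0="a red ball and a blue cube", i0=["red ball", "blue cube"],
    c1="a blue ball and a red cube", i1=["blue ball", "red cube"])
print(ex)   # {'text': True, 'image': True, 'group': True}
```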
2402.04324 Report ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, Wenhu Chen Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence. A grand challenge in I2V generation is to maintain visual consistency throughout the video: existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame, as well as ensure a fluid and logical progression within the video narrative. To mitigate these issues, we propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation. Specifically, we introduce (1) spatiotemporal attention over the first frame to maintain spatial and motion consistency, (2) noise initialization from the low-frequency band of the first frame to enhance layout consistency. These two approaches enable ConsistI2V to generate highly consistent videos. We also extend the proposed approaches to show their potential to improve consistency in auto-regressive long video generation and camera motion control. To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation. Our automatic and human evaluation results demonstrate the superiority of ConsistI2V over existing methods. This paper introduces a novel approach for image-to-video (I2V) generation that enhances video quality and consistency by leveraging spatiotemporal first frame conditioning mechanisms and FrameInit. Existing I2V generation methods struggle to maintain appearance and motion consistency in generated video sequences. This paper addresses those challenges. The authors propose spatiotemporal first frame conditioning to leverage both spatial and temporal information from the first frame. They further stabilize the generated video and reduce abrupt changes by integrating FrameInit during inference. The proposed method significantly outperforms existing open-sourced I2V generation models on benchmark datasets UCF-101 and MSR-VTT in quantitative metrics including FVD, IS, FID and CLIPSIM. The method achieves state-of-the-art performance on the I2V-Bench, demonstrating its capability in generating high-quality and consistent videos. Human evaluation confirms that the proposed method generates videos with superior appearance and motion consistency compared to other baselines. The model is primarily trained on WebVid-10M, which may limit its generalization ability to videos with unseen domains or styles. Future work can explore incorporating large language models to enable more complex and controllable video generation. image-to-video generation, video consistency, diffusion models, frameinit, spatiotemporal conditioning
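A sketch of the noise-initialization idea: the low-frequency band of the repeated first-frame latent is blended with the high-frequency band of fresh Gaussian noise. The ideal low-pass filter and cutoff below are assumptions rather than ConsistI2V's exact FrameInit parameters.

```python
import torch
import torch.fft as fft

def frameinit_noise(first_frame_latent, num_frames, cutoff=0.25):
    """Blend the low-frequency band of the repeated first-frame latent with the
    high-frequency band of fresh Gaussian noise (FrameInit-style initialization;
    the exact filter and cutoff are assumptions)."""
    _, h, w = first_frame_latent.shape
    video = first_frame_latent.unsqueeze(0).repeat(num_frames, 1, 1, 1)  # (T, C, H, W)
    noise = torch.randn_like(video)

    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    lowpass = (torch.sqrt(xx ** 2 + yy ** 2) <= cutoff).float()

    def split(x):
        spec = fft.fftshift(fft.fft2(x), dim=(-2, -1))
        low = fft.ifft2(fft.ifftshift(spec * lowpass, dim=(-2, -1))).real
        return low, x - low

    video_low, _ = split(video)      # layout information from the first frame
    _, noise_high = split(noise)     # high-frequency randomness for new content
    return video_low + noise_high

init = frameinit_noise(torch.randn(4, 32, 32), num_frames=16)   # (16, 4, 32, 32)
```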
2402.04252 Report EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18-billion parameters. With only 6-billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5-billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe a consistent performance improvement with the model size scaling of EVA-CLIP, despite maintaining a constant training dataset of 2-billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. With our model weights made publicly available, we hope to facilitate future research in vision and multimodal foundation models. This paper introduces EVA-CLIP-18B, the largest open-source CLIP model to date, with 18 billion parameters, achieving state-of-the-art zero-shot performance on various image and video classification benchmarks. Scaling up CLIP models is crucial for enhancing visual and multimodal understanding, bridging the gap between vision models and large language models. The authors leverage a weak-to-strong vision scaling approach, pre-training a large EVA model as the vision encoder initialization for EVA-CLIP and scaling up the model size progressively. EVA-CLIP-18B achieves 80.7% average zero-shot top-1 accuracy on 27 image classification benchmarks, outperforming previous open-source CLIP models. The model demonstrates significant improvements in zero-shot video classification, surpassing other models by a large margin. Scaling up EVA-CLIP consistently enhances performance with no sign of saturation, suggesting potential for further vision model scaling. The training dataset, while large, is smaller than some used in other state-of-the-art CLIP models. Future work can explore larger and more diverse datasets to further improve performance and generalization. clip, multimodal learning, vision scaling, zero-shot learning, image classification
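For context, the zero-shot classification protocol such CLIP models are evaluated with can be sketched as follows; the encoders here are placeholders, and a real run would load the released EVA-CLIP weights and tokenizer instead.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_features, class_names, encode_text,
                       templates=("a photo of a {}.",)):
    """Standard CLIP-style zero-shot protocol: embed each class via prompt
    templates, average, normalize, then pick the most similar class per image."""
    class_embs = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        emb = F.normalize(encode_text(prompts), dim=-1).mean(dim=0)
        class_embs.append(F.normalize(emb, dim=0))
    class_embs = torch.stack(class_embs)                     # (K, D)
    logits = F.normalize(image_features, dim=-1) @ class_embs.T
    return logits.argmax(dim=-1)

# Placeholder encoders standing in for the released model.
encode_text = lambda prompts: torch.randn(len(prompts), 512)
preds = zero_shot_classify(torch.randn(8, 512), ["cat", "dog", "car"], encode_text)
```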
2402.04236 Report CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures in meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems (e.g., marking, zoom in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) with results (e.g., boxes, image) actively without involving external tools, while also allowing users to trace error causes. We study the roadmap to implement this mechanism, including (1) a flexible design of manipulations upon extensive analysis, (2) an efficient automated data generation pipeline, (3) a compatible VLM architecture capable of multi-turn multi-image, and (4) a model training process for versatile capabilities. With the design, we also manually annotate 6K high-quality samples for the challenging graphical mathematical problems. Our trained model, CogCoM, equipped with this mechanism with 17B parameters achieves state-of-the-art performance across 9 benchmarks from 4 categories, demonstrating the effectiveness while preserving the interpretability. Our code, model weights, and collected data are publicly available at https://github.com/THUDM/CogCoM. This paper introduces Chain of Manipulations (CoM), a mechanism that enables Vision-Language Models (VLMs) to solve problems step-by-step with evidence by actively manipulating visual inputs. Existing VLMs often ignore essential visual reasoning steps, leading to failures in meticulous visual problems and unfaithful responses. CoM addresses this by mimicking human-like problem-solving with visual evidence. The paper proposes (1) a flexible CoM data structure, (2) an automated data generation pipeline using LLMs and VFMs, (3) a memory-based multi-turn multi-image VLM architecture, and (4) a training process incorporating CoM data. CogCoM achieves state-of-the-art performance on 9 benchmarks across 4 categories (detailed VQA, visual grounding, general multimodal capabilities, hallucination). Significant accuracy improvements are observed on detailed VQA and grounding benchmarks (up to 9.0 and 1.09 points, respectively). CogCoM produces informative reasoning content without significant time overhead compared to baseline models. The diversity of linguistic solving steps and accuracy of visual tools are limited, leading to negative reasoning paths. Re-inputting manipulated images with hard prompts causes speed losses, which can be improved by implementing manipulations in vector space. vision-language models, visual reasoning, chain of manipulations, multimodal understanding, data augmentation
2402.04009 Report Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning Ningyuan Tang, Minghao Fu, Ke Zhu, Jianxin Wu In finetuning a large pretrained model to downstream tasks, parameter-efficient fine-tuning (PEFT) methods can effectively finetune pretrained models with few trainable parameters, but suffer from high GPU memory consumption and slow training speed. Because learnable parameters from these methods are entangled with the pretrained model, gradients related to the frozen pretrained model's parameters have to be computed and stored during finetuning. We propose Low-rank Attention Side-Tuning (LAST), which disentangles the trainable module from the pretrained model by freezing not only parameters but also outputs of the pretrained network. LAST trains a side-network composed of only low-rank self-attention modules. By viewing the pretrained model as a frozen feature extractor, the side-network takes intermediate output from the pretrained model and focuses on learning task-specific knowledge. We also show that LAST can be highly parallel across multiple optimization objectives, making it very efficient in downstream task adaptation, for example, in finding optimal hyperparameters. LAST outperforms previous state-of-the-art methods on VTAB-1K and other visual adaptation tasks with roughly only 30% of GPU memory footprint and 60% of training time compared to existing PEFT methods, but achieves significantly higher accuracy. This paper proposes LAST (Low-rank Attention Side-Tuning), a novel parameter-efficient fine-tuning (PEFT) method that disentangles trainable parameters from the pretrained model by freezing both parameters and outputs of the pretrained network, leading to lower GPU memory consumption and faster training speed. Existing PEFT methods, though effective in fine-tuning pretrained models with few trainable parameters, suffer from high GPU memory consumption and slow training speed due to entanglement of trainable parameters and the frozen model. LAST introduces a side-network comprised of low-rank self-attention modules that operate on intermediate outputs of the frozen pretrained model, focusing on learning task-specific knowledge without modifying the pretrained model's parameters or computation graph. LAST outperforms state-of-the-art PEFT methods on the VTAB-1K benchmark, achieving higher accuracy with significantly reduced GPU memory footprint (around 30% of other methods) and training time (around 60%). The study demonstrates the surprising effectiveness of low-rank self-attention with very low dimensionality for downstream vision tasks, challenging the necessity of large feed-forward networks in side-tuning. LAST's architecture enables highly efficient parallel training, facilitating hyperparameter search by allowing simultaneous fine-tuning of multiple models with different hyperparameter sets. One limitation is the lack of convenient transferability of LAST to other backbone networks beyond Transformers. Future work includes extending LAST to other model architectures like ResNet and DenseNet, as well as to different visual adaptation tasks like object detection and image generation. parameter-efficient fine-tuning, side-tuning, vision transformers, low-rank attention, parallel training
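A hedged sketch of the side-tuning idea: a stack of very low-rank self-attention modules consumes detached intermediate outputs of a frozen backbone, so no gradients are stored for the pretrained model. The fusion by simple addition, the rank, and the pooling are assumptions, not the exact LAST design.

```python
import torch
import torch.nn as nn

class LowRankSelfAttention(nn.Module):
    """Self-attention with a very low internal dimension r << d, used as a
    side-module over frozen backbone features (a sketch of the LAST idea)."""
    def __init__(self, dim, rank=16, heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(rank, heads, batch_first=True)
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)

    def forward(self, x):                      # x: (B, N, dim)
        z = self.down(x)
        z, _ = self.attn(z, z, z)
        return x + self.up(z)

class LASTSideNetwork(nn.Module):
    """Side network: consumes detached intermediate outputs of a frozen backbone,
    so no gradients flow through (or are stored for) the pretrained model."""
    def __init__(self, dim, num_blocks, rank=16):
        super().__init__()
        self.blocks = nn.ModuleList(LowRankSelfAttention(dim, rank)
                                    for _ in range(num_blocks))

    def forward(self, backbone_feats):         # list of (B, N, dim), one per stage
        h = torch.zeros_like(backbone_feats[0])
        for block, feat in zip(self.blocks, backbone_feats):
            h = block(h + feat.detach())       # detach: backbone outputs stay frozen
        return h.mean(dim=1)                   # pooled feature for a linear head

feats = [torch.randn(2, 197, 768) for _ in range(12)]
out = LASTSideNetwork(dim=768, num_blocks=12)(feats)   # (2, 768)
```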
2402.03908 Report EscherNet: A Generative Model for Scalable View Synthesis Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, Andrew J. Davison We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis -- it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis, but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single, cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet. Introduces EscherNet, a multi-view conditioned diffusion model for view synthesis that allows precise camera control and generalization across synthetic and real-world images. Existing view synthesis methods are often scene-specific or limited in handling varying input information. EscherNet addresses these limitations by learning implicit 3D representations and accommodating varying levels of input information. EscherNet leverages a transformer architecture with a specialized camera positional encoding (CaPE) to capture relationships between reference and target views, enabling consistent and scalable view synthesis. Significantly outperforms existing 3D diffusion models in view synthesis quality on GSO and RTMV datasets. Generates plausible novel views in a zero-shot manner on NeRF Synthetic dataset, outperforming scene-specific methods with limited reference views. Achieves superior 3D reconstruction quality compared to other image-to-3D generative models on GSO dataset. Current implementation is limited to a 3 DoF setting due to training dataset constraints. Autoregressive generation, while faster, leads to degraded quality due to content drifting. view synthesis, diffusion models, 3d reconstruction, camera positional encoding, generative modeling
2402.03766 Report MobileVLM V2: Faster and Stronger Baseline for Vision Language Model Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen We introduce MobileVLM V2, a family of significantly improved vision language models upon MobileVLM, which proves that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation can substantially benefit VLMs' performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM . Introduces MobileVLM V2, a family of significantly improved vision language models for mobile scenarios, achieving state-of-the-art performance with faster inference speed. Enabling capable vision language models on real-world applications like mobile devices, self-driving cars, and embodied AI systems. Leverages novel architectural design (lightweight downsample projector LDPv2), improved training scheme (training projector and language model throughout), and high-quality dataset curation (ShareGPT4V, ScienceQA, TextVQA, SBU, etc.). MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. MobileVLM V2 3B model outperforms a large variety of VLMs at the 7B+ scale. Demonstrates lower inference latency than counterparts on NVIDIA AGX Jetson Orin platform. Exploring even more powerful small language models based on open-source datasets. Investigating methods for effectively utilizing high-resolution input for tasks involving small objects. vision language models, mobile ai, efficient deep learning, multimodal learning, computer vision
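A rough sketch of what a lightweight downsample projector can look like: patch tokens are projected into the LLM embedding space and reduced in count with a depthwise stride-2 convolution. The specific layers and the downsampling ratio are assumptions and do not reproduce the exact LDPv2 design.

```python
import torch
import torch.nn as nn

class LightweightDownsampleProjector(nn.Module):
    """Sketch of an LDP-style projector: project ViT patch tokens into the LLM
    embedding space and cut the token count 4x with a depthwise stride-2 conv.
    (Layer choices are assumptions, not the exact LDPv2 design.)"""
    def __init__(self, vis_dim=1024, llm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))
        self.down = nn.Conv2d(llm_dim, llm_dim, kernel_size=2, stride=2,
                              groups=llm_dim)          # depthwise 2x downsampling

    def forward(self, tokens):                          # (B, N, vis_dim), N a square number
        B, N, _ = tokens.shape
        side = int(N ** 0.5)
        x = self.proj(tokens)                           # (B, N, llm_dim)
        x = x.transpose(1, 2).reshape(B, -1, side, side)
        x = self.down(x)                                # (B, llm_dim, side/2, side/2)
        return x.flatten(2).transpose(1, 2)             # (B, N/4, llm_dim)

tok = torch.randn(2, 576, 1024)                          # e.g. 24x24 patch tokens
out = LightweightDownsampleProjector()(tok)              # (2, 144, 2048)
```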
2402.03723 Report Rig3DGS: Creating Controllable Portraits from Casual Monocular Videos Alfredo Rivero, ShahRukh Athar, Zhixin Shu, Dimitris Samaras Creating controllable 3D human portraits from casual smartphone videos is highly desirable due to their immense value in AR/VR applications. The recent development of 3D Gaussian Splatting (3DGS) has shown improvements in rendering quality and training efficiency. However, it still remains a challenge to accurately model and disentangle head movements and facial expressions from a single-view capture to achieve high-quality renderings. In this paper, we introduce Rig3DGS to address this challenge. We represent the entire scene, including the dynamic subject, using a set of 3D Gaussians in a canonical space. Using a set of control signals, such as head pose and expressions, we transform them to the 3D space with learned deformations to generate the desired rendering. Our key innovation is a carefully designed deformation method which is guided by a learnable prior derived from a 3D morphable model. This approach is highly efficient in training and effective in controlling facial expressions, head positions, and view synthesis across various captures. We demonstrate the effectiveness of our learned deformation through extensive quantitative and qualitative experiments. The project page can be found at http://shahrukhathar.github.io/2024/02/05/Rig3DGS.html Introduces Rig3DGS, a method for creating reanimatable 3D human portraits with controllable facial expressions and head pose from monocular phone videos. Creating such controllable portraits from casual videos is highly desirable for AR/VR applications but challenging due to the need to accurately disentangle facial deformations from head movements in single-view captures. Represents the scene using 3D Gaussians in a canonical space, deformed by a learned prior based on a 3D morphable model (FLAME). This deformation is guided by predicted weights for each Gaussian, determined by its proximity to vertices on the FLAME mesh. Achieves higher-quality renderings than prior work (RigNeRF, INSTA, PointAvatar) with greater fidelity to facial expressions and head poses. Demonstrates successful novel view synthesis of the entire scene while maintaining high fidelity to the target expression and head pose. Shows that the learnable deformation prior is crucial for generalization to novel expressions and head poses compared to fixed priors or no prior. Limitations include an inability to model strong non-uniform illumination. Requires the subject to remain relatively still during capture. Future work will address these limitations. 3d human reconstruction, neural rendering, 3d gaussian splatting, facial expression control, novel view synthesis
2402.03445 Report Denoising Diffusion via Image-Based Rendering Titas Anciukevičius, Fabian Manhardt, Federico Tombari, Paul Henderson Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks or depths. This supports 3D reconstruction and generation in a unified architecture. Third, we develop a principled approach to avoid trivial 3D solutions when integrating the image-based rendering with the diffusion model, by dropping out representations of some images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction. The paper introduces GIBR, the first denoising diffusion model capable of generating and reconstructing large-scale, detailed 3D scenes from 2D images. Existing 3D scene generation methods struggle with real-world scenes due to limitations in representing large, detailed scenes, reliance on scarce 3D datasets, and difficulty in sampling from complex scene distributions. GIBR uses a novel image-based 3D scene representation (IB-planes) that adapts its capacity based on image details. It employs a multi-view denoising diffusion framework with a 3D-consistent denoising mechanism and dropout of neural representations during training to prevent trivial solutions. GIBR outperforms baselines in 3D reconstruction from single and multiple images, generating plausible details in unobserved regions. GIBR successfully generates coherent and detailed 3D scenes unconditionally, demonstrating its ability to learn a strong prior over 3D scenes from 2D images. Ablation studies confirm the importance of key design choices such as IB-planes, representation dropout, and cross-view attention. The model currently assumes static scenes and does not handle dynamic elements. Despite approximations, training GIBR remains computationally demanding compared to 2D diffusion models due to volumetric rendering. 3d scene generation, denoising diffusion models, image-based rendering, multi-view reconstruction, neural scene representation
2402.03328 Report Visual Enumeration is Challenging for Large-scale Generative AI Alberto Testolin, Kuinan Hou, Marco Zorzi Humans can readily judge the number of objects in a visual scene, even without counting, and such a skill has been documented in many animal species and babies prior to language development and formal schooling. Numerical judgments are error-free for small sets, while for larger collections responses become approximate, with variability increasing proportionally to the target number. This response pattern is observed for items of all kinds, despite variation in object features (such as color or shape), suggesting that our visual number sense relies on abstract representations of numerosity. Here, we investigate whether large-scale generative Artificial Intelligence (AI) systems have a human-like number sense, which should allow them to reliably name the number of objects in simple visual stimuli or generate images containing a target number of items in the 1-10 range. Surprisingly, most of the foundation models considered have a poor number sense: They make striking errors even with small numbers, the response variability does not increase in a systematic way, and the pattern of errors depends on object category. Only the most recent proprietary systems exhibit signatures of a visual number sense. Our findings demonstrate that having an intuitive visual understanding of number remains challenging for foundation models, which in turn might be detrimental to the perceptual grounding of numeracy that in humans is crucial for mathematical learning. The paper investigates whether large-scale generative AI systems possess a human-like number sense, enabling them to accurately judge and generate images with specific numbers of objects. Understanding visual numerosity is crucial for mathematical learning in humans, and this study aims to assess if advanced AI models exhibit similar capabilities. The researchers tested several foundation models, including ViLT, BLIP-2, GPT-4V, Gemini, Stable Diffusion, and DALL-E, using numerosity naming (identifying the number of objects in an image) and numerosity production (generating images with a target number of objects) tasks. Most foundation models struggle with visual numerosity, even for small numbers, indicating a limited number sense compared to humans. Only the most recent models, GPT-4V and DALL-E 3, show signs of human-like number sense, exhibiting subitizing for small numbers and sometimes following Weber's law for larger numbers. Many models exhibit response variability that does not align with the psychophysics of human numerosity perception, suggesting a lack of abstract numerical understanding. The study primarily focused on a limited numerical range (1-10) and specific object categories. The closed-source nature of proprietary models (GPT-4V, DALL-E, Gemini) limits insights into the underlying mechanisms of their numerosity processing. foundation models, machine vision, numerical cognition, deep learning, generative ai
2402.03327 Report Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models Dingning Liu, Xiaoshui Huang, Yuenan Hou, Zhihui Wang, Zhenfei Yin, Yongshun Gong, Peng Gao, Wanli Ouyang In this paper, we introduce Uni3D-LLM, a unified framework that leverages a Large Language Model (LLM) to integrate tasks of 3D perception, generation, and editing within point cloud scenes. This framework empowers users to effortlessly generate and modify objects at specified locations within a scene, guided by the versatility of natural language descriptions. Uni3D-LLM harnesses the expressive power of natural language to allow for precise command over the generation and editing of 3D objects, thereby significantly enhancing operational flexibility and controllability. By mapping point cloud into the unified representation space, Uni3D-LLM achieves cross-application functionality, enabling the seamless execution of a wide array of tasks, ranging from the accurate instantiation of 3D objects to the diverse requirements of interactive design. Through a comprehensive suite of rigorous experiments, the efficacy of Uni3D-LLM in the comprehension, generation, and editing of point cloud has been validated. Additionally, we have assessed the impact of integrating a point cloud perception module on the generation and editing processes, confirming the substantial potential of our approach for practical applications. Uni3D-LLM, a novel unified framework that leverages a Large Language Model (LLM) to integrate tasks of 3D perception, generation, and editing within point cloud scenes, allowing for precise, language-guided manipulation of 3D objects. Existing methods for integrating LLMs into 3D scene processing suffer from limitations such as inaccurate spatial understanding, occlusion issues, and lack of scene-level alignment, highlighting the importance of a unified framework for enhanced efficiency and collaborative work. The framework aligns point cloud and image data with text using modality-specific projectors, maps LLM semantic features to a generation model (DreamGaussian), and enables iterative 3D model editing using InstructPix2Pix. Uni3D-LLM effectively performs grounding tasks by incorporating image features as spatial assistance, overcoming limitations of using point cloud data alone. Adding object-level image information significantly improves object classification accuracy, emphasizing the importance of multi-modal data. Introducing a perception module (Lora) does not negatively impact generation, showcasing the synergistic effects of combining multiple 3D tasks. Enhancing the positioning capability of point clouds for improved accuracy. Addressing limitations inherited from DreamGaussian and InstructPix2Pix, such as generating large-scale scenes and performing freeform editing. 3d perception, point cloud generation, 3d object editing, large language models (llms), multimodal learning
2402.03310 Report V-IRL: Grounding Virtual Intelligence in Real Life Jihan Yang, Runyu Ding, Ellis Brown, Xiaojuan Qi, Saining Xie There is a sensory gulf between the Earth that humans inhabit and the digital realms in which modern AI agents are created. To develop AI agents that can sense, think, and act as flexibly as humans in real-world settings, it is imperative to bridge the realism gap between the digital and physical worlds. How can we embody agents in an environment as rich and diverse as the one we inhabit, without the constraints imposed by real hardware and control? Towards this end, we introduce V-IRL: a platform that enables agents to scalably interact with the real world in a virtual yet realistic environment. Our platform serves as a playground for developing agents that can accomplish various practical tasks and as a vast testbed for measuring progress in capabilities spanning perception, decision-making, and interaction with real-world data across the entire globe. This paper introduces V-IRL, a platform that enables AI agents to interact with the real world using a virtual replica built from real-world geospatial and street-view data. Existing AI agents often lack grounding in the sensory richness of the real world. V-IRL bridges this realism gap, paving the way for agents that can effectively sense, think, and act in real-world scenarios. V-IRL leverages the Google Maps Platform to create a navigable virtual environment. Agents interact with this environment through various components, including modules for geolocation, street-view imagery, movement, mapping, place information retrieval, vision (perception), and language (reasoning & collaboration). Open-world vision models exhibit significant biases towards frequently observed place types, highlighting the need for more diverse and representative data. Scaling model size significantly improves performance on both place recognition and visual question answering tasks, emphasizing the importance of model capacity. Vision models are particularly challenged in non-English speaking regions, indicating potential linguistic biases in existing models. The current version of V-IRL primarily focuses on navigation and place recognition tasks, and could be extended to encompass a broader range of real-world interactions, such as object manipulation. Further research is required to mitigate model biases and enhance robustness to noisy visual observations, particularly for complex tasks like vision-language navigation. ai agents, embodied ai, vision-language navigation, open-world vision, global benchmarks
2402.03307 Report 4D Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, Baoquan Chen We consider the problem of novel view synthesis (NVS) for dynamic scenes. Recent neural approaches have accomplished exceptional NVS results for static 3D scenes, but extensions to 4D time-varying scenes remain non-trivial. Prior efforts often encode dynamics by learning a canonical space plus implicit or explicit deformation fields, which struggle in challenging scenarios like sudden movements or capturing high-fidelity renderings. In this paper, we introduce 4D Gaussian Splatting (4DGS), a novel method that represents dynamic scenes with anisotropic 4D XYZT Gaussians, inspired by the success of 3D Gaussian Splatting in static scenes. We model dynamics at each timestamp by temporally slicing the 4D Gaussians, which naturally compose dynamic 3D Gaussians and can be seamlessly projected into images. As an explicit spatial-temporal representation, 4DGS demonstrates powerful capabilities for modeling complicated dynamics and fine details, especially for scenes with abrupt motions. We further implement our temporal slicing and splatting techniques in a highly optimized CUDA acceleration framework, achieving real-time inference rendering speeds of up to 277 FPS on an RTX 3090 GPU and 583 FPS on an RTX 4090 GPU. Rigorous evaluations on scenes with diverse motions showcase the superior efficiency and effectiveness of 4DGS, which consistently outperforms existing methods both quantitatively and qualitatively. This paper presents 4D Gaussian Splatting, a novel approach for novel view synthesis of dynamic scenes, by representing them using anisotropic 4D Gaussians. Efficient and accurate novel view synthesis for dynamic scenes is crucial for various applications but remains challenging due to the complexities of the temporal dimension and diverse motion patterns. The method models dynamics by temporally slicing 4D Gaussians, which naturally compose dynamic 3D Gaussians, and utilizes a highly optimized CUDA acceleration framework for real-time rendering speeds. The method achieves state-of-the-art rendering quality, outperforming prior arts in PSNR and SSIM metrics on both Plenoptic Video and D-NeRF datasets. It achieves unprecedented rendering speed of up to 277 FPS on an RTX 3090 GPU and 583 FPS on an RTX 4090 GPU, significantly surpassing previous methods. The proposed entropy and 4D consistency losses are shown to effectively improve rendering quality by reducing floaters and enhancing motion consistency. While the method effectively reduces artifacts like floaters and inconsistent motions, challenges remain in constraining 4D Gaussians due to increased dimensions. Future work includes exploring the use of 4D Gaussians for downstream tasks such as tracking and dynamic scene generation. novel view synthesis, dynamic scenes, 4d gaussian splatting, real-time rendering, cuda acceleration
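Temporally slicing an anisotropic XYZT Gaussian at a timestamp is the standard Gaussian conditioning formula, sketched below in NumPy; using the temporal marginal p(t) to modulate opacity is an illustrative assumption rather than the paper's exact weighting.

```python
import numpy as np

def slice_4d_gaussian(mu, cov, t):
    """Condition an anisotropic XYZT Gaussian on time t (standard conditional
    Gaussian): returns the 3D mean, 3D covariance, and the temporal marginal
    density p(t), which can modulate the Gaussian's opacity at that timestamp."""
    mu_x, mu_t = mu[:3], mu[3]
    S_xx = cov[:3, :3]            # spatial block
    S_xt = cov[:3, 3:4]           # space-time coupling
    S_tt = cov[3, 3]              # temporal variance
    mean_3d = mu_x + (S_xt * ((t - mu_t) / S_tt)).ravel()
    cov_3d = S_xx - (S_xt @ S_xt.T) / S_tt
    p_t = np.exp(-0.5 * (t - mu_t) ** 2 / S_tt) / np.sqrt(2 * np.pi * S_tt)
    return mean_3d, cov_3d, p_t

# Toy 4D Gaussian whose centre drifts along x over time (positive x-t covariance).
mu = np.array([0.0, 0.0, 0.0, 0.5])
cov = np.diag([0.04, 0.04, 0.04, 0.01]).astype(float)
cov[0, 3] = cov[3, 0] = 0.015
mean_3d, cov_3d, p_t = slice_4d_gaussian(mu, cov, t=0.7)
```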
2402.03302 Report Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining Jiarun Liu, Hao Yang, Hong-Yu Zhou, Yan Xi, Lequan Yu, Yizhou Yu, Yong Liang, Guangming Shi, Shaoting Zhang, Hairong Zheng, Shanshan Wang Accurate medical image segmentation demands the integration of multi-scale information, spanning from local features to global dependencies. However, it is challenging for existing methods to model long-range global information, where convolutional neural networks (CNNs) are constrained by their local receptive fields, and vision transformers (ViTs) suffer from high quadratic complexity of their attention mechanism. Recently, Mamba-based models have gained great attention for their impressive ability in long sequence modeling. Several studies have demonstrated that these models can outperform popular vision models in various tasks, offering higher accuracy, lower memory consumption, and less computational burden. However, existing Mamba-based models are mostly trained from scratch and do not explore the power of pretraining, which has been proven to be quite effective for data-efficient medical image analysis. This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks, leveraging the advantages of ImageNet-based pretraining. Our experimental results reveal the vital role of ImageNet-based training in enhancing the performance of Mamba-based models. Swin-UMamba demonstrates superior performance with a large margin compared to CNNs, ViTs, and latest Mamba-based models. Notably, on AbdomenMRI, Encoscopy, and Microscopy datasets, Swin-UMamba outperforms its closest counterpart U-Mamba_Enc by an average score of 2.72%. This paper introduces Swin-UMamba, a novel Mamba-based UNet model for medical image segmentation that leverages ImageNet-based pretraining. Accurate medical image segmentation requires efficient modeling of long-range dependencies, which remains a challenge for existing CNN and ViT models. Mamba-based models offer a promising solution, but their potential with pretraining in medical image segmentation is underexplored. The authors designed Swin-UMamba to integrate a Mamba-based encoder pretrained on ImageNet with a UNet-like decoder. They also proposed a variant, Swin-UMamba†, with a Mamba-based decoder for efficiency. Experiments were conducted on AbdomenMRI, Endoscopy, and Microscopy datasets. Swin-UMamba significantly outperformed CNN, ViT, and existing Mamba-based models on all datasets. ImageNet-based pretraining substantially improved performance, especially on smaller datasets, highlighting its importance for Mamba-based models. Swin-UMamba† achieved competitive results with fewer parameters and lower FLOPs, demonstrating the potential of Mamba in resource-constrained settings. The study focused on 2D segmentation and may not generalize directly to 3D medical images. Further hyperparameter tuning and exploration of different pretraining strategies could potentially improve performance. medical image segmentation, imagenet pretraining, mamba, long-range dependency modeling, unet
2402.03290 Report InstanceDiffusion: Instance-level Control for Image Generation Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bounding boxes or intricate instance segmentation masks, and combinations thereof. We propose three major changes to text-to-image models that enable precise instance-level control. Our UniFusion block enables instance-level conditions for text-to-image models, the ScaleU block improves image fidelity, and our Multi-instance Sampler improves generations for multiple instances. InstanceDiffusion significantly surpasses specialized state-of-the-art models for each location condition. Notably, on the COCO dataset, we outperform previous state-of-the-art by 20.4% box AP50 for box inputs, and 25.4% IoU for mask inputs. InstanceDiffusion enables precise instance-level control for text-to-image generation by allowing users to specify the location and textual description of each instance in an image. Existing text-to-image generation models lack control over individual instances within an image. This limits their use in applications that require fine-grained control over image composition, such as design or data generation. The authors introduce InstanceDiffusion, a model built on top of a frozen pretrained text-to-image diffusion model. It leverages three key components: - **UniFusion block:** Projects various instance location formats (points, scribbles, boxes, masks) into a unified feature space and fuses them with the visual features of the diffusion model. - **ScaleU block:** Improves the model's ability to adhere to specified instance locations by dynamically rescaling the skip connection and main features in the UNet. - **Multi-instance Sampler (MIS):** Reduces information leakage and confusion between multiple instance conditions during inference. InstanceDiffusion significantly outperforms previous state-of-the-art methods specialized for specific instance conditions on COCO and LVIS datasets. The model exhibits superior attribute binding capability, accurately reflecting instance colors and textures specified in the input prompts. Using multiple location formats simultaneously improves the model's fidelity to instance locations, leading to better image generation results. The generation quality of smaller objects shows a noticeable gap compared to larger objects. Texture binding remains challenging for all tested methods, including InstanceDiffusion. text-to-image generation, instance-level control, diffusion models, location conditioning, attribute binding
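A simplified stand-in for the ScaleU idea: learnable per-channel scales are applied to the UNet's main (backbone) features and skip-connection features before concatenation. The paper's block is described as dynamic, so this static-scale version is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class ScaleULike(nn.Module):
    """Learnable per-channel rescaling of a UNet decoder's main (backbone)
    features and skip-connection features before they are concatenated.
    A simplified stand-in for ScaleU; the exact dynamic formulation is not
    reproduced here."""
    def __init__(self, main_channels, skip_channels):
        super().__init__()
        # Initialized at 1 so the block starts as an identity.
        self.main_scale = nn.Parameter(torch.ones(main_channels))
        self.skip_scale = nn.Parameter(torch.ones(skip_channels))

    def forward(self, main_feat, skip_feat):   # (B, Cm, H, W), (B, Cs, H, W)
        main_feat = main_feat * self.main_scale.view(1, -1, 1, 1)
        skip_feat = skip_feat * self.skip_scale.view(1, -1, 1, 1)
        return torch.cat([main_feat, skip_feat], dim=1)

block = ScaleULike(main_channels=320, skip_channels=320)
fused = block(torch.randn(1, 320, 32, 32), torch.randn(1, 320, 32, 32))
```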
2402.03286 Report Training-Free Consistent Text-to-Image Generation Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects. This paper proposes ConsiStory, a training-free method for generating consistent subjects across multiple images with diverse prompts using pre-trained text-to-image diffusion models. Maintaining visual consistency of subjects across different images generated from varying text prompts is crucial for various applications like storytelling, virtual asset design, and synthetic data creation. Existing methods have limitations such as requiring per-subject training, struggling with multi-subject consistency, or compromising prompt alignment. ConsiStory leverages internal feature representations of diffusion models to align generated images during denoising. It employs subject-driven self-attention to share subject-specific information, incorporates vanilla query features and attention dropout for layout diversity, and utilizes feature injection for fine-grained consistency. ConsiStory achieves state-of-the-art performance on subject consistency and text alignment without requiring any training. It significantly outperforms existing methods in terms of speed, being approximately 20 times faster. The method can be extended to multi-subject scenarios and training-free personalization for common objects. ConsiStory's performance depends on the accuracy of object localization through cross-attention maps, which can be imperfect for unusual styles. The method struggles to disentangle appearance and style, limiting consistent generation to images sharing the same style. text-to-image synthesis, consistent image generation, diffusion models, self-attention, training-free methods
2402.03251 Report CLIP Can Understand Depth Dunam Kim, Seokju Lee Recent studies on generalizing CLIP for monocular depth estimation reveal that CLIP pre-trained on web-crawled data is inefficient for deriving proper similarities between image patches and depth-related prompts. In this paper, we adapt CLIP for meaningful quality of monocular depth estimation with dense prediction, without fine-tuning its original vision-language alignment. By jointly training a compact deconvolutional decoder with a tiny learnable embedding matrix named mirror, as a static prompt for its text encoder, CLIP is enabled to understand depth. With this approach, our model exhibits impressive performance matching several previous state-of-the-art vision-only models on the NYU Depth v2 and KITTI datasets, outperforming every CLIP-based depth estimation model with a large margin. Experiments on temporal depth consistency and spatial continuity demonstrate that the prior knowledge of CLIP can be effectively refined by our proposed framework. Furthermore, an ablation study on mirror proves that the resulting model estimates depth utilizing knowledge not only from the image encoder but also text encoder despite not being given any prompt written in a human way. This research demonstrates that through minimal adjustments, the prior knowledge of vision-language foundation models, such as CLIP, can be generalized even to domains where learning during pretraining is challenging. We facilitate future works focused on methods to adjust suboptimal prior knowledge of vision-language models using non-human language prompts, achieving performance on par with task-specific state-of-the-art methodologies. This paper introduces CLIP2Depth, a framework that adapts a pretrained and frozen CLIP model for monocular dense depth estimation using non-human language supervision. This research is important because it demonstrates that pretrained vision-language models like CLIP can be effectively generalized to complex domains, such as depth estimation, without requiring direct fine-tuning. The authors jointly train a compact deconvolutional decoder with a learnable embedding matrix named *mirror*. *Mirror* acts as a non-human language prompt, conditioning the CLIP text encoder to understand depth. CLIP2Depth outperforms all previous CLIP-based depth estimation models on NYU Depth v2 and KITTI datasets. The model achieves performance comparable to state-of-the-art vision-only models while preserving CLIP's task-agnostic characteristics. Ablation studies validate the effectiveness of *mirror* and the overall design choices of the framework. The model exhibits a performance gap between NYU Depth v2 and KITTI, suggesting room for improvement in generalizing to unseen domains. Further exploration is needed to better understand and leverage the correlation between human and AI knowledge systems through non-human language prompts. depth estimation, clip, vision-language models, non-human language prompts, prompt learning
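The "mirror" described above is essentially a learnable prompt fed to the frozen CLIP text encoder. A minimal sketch, assuming a text encoder interface that accepts embedding inputs directly and illustrative token counts and dimensions (not the paper's exact configuration):

```python
# Hedged sketch of the "mirror" idea: a small learnable token matrix replaces
# any human-written prompt and is passed through the frozen CLIP text encoder;
# the resulting text feature then conditions a compact deconvolutional decoder
# over frozen image features. The encoder interface below is a placeholder.
import torch
import torch.nn as nn

class MirrorPrompt(nn.Module):
    def __init__(self, num_tokens=32, embed_dim=512):
        super().__init__()
        # The only trainable "prompt": no words, just a small embedding matrix.
        self.tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, frozen_text_encoder):
        # frozen_text_encoder is a placeholder callable assumed to accept
        # embedding inputs directly; its weights stay frozen, so gradients
        # reach only self.tokens (and the depth decoder elsewhere).
        return frozen_text_encoder(self.tokens.unsqueeze(0))   # pooled text feature
```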
2402.03246 Report SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM Mingrui Li, Shuhong Liu, Heng Zhou, Guohao Zhu, Na Cheng, Tianchen Deng, Hongyu Wang We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities. SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting, incorporates appearance, geometry, and semantic features for enhanced scene understanding and object-level geometry. Addresses limitations of neural implicit SLAM systems, such as oversmoothing, by leveraging the speed and direct gradient flow of Gaussian Splatting for high-quality rendering, scene understanding, and object-level geometry. Utilizes multi-channel optimization with appearance, depth, and semantic information. Employs semantic-guided keyframe selection to improve map reconstruction accuracy. Achieves state-of-the-art performance in camera pose estimation, surpassing baselines in ATE RMSE by up to 34%. Delivers high-fidelity dense map reconstruction, outperforming baselines in PSNR by a margin of 10dB. Provides highly accurate 3D semantic segmentation, exceeding NeRF-based methods by over 10% in mIoU. Relies on depth and 2D semantic input, limiting performance in environments where this data is scarce. Faces challenges with high memory consumption in large-scale scenes. slam, 3d reconstruction, semantic segmentation, gaussian splatting, scene understanding
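A minimal sketch of the multi-channel optimization idea, assuming rendered color, depth, and semantic channels are compared against the corresponding inputs with placeholder loss weights (the paper's exact terms and weighting may differ):

```python
# Hedged sketch of a multi-channel mapping objective: color, depth, and
# semantic renderings from the Gaussian map are each supervised, with the
# semantic term compensating where color/depth are uninformative.
import torch
import torch.nn.functional as F

def mapping_loss(render, target, w_color=1.0, w_depth=0.5, w_sem=0.5):
    # render/target are dicts of tensors; keys and weights are placeholders.
    color = (render["color"] - target["color"]).abs().mean()
    depth = (render["depth"] - target["depth"]).abs().mean()
    sem = F.cross_entropy(render["sem_logits"], target["sem_labels"])
    return w_color * color + w_depth * depth + w_sem * sem
```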
2402.03241 Report FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretraining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions. To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, the residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. Project page: https://visual-ai.github.io/froster. This paper introduces FROSTER, a novel framework for open-vocabulary action recognition that enhances the adaptation of the CLIP model to video data while preserving its generalization capabilities. Applying CLIP to open-vocabulary action recognition is challenging due to CLIP's training on image-text pairs, which lacks temporal information, leading to suboptimal performance on unseen actions. FROSTER employs residual feature distillation using a frozen CLIP model as a teacher to guide a student model. This approach balances video-specific learning with the retention of CLIP's generalizability. FROSTER consistently outperforms previous state-of-the-art methods on both base-to-novel and cross-dataset action recognition benchmarks. The residual feature distillation approach effectively balances the learning of video-specific features while preserving generalization abilities. FROSTER demonstrates compatibility with various network architectures, highlighting its adaptability and effectiveness. The model's performance on fine-grained action datasets like SSv2 suggests room for improvement in capturing temporal dynamics. Exploring more sophisticated text augmentation techniques to further enhance action understanding is an area for future work. action recognition, open vocabulary, clip, knowledge distillation, generalizability
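A hedged sketch of the residual feature distillation idea: the tuned video feature passes through a small residual sub-network before being pulled toward the frozen CLIP feature, alongside the usual task loss. The layer sizes, distance function, and loss weight below are illustrative assumptions:

```python
# Residual feature distillation sketch: the identity path preserves
# video-specific signal while the distillation term keeps features close to
# the frozen CLIP teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualProjector(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, feat):
        return feat + self.mlp(feat)            # residual sub-network

def froster_style_loss(student_feat, frozen_clip_feat, logits, labels,
                       projector, alpha=1.0):
    distill = F.mse_loss(projector(student_feat), frozen_clip_feat.detach())
    task = F.cross_entropy(logits, labels)
    return task + alpha * distill
```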
2402.03214 Report Organic or Diffused: Can We Distinguish Human Art from AI-generated Images? Anna Yoo Jeong Ha, Josephine Passananti, Ronik Bhaskar, Shawn Shan, Reid Southen, Haitao Zheng, Ben Y. Zhao The advent of generative AI images has completely disrupted the art world. Distinguishing AI generated images from human art is a challenging problem whose impact is growing over time. A failure to address this problem allows bad actors to defraud individuals paying a premium for human art and companies whose stated policies forbid AI imagery. It is also critical for content owners to establish copyright, and for model trainers interested in curating training data in order to avoid potential model collapse. There are several different approaches to distinguishing human art from AI images, including classifiers trained by supervised learning, research tools targeting diffusion models, and identification by professional artists using their knowledge of artistic techniques. In this paper, we seek to understand how well these approaches can perform against today's modern generative models in both benign and adversarial settings. We curate real human art across 7 styles, generate matching images from 5 generative models, and apply 8 detectors (5 automated detectors and 3 different human groups including 180 crowdworkers, 4000+ professional artists, and 13 expert artists experienced at detecting AI). Both Hive and expert artists do very well, but make mistakes in different ways (Hive is weaker against adversarial perturbations while Expert artists produce higher false positives). We believe these weaknesses will remain as models continue to evolve, and use our data to demonstrate why a combined team of human and automated detectors provides the best combination of accuracy and robustness. This paper explores the effectiveness of different methods for distinguishing between human-created art and AI-generated images. Identifying AI-generated images is crucial for art authenticity, copyright, preventing fraud, and ensuring the quality of AI model training data. The study uses a dataset of human and AI art across 7 styles, tested with 5 automated detectors (Hive, Optic, Illuminarty, DIRE, DE-FAKE), and 3 human groups (crowdworkers, artists, experts) under various adversarial conditions. Hive exhibits the highest accuracy (98%) among all detectors but struggles with adversarially perturbed images, especially those processed with Glaze. Human experts outperform machines in judging Glazed images, leveraging domain knowledge and detecting subtle artistic inconsistencies missed by AI models. Combining human experts and automated detectors, such as Hive, yields a more robust and accurate detection approach, particularly against adversarial examples. Limited number of expert artists and artworks used in the study, potentially impacting the generalizability of results. Rapid evolution of AI image generation models necessitates constant updates to detectors and the evaluation framework. ai-generated art, image detection, human perception, adversarial machine learning, art authentication
2402.03161 Report Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is both capable of comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models will be available at https://video-lavit.github.io. This paper introduces Video-LaVIT, a multimodal pre-training method for unified comprehension and generation of videos, images, and language using Large Language Models (LLMs). Existing multimodal LLMs struggle to effectively encode video data due to the computational cost of capturing complex spatiotemporal dynamics. Video-LaVIT decomposes videos into keyframes and motion vectors. It employs a novel video tokenizer to represent these components as discrete tokens, enabling unified pre-training with LLMs. A video detokenizer then maps generated tokens back into the continuous pixel space for video generation. Video-LaVIT achieves state-of-the-art results on various image and video understanding benchmarks. It demonstrates competitive performance on text-to-video and image-to-video generation tasks. The method supports long video generation by progressively decoding multiple short clips while maintaining temporal consistency. The model's limited context window restricts direct processing of very long videos. Training cost remains high, limiting scalability to massive video datasets. multimodal learning, large language models, video understanding, video generation, motion tokenization
2402.03119 Report Good Teachers Explain: Explanation-Enhanced Knowledge Distillation Amin Parchami-Araghi, Moritz Böhle, Sukrut Rao, Bernt Schiele Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models. While it is well known that student models can achieve similar accuracies as the teachers, it has also been shown that they nonetheless often do not learn the same function. It is, however, often highly desirable that the student's and teacher's functions share similar properties such as basing the prediction on the same input features, as this ensures that students learn the 'right features' from the teachers. In this work, we explore whether this can be achieved by not only optimizing the classic KD loss but also the similarity of the explanations generated by the teacher and the student. Despite the idea being simple and intuitive, we find that our proposed 'explanation-enhanced' KD (e$^2$KD) (1) consistently provides large gains in terms of accuracy and student-teacher agreement, (2) ensures that the student learns from the teacher to be right for the right reasons and to give similar explanations, and (3) is robust with respect to the model architectures, the amount of training data, and even works with 'approximate', pre-computed explanations. The paper proposes explanation-enhanced knowledge distillation (e^2KD), which encourages student models to not only match teacher model logits but also their explanations (e.g., GradCAM, B-cos), thereby improving distillation fidelity. Faithful knowledge distillation is crucial for ensuring student models learn the same function and reasoning process as their teachers, leading to improved generalization, robustness to distribution shifts, and interpretability. e^2KD introduces an explanation similarity loss term alongside the traditional KD loss. This term minimizes the difference between teacher and student explanations, typically using cosine similarity on GradCAM or B-cos explanations. e^2KD significantly boosts student accuracy and agreement with teachers, especially with limited training data (e.g., ImageNet with 50 shots). Students trained with e^2KD learn to rely on the 'right' features, exhibiting improved robustness to distribution shifts (demonstrated on the Waterbirds dataset). e^2KD effectively transfers desirable explanation properties from teachers to students, including architectural priors (shown by distilling CNN to ViT, resulting in shift-invariant explanations). The computational cost of e^2KD is higher than vanilla KD due to the additional explanation computations (mitigated by using 'frozen' pre-computed explanations). The effectiveness of e^2KD relies on the quality and faithfulness of the chosen explanation method. knowledge distillation, explainable ai (xai), model compression, distribution shift, model interpretability
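A minimal sketch of an e$^2$KD-style objective, assuming a KL distillation term plus a cosine-similarity term between teacher and student explanation heatmaps (e.g., GradCAM); the temperature, weighting, and explanation extractor are placeholders rather than the paper's exact settings:

```python
# Explanation-enhanced KD sketch: classic soft-label distillation plus a term
# that maximizes cosine similarity between flattened explanation maps.
import torch
import torch.nn.functional as F

def e2kd_loss(student_logits, teacher_logits,
              student_expl, teacher_expl,
              temperature=2.0, lambda_expl=1.0):
    """student_expl / teacher_expl: (B, H, W) explanation heatmaps."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    s = student_expl.flatten(1)
    t = teacher_expl.flatten(1)
    expl = 1.0 - F.cosine_similarity(s, t, dim=1).mean()

    return kd + lambda_expl * expl
```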
2402.03040 Report InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions Yiyuan Zhang, Yuhao Kang, Zhixin Zhang, Xiaohan Ding, Sanyuan Zhao, Xiangyu Yue We introduce $\textit{InteractiveVideo}$, a user-centric framework for video generation. Different from traditional generative approaches that operate based on user-provided images or text, our framework is designed for dynamic interaction, allowing users to instruct the generative model through various intuitive mechanisms during the whole generation process, e.g. text and image prompts, painting, drag-and-drop, etc. We propose a Synergistic Multimodal Instruction mechanism, designed to seamlessly integrate users' multimodal instructions into generative models, thus facilitating a cooperative and responsive interaction between user inputs and the generative process. This approach enables iterative and fine-grained refinement of the generation result through precise and effective user instructions. With $\textit{InteractiveVideo}$, users are given the flexibility to meticulously tailor key aspects of a video. They can paint the reference image, edit semantics, and adjust video motions until their requirements are fully met. Code, models, and demo are available at https://github.com/invictus717/InteractiveVideo Presents "InteractiveVideo", a user-centric framework for video generation that allows users to iteratively control and refine the generation process using multimodal instructions, such as text prompts, image editing, and motion trajectories. Existing video generation models, relying on image and text inputs, often fail to fully capture user intentions and offer limited control over the generated content, particularly in terms of complex motion and dynamic scenes. The framework utilizes two generative pipelines (text-to-image and image-to-video) based on latent diffusion models. It incorporates user interactions (e.g., painting, dragging) as denoising residuals to influence the video denoising process, enabling fine-grained control over video elements. Allows for personalization of video content by adding or animating objects absent in the original reference image. Enables fine-grained video editing, including regional semantic changes like color and appearance modifications. Exhibits precise motion control, demonstrated through large motion control, precise gesture control, and multi-object motion control. Ensuring accessibility and intuitive usability across diverse user groups. Maintaining computational efficiency amidst dynamic and diverse user inputs. video generation, interactive ai, multimodal instructions, user-centric design, diffusion models
2402.02972 Report Retrieval-Augmented Score Distillation for Text-to-3D Generation Junyoung Seo, Susung Hong, Wooseok Jang, Inès Hyeonsu Kim, Minseop Kwak, Doyup Lee, Seungryong Kim Text-to-3D generation has achieved significant success by incorporating powerful 2D diffusion models, but insufficient 3D prior knowledge also leads to the inconsistency of 3D geometry. Recently, since large-scale multi-view datasets have been released, fine-tuning the diffusion model on the multi-view datasets becomes a mainstream to solve the 3D inconsistency problem. However, it has confronted with fundamental difficulties regarding the limited quality and diversity of 3D data, compared with 2D data. To sidestep these trade-offs, we explore a retrieval-augmented approach tailored for score distillation, dubbed ReDream. We postulate that both expressiveness of 2D diffusion models and geometric consistency of 3D assets can be fully leveraged by employing the semantically relevant assets directly within the optimization process. To this end, we introduce novel framework for retrieval-based quality enhancement in text-to-3D generation. We leverage the retrieved asset to incorporate its geometric prior in the variational objective and adapt the diffusion model's 2D prior toward view consistency, achieving drastic improvements in both geometry and fidelity of generated scenes. We conduct extensive experiments to demonstrate that ReDream exhibits superior quality with increased geometric consistency. Project page is available at https://ku-cvlab.github.io/ReDream/. This paper proposes ReDream, a retrieval-augmented score distillation framework for text-to-3D generation that enhances the quality and geometric consistency of generated 3D scenes. Existing text-to-3D methods struggle with geometric inconsistencies or limited fidelity due to insufficient 3D training data. ReDream addresses this by leveraging the strengths of both 2D diffusion models and 3D assets. ReDream retrieves semantically relevant 3D assets and uses them in two ways: 1) Initializing the variational distribution of the 3D scene to incorporate geometric priors. 2) Lightweight adaptation of the 2D diffusion model for improved view consistency. ReDream generates high-quality 3D scenes with improved geometric consistency compared to previous text-to-3D methods. The method allows for flexible control over the generation process, influenced by both text prompts and retrieved assets. Quantitative and qualitative evaluations, including a user study, demonstrate the effectiveness of ReDream over existing approaches. The generation process, while faster than the baseline, is still time-consuming compared to methods focused on fast inference. The ability to handle complex text prompts is limited by the capabilities of the underlying 2D diffusion model. text-to-3d generation, score distillation sampling, retrieval-augmented generation, 3d consistency, variational inference
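A small sketch of the retrieval step, assuming precomputed CLIP embeddings for the candidate 3D assets and a CLIP embedding of the text prompt; the top-scoring assets would then initialize the variational 3D representation:

```python
# Retrieval sketch: rank candidate 3D assets by cosine similarity to the
# prompt in CLIP space. How embeddings are produced is left as an assumption.
import numpy as np

def retrieve_assets(prompt_embed, asset_embeds, k=3):
    """prompt_embed: (D,); asset_embeds: (N, D) precomputed embeddings."""
    sims = asset_embeds @ prompt_embed / (
        np.linalg.norm(asset_embeds, axis=1) * np.linalg.norm(prompt_embed) + 1e-8)
    return np.argsort(-sims)[:k]   # indices of the top-k candidate assets
```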
2402.02906 Report ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis Bernard Spiegl, Andrea Perin, Stéphane Deny, Alexander Ilin Deep learning is providing a wealth of new approaches to the old problem of novel view synthesis, from Neural Radiance Field (NeRF) based approaches to end-to-end style architectures. Each approach offers specific strengths but also comes with specific limitations in their applicability. This work introduces ViewFusion, a state-of-the-art end-to-end generative approach to novel view synthesis with unparalleled flexibility. ViewFusion consists in simultaneously applying a diffusion denoising step to any number of input views of a scene, then combining the noise gradients obtained for each view with an (inferred) pixel-weighting mask, ensuring that for each region of the target scene only the most informative input views are taken into account. Our approach resolves several limitations of previous approaches by (1) being trainable and generalizing across multiple scenes and object classes, (2) adaptively taking in a variable number of pose-free views at both train and test time, (3) generating plausible views even in severely undetermined conditions (thanks to its generative nature) -- all while generating views of quality on par or even better than state-of-the-art methods. Limitations include not generating a 3D embedding of the scene, resulting in a relatively slow inference speed, and our method only being tested on the relatively small dataset NMR. Code is available. Introduces ViewFusion, a flexible and pose-free generative approach for novel view synthesis using composable diffusion models with a novel weighting scheme to adaptively handle an arbitrary number of input views. Existing novel view synthesis methods often require expensive per-scene retraining, struggle with pose-free inputs, or cannot adapt to a variable number of input views at test time. This work aims to address these limitations. The method utilizes a composable diffusion probabilistic framework where each input view is processed through identical U-Net streams. A learned weighting mechanism infers the importance of each view at each denoising step, composing the final prediction. The approach is trained end-to-end on a dataset of multiple object classes and input view poses. Achieves state-of-the-art or near state-of-the-art performance on NMR dataset across different metrics. Demonstrates flexibility by handling variable input view counts, generalizing across classes, producing plausible views for occluded regions, and maintaining 3D consistency in autoregressive generation. Shows the model's ability to adaptively shift weighting based on the informativeness of input views for a given target view. Currently lacks explicit incorporation of 3D semantics, potentially limiting its performance on out-of-distribution scenes. Inference speed scales linearly with the number of input views, which can be computationally expensive for a large number of views or high-resolution images. novel view synthesis, diffusion models, generative models, pose-free, composable
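A minimal sketch of the composable denoising step, assuming placeholder denoiser and weight-network callables: per-view noise predictions are combined with inferred per-pixel weights (softmax over views):

```python
# Composable denoising sketch: run the same denoiser per conditioning view,
# then blend noise predictions with inferred per-pixel weights.
import torch

def composed_noise(denoiser, weight_net, x_t, t, input_views):
    """x_t: (B, C, H, W) noisy target; input_views: list of conditioning views."""
    eps_per_view, logit_per_view = [], []
    for v in input_views:
        eps_per_view.append(denoiser(x_t, t, v))      # (B, C, H, W) noise prediction
        logit_per_view.append(weight_net(x_t, t, v))  # (B, 1, H, W) unnormalized weight
    weights = torch.softmax(torch.stack(logit_per_view, dim=0), dim=0)  # over views
    return (weights * torch.stack(eps_per_view, dim=0)).sum(dim=0)
```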
2402.02887 Report Time-, Memory- and Parameter-Efficient Visual Adaptation Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. They, however, typically still require backpropagating gradients throughout the model, meaning that their training-time and -memory cost does not reduce as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone. As a result, our method is efficient not only in terms of parameters, but also in training-time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark, and we further show how we outperform prior works with respect to training-time and -memory usage too. We further demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification, without any intricate model parallelism. Here, we outperform a prior adaptor-based method which could only scale to a 1 billion parameter backbone, or fully-finetuning a smaller backbone, with the same GPU and less training time. This paper proposes Low-Rank Side Adaptation (LoSA), an efficient adaptation method for large pre-trained models focusing on reducing training time and memory usage without backpropagating gradients through the entire model. Existing parameter-efficient adaptation methods mainly focus on reducing the number of trained parameters but still require significant training time and memory due to backpropagation through the entire model. LoSA introduces a lightweight parallel network operating on frozen activations from the pre-trained backbone model, refining features for the target task without backpropagating gradients through the backbone. LoSA achieves state-of-the-art accuracy-parameter trade-offs on the VTAB benchmark. LoSA demonstrates scalability by adapting a 4-billion parameter vision transformer for video classification, outperforming previous methods while using less memory and training time. Ablation studies validate the design choices of LoSA, such as the use of a low-rank mixer and the selection of backbone activations. The current work focuses on vision tasks. Further exploration is needed to extend its applicability to other domains. Future research can investigate the extension of LoSA to more complex vision tasks beyond image and video classification. parameter-efficient finetuning, vision transformers, adaptation methods, training efficiency, memory efficiency
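A minimal sketch of side adaptation on a frozen backbone, assuming per-layer low-rank mixers and a simple pooled classification head (illustrative, not the paper's exact design); backbone features are extracted under torch.no_grad(), so no gradients flow through the backbone:

```python
# Side-network sketch: only the lightweight parallel network is trained.
import torch
import torch.nn as nn

class LowRankMixer(nn.Module):
    def __init__(self, dim, rank=16):
        super().__init__()
        self.down, self.up = nn.Linear(dim, rank), nn.Linear(rank, dim)

    def forward(self, x):                       # x: (B, N, dim) token features
        return x + self.up(torch.relu(self.down(x)))

class SideNetwork(nn.Module):
    def __init__(self, dim, num_layers, num_classes):
        super().__init__()
        self.mixers = nn.ModuleList([LowRankMixer(dim) for _ in range(num_layers)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, backbone_feats):          # list of per-layer frozen activations
        h = torch.zeros_like(backbone_feats[0])
        for mixer, feat in zip(self.mixers, backbone_feats):
            h = mixer(h + feat)                 # fuse frozen features layer by layer
        return self.head(h.mean(dim=1))         # pool tokens, then classify

# Usage idea (hypothetical helper): extract intermediate backbone outputs
# under torch.no_grad() and pass the list to SideNetwork; only SideNetwork
# parameters receive gradients.
```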
2402.02800 Report Extreme Two-View Geometry From Object Poses with Diffusion Models Yujing Sun, Caiyi Sun, Yuan Liu, Yuexin Ma, Siu Ming Yiu Human has an incredible ability to effortlessly perceive the viewpoint difference between two images containing the same object, even when the viewpoint change is astonishingly vast with no co-visible regions in the images. This remarkable skill, however, has proven to be a challenge for existing camera pose estimation methods, which often fail when faced with large viewpoint differences due to the lack of overlapping local features for matching. In this paper, we aim to effectively harness the power of object priors to accurately determine two-view geometry in the face of extreme viewpoint changes. In our method, we first mathematically transform the relative camera pose estimation problem to an object pose estimation problem. Then, to estimate the object pose, we utilize the object priors learned from a diffusion model Zero123 to synthesize novel-view images of the object. The novel-view images are matched to determine the object pose and thus the two-view camera pose. In experiments, our method has demonstrated extraordinary robustness and resilience to large viewpoint changes, consistently estimating two-view poses with exceptional generalization ability across both synthetic and real-world datasets. Code will be available at https://github.com/scy639/Extreme-Two-View-Geometry-From-Object-Poses-with-Diffusion-Models. This paper introduces a novel algorithm leveraging object priors from diffusion models to estimate the relative camera pose of two images with extreme viewpoint changes, where traditional feature matching methods struggle due to minimal overlapping regions. Estimating relative camera poses for images with extreme viewpoint differences is crucial for applications like 3D reconstruction and augmented reality, but remains a challenge for existing feature-matching methods. The algorithm transforms the camera pose estimation into an object pose estimation problem. It utilizes a diffusion model (Zero123) to generate novel-view images of the co-visible object and estimates object poses for input and generated images. Finally, it matches an input image against the generated images to determine the relative camera pose. The method demonstrates superior accuracy in estimating relative camera poses compared to baseline feature matching and regression-based methods on GSO and Navi datasets. It shows robustness to in-plane rotations by incorporating in-plane rotation estimation in the pipeline. The method shows potential in improving visual odometry (VO) accuracy as demonstrated by an application example. The method may face challenges in accurately predicting poses for symmetrical objects. Future work could focus on improving the runtime of the algorithm. camera pose estimation, object pose estimation, diffusion models, extreme viewpoints, two-view geometry
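A hedged sketch of the matching loop, assuming placeholder callables for novel-view synthesis (e.g., a Zero123-style model) and for a feature-matching score; the viewpoint grid and scoring are illustrative, and in-plane rotation handling is omitted:

```python
# Pose-search sketch: synthesize candidate object views, score each against
# the second image, and keep the best viewpoint as the object pose.
import numpy as np

def estimate_relative_pose(image_b, synthesize_view, match_score,
                           azimuths=range(0, 360, 15),
                           elevations=range(-30, 60, 15)):
    best, best_pose = -np.inf, None
    for az in azimuths:
        for el in elevations:
            candidate = synthesize_view(azimuth=az, elevation=el)  # novel view of the object
            s = match_score(candidate, image_b)                    # e.g., inlier match count
            if s > best:
                best, best_pose = s, (az, el)
    return best_pose   # converted to a rotation matrix downstream
```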
2402.02705 Report Representation Surgery for Multi-Task Model Merging Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, Dacheng Tao Multi-task learning (MTL) compresses the information from multiple tasks into a unified backbone to improve computational efficiency and generalization. Recent work directly merges multiple independently trained models to perform MTL instead of collecting their raw data for joint training, greatly expanding the application scenarios of MTL. However, by visualizing the representation distribution of existing model merging schemes, we find that the merged model often suffers from the dilemma of representation bias. That is, there is a significant discrepancy in the representation distribution between the merged and individual models, resulting in poor performance of merged MTL. In this paper, we propose a representation surgery solution called "Surgery" to reduce representation bias in the merged model. Specifically, Surgery is a lightweight task-specific module that takes the representation of the merged model as input and attempts to output the biases contained in the representation from the merged model. We then designed an unsupervised optimization objective that updates the Surgery module by minimizing the distance between the merged model's representation and the individual model's representation. Extensive experiments demonstrate significant MTL performance improvements when our Surgery module is applied to state-of-the-art (SOTA) model merging schemes. This paper identifies and addresses the "representation bias" problem in multi-task model merging, where the merged model's representations differ from individually trained models, leading to performance degradation. It proposes a novel "representation surgery" approach to alleviate this issue. Model merging for multi-task learning (MTL) offers advantages over traditional MTL by enabling the combination of independently trained models without requiring access to raw training data. However, existing model merging methods often suffer from a performance gap compared to traditional MTL, hindering their effectiveness. The paper introduces "Surgery," a lightweight, task-specific module added after model merging. Surgery takes the merged model's representation as input and aims to minimize the distance between its output and the corresponding individual model's representation using an unsupervised objective. Representation bias is shown to exist across tasks, architectures, and merging methods, hindering performance. Surgery effectively reduces representation bias, leading to significant performance improvements across various model merging baselines. The proposed method is lightweight, requiring minimal additional parameters and training iterations. The current study focuses on ViT architectures. Exploring the effectiveness of representation surgery on other architectures is left for future work. Future work will investigate model merging from different architectures or initializations. model merging, multi-task learning, representation bias, representation surgery, unsupervised learning
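A hedged sketch of the Surgery idea: a lightweight task-specific module estimates and subtracts a bias from the merged model's representation, trained unsupervised to match the individual model's representation. The adapter shape and L1 distance below are assumptions:

```python
# Representation-surgery sketch: de-bias merged features toward the
# corresponding individual model's features without any labels.
import torch
import torch.nn as nn

class SurgeryModule(nn.Module):
    def __init__(self, dim, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, merged_feat):
        bias = self.up(torch.relu(self.down(merged_feat)))
        return merged_feat - bias               # de-biased representation

surgery = SurgeryModule(dim=512)
opt = torch.optim.Adam(surgery.parameters(), lr=1e-3)

def surgery_step(merged_feat, individual_feat):
    loss = (surgery(merged_feat) - individual_feat).abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```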
2402.02474 Report Deep Spectral Improvement for Unsupervised Image Instance Segmentation Farnoosh Arefi, Amir M. Mansourian, Shohreh Kasaei Deep spectral methods reframe the image decomposition process as a graph partitioning task by extracting features using self-supervised learning and utilizing the Laplacian of the affinity matrix to obtain eigensegments. However, instance segmentation has received less attention compared to other tasks within the context of deep spectral methods. This paper addresses the fact that not all channels of the feature map extracted from a self-supervised backbone contain sufficient information for instance segmentation purposes. In fact, some channels are noisy and hinder the accuracy of the task. To overcome this issue, this paper proposes two channel reduction modules: Noise Channel Reduction (NCR) and Deviation-based Channel Reduction (DCR). The NCR retains channels with lower entropy, as they are less likely to be noisy, while DCR prunes channels with low standard deviation, as they lack sufficient information for effective instance segmentation. Furthermore, the paper demonstrates that the dot product, commonly used in deep spectral methods, is not suitable for instance segmentation due to its sensitivity to feature map values, potentially leading to incorrect instance segments. A new similarity metric called Bray-Curtis over Chebyshev (BoC) is proposed to address this issue. It takes into account the distribution of features in addition to their values, providing a more robust similarity measure for instance segmentation. Quantitative and qualitative results on the Youtube-VIS2019 dataset highlight the improvements achieved by the proposed channel reduction methods and the use of BoC instead of the conventional dot product for creating the affinity matrix. These improvements are observed in terms of mean Intersection over Union and extracted instance segments, demonstrating enhanced instance segmentation performance. The code is available on: https://github.com/farnooshar/SpecUnIIS This paper proposes two channel reduction modules, Noise Channel Reduction (NCR) and Deviation-based Channel Reduction (DCR), and a new similarity metric, Bray-Curtis over Chebyshev (BoC), to improve deep spectral methods for unsupervised image instance segmentation. Existing deep spectral methods for instance segmentation struggle because not all feature map channels from self-supervised backbones are informative, and the commonly used dot product for affinity matrix creation is sensitive to feature values and ignores feature distribution. NCR removes noisy channels based on entropy. DCR further reduces channels based on standard deviation to prioritize informative features for instance segmentation. BoC leverages Bray-Curtis and Chebyshev distances to consider both feature distribution and values in affinity matrix creation. NCR improved Fg-Bg segmentation F-score by up to 3% on YouTube-VIS2019 and PascalVOC 2012 datasets. BoC outperformed other similarity metrics, achieving 2% higher mIoU than the dot product on instance segmentation. The proposed method, combining NCR, DCR, and BoC, demonstrated robustness in handling occlusions and variations in object sizes. Exploring alternative channel reduction techniques beyond entropy and standard deviation. Investigating the integration of the proposed method within a supervised learning framework for potentially improved channel selection. deep spectral methods, image instance segmentation, self-supervised learning, unsupervised learning, transformer models
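A small sketch of how a Bray-Curtis over Chebyshev affinity could replace the dot product when building the affinity matrix; the exact normalization and the conversion from distance to similarity used in the paper may differ, so the formula below is an assumption:

```python
# BoC-style affinity sketch: combine Bray-Curtis and Chebyshev distances so
# that both feature distribution and magnitude influence the affinity.
import numpy as np

def boc_affinity(f1, f2, eps=1e-8):
    """f1, f2: 1-D feature vectors for two pixels/patches."""
    bray_curtis = np.abs(f1 - f2).sum() / (np.abs(f1 + f2).sum() + eps)
    chebyshev = np.abs(f1 - f2).max() + eps
    distance = bray_curtis / chebyshev
    return 1.0 / (1.0 + distance)        # map distance to a (0, 1] similarity

def boc_affinity_matrix(F):
    """F: (N, D) feature matrix; returns an (N, N) affinity used in place of F @ F.T."""
    N = F.shape[0]
    W = np.ones((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            W[i, j] = W[j, i] = boc_affinity(F[i], F[j])
    return W
```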
2402.02453 Report AI Art Neural Constellation: Revealing the Collective and Contrastive State of AI-Generated and Human Art Faizan Farooq Khan, Diana Kim, Divyansh Jha, Youssef Mohamed, Hanna H Chang, Ahmed Elgammal, Luba Elliott, Mohamed Elhoseiny Discovering the creative potentials of a random signal to various artistic expressions in aesthetic and conceptual richness is a ground for the recent success of generative machine learning as a way of art creation. To understand the new artistic medium better, we conduct a comprehensive analysis to position AI-generated art within the context of human art heritage. Our comparative analysis is based on an extensive dataset, dubbed "ArtConstellation," consisting of annotations about art principles, likability, and emotions for 6,000 WikiArt and 3,200 AI-generated artworks. After training various state-of-the-art generative models, art samples are produced and compared with WikiArt data on the last hidden layer of a deep-CNN trained for style classification. We actively examined the various art principles to interpret the neural representations and used them to drive the comparative knowledge about human and AI-generated art. A key finding in the semantic analysis is that AI-generated artworks are visually related to the principle concepts for modern period art made in 1800-2000. In addition, through Out-Of-Distribution (OOD) and In-Distribution (ID) detection in CLIP space, we find that AI-generated artworks are ID to human art when they depict landscapes and geometric abstract figures, while detected as OOD when the machine art consists of deformed and twisted figures. We observe that machine-generated art is uniquely characterized by incomplete and reduced figuration. Lastly, we conducted a human survey about emotional experience. Color composition and familiar subjects are the key factors of likability and emotions in art appreciation. We propose our whole methodologies and collected dataset as our analytical framework to contrast human and AI-generated art, which we refer to as "ArtNeuralConstellation". Code is available at: https://github.com/faixan-khan/ArtNeuralConstellation This paper presents "ArtNeuralConstellation," an analytical framework to contrast AI-generated and human art using art principles, time analysis, and emotional responses. With the rise of AI-generated art, understanding its differences and similarities to human-created art is crucial for appreciating this new medium and its place within art history. The authors analyze a dataset of 6,000 WikiArt and 3,200 AI-generated artworks, evaluating them based on Wölfflin's art principles, general art principles, out-of-distribution detection in CLIP space, time period similarity, and emotional responses from human surveys. AI-generated art leans towards visual concepts associated with modern art (1800-2000) and Baroque styles. Landscapes and geometric abstractions in AI art are often indistinguishable from human art, while deformed figures are identified as distinctly AI-generated. AI art evokes a diverse range of emotions comparable to human art, with likability linked to successful depictions of familiar subjects like landscapes and portraits. The study primarily focuses on AI art generated without human intervention, limiting insights into collaborative human-AI art creation. Future work can expand the analysis to non-Western art and explore additional art principles beyond those considered in this study. ai-generated art, art history, computational aesthetics, deep learning, emotional analysis
2402.02369 Report M$^3$Face: A Unified Multi-Modal Multilingual Framework for Human Face Generation and Editing Mohammadreza Mofayezi, Reza Alipour, Mohammad Ali Kakavand, Ehsaneddin Asgari Human face generation and editing represent an essential task in the era of computer vision and the digital world. Recent studies have shown remarkable progress in multi-modal face generation and editing, for instance, using face segmentation to guide image generation. However, it may be challenging for some users to create these conditioning modalities manually. Thus, we introduce M3Face, a unified multi-modal multilingual framework for controllable face generation and editing. This framework enables users to utilize only text input to generate controlling modalities automatically, for instance, semantic segmentation or facial landmarks, and subsequently generate face images. We conduct extensive qualitative and quantitative experiments to showcase our framework's face generation and editing capabilities. Additionally, we propose the M3CelebA Dataset, a large-scale multi-modal and multilingual face dataset containing high-quality images, semantic segmentations, facial landmarks, and different captions for each image in multiple languages. The code and the dataset will be released upon publication. M$^3$Face, a unified multi-modal multilingual framework for controllable face generation and editing, simplifies multi-modal generation by automatically creating conditioning modalities (e.g., semantic segmentation, facial landmarks) from text input. Existing multi-modal methods, while powerful, require manual creation of conditioning modalities, which is complex for users. M$^3$Face addresses this by automatically generating these modalities from text, enhancing user experience and accessibility. The framework uses a masked transformer model (Muse) to generate conditioning modalities from text. Then, ControlNet generates face images from these modalities. For editing, inpainting edits the modalities, followed by Imagic manipulation using the trained ControlNet models. M$^3$CelebA Dataset, a large-scale multi-modal and multilingual face dataset, is introduced to train and evaluate the framework. Generates realistic face images from both multi-modal conditions and text prompts, capturing intricate details like hair style, glasses, and emotions. Enables consistent and controllable face editing using text, masks, landmarks, or a combination thereof, surpassing baselines in preserving identity and adhering to target prompts. Outperforms existing methods in quantitative metrics like FID, CLIP Score, and directional CLIP similarity for both face generation and editing tasks. The quality of Muse-generated segmentation and landmarks can affect the final image quality. The performance relies heavily on the Stable Diffusion backbone used in ControlNet, and exploring more robust backbones could further enhance results. face generation, face editing, multi-modal generation, diffusion models, multilingual
2402.02352 Report Region-Based Representations Revisited Michal Shlapentokh-Rothman, Ansel Blume, Yao Xiao, Yuqun Wu, Sethuraman T V, Heyi Tao, Jae Yong Lee, Wilfredo Torres, Yu-Xiong Wang, Derek Hoiem We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong unsupervised representations like DINOv2 and used for a wide variety of tasks, including semantic segmentation, object-based image retrieval, and multi-image analysis. Once the masks and features are extracted, these representations, even with linear decoders, enable competitive performance, making them well suited to applications that require custom queries. The compactness of the representation also makes it well-suited to video analysis and other problems requiring inference across many images. This paper investigates the effectiveness of region-based representations for various recognition tasks, leveraging class-agnostic segmenters like SAM and strong self-supervised representations like DINOv2. Region-based representations offer advantages such as scalability, flexibility, and interpretability, enabling applications like custom image retrieval, interactive learning, and multi-image inference. The methodology involves generating regions using SAM and SLIC, extracting image features using DINOv2, pooling features within masks to create region representations, and employing these representations for semantic segmentation, object retrieval, and activity classification. Region-based representations with simple linear decoders achieve competitive performance on semantic segmentation, outperforming patch-based approaches. One-shot object-based image retrieval using region representations significantly surpasses single-token representations like DINOv2 and CLIP. Region-based representations prove beneficial for multi-frame activity classification, allowing for efficient processing of multiple frames and capturing temporal dynamics. The current speed of SAM for region generation can be a bottleneck for real-time applications. Future work could explore incorporating additional information into region features, such as human pose or optical flow, to further enhance their representational power for tasks like activity recognition. region-based representation, semantic segmentation, object retrieval, activity classification, self-supervised learning
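A minimal sketch of building region representations, assuming a dense patch-feature map (e.g., DINOv2) and binary class-agnostic masks (e.g., from SAM); masks are downsampled to the patch grid and features are average-pooled per region:

```python
# Region-pooling sketch: one feature vector per mask, suitable for a linear
# decoder (semantic segmentation) or nearest-neighbor retrieval.
import torch
import torch.nn.functional as F

def region_features(feat_map, masks):
    """feat_map: (C, h, w) patch features; masks: (R, H, W) binary masks."""
    m = F.interpolate(masks[None].float(), size=feat_map.shape[1:],
                      mode="nearest")[0]                  # (R, h, w) on the patch grid
    feats = feat_map.flatten(1)                           # (C, h*w)
    m = m.flatten(1)                                      # (R, h*w)
    pooled = (m @ feats.T) / m.sum(dim=1, keepdim=True).clamp(min=1.0)
    return pooled                                         # (R, C): one vector per region
```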
2402.02209 Report On the Exploitation of DCT-Traces in the Generative-AI Domain Orazio Pontorno, Luca Guarnera, Sebastiano Battiato Deepfakes represent one of the toughest challenges in the world of Cybersecurity and Digital Forensics, especially considering the high-quality results obtained with recent generative AI-based solutions. Almost all generative models leave unique traces in synthetic data that, if analyzed and identified in detail, can be exploited to improve the generalization limitations of existing deepfake detectors. In this paper we analyzed deepfake images in the frequency domain generated by both GAN and Diffusion Model engines, examining in detail the underlying statistical distribution of Discrete Cosine Transform (DCT) coefficients. Recognizing that not all coefficients contribute equally to image detection, we hypothesize the existence of a unique "discriminative fingerprint", embedded in specific combinations of coefficients. To identify them, Machine Learning classifiers were trained on various combinations of coefficients. In addition, the Explainable AI (XAI) LIME algorithm was used to search for intrinsic discriminative combinations of coefficients. Finally, we performed a robustness test to analyze the persistence of traces by applying JPEG compression. The experimental results reveal the existence of traces left by the generative models that are more discriminative and persistent at JPEG attacks. This paper presents an analysis of Discrete Cosine Transform (DCT) coefficients to identify unique traces left by GAN and Diffusion Model based deepfakes. Identifying these traces is crucial for improving deepfake detection methods and overcoming their generalization limitations. The authors analyze the statistical distribution of DCT coefficients, particularly the AC statistics (β^AC), from real, GAN-generated, and Diffusion Model-generated images. They use machine learning classifiers trained on different β^AC subsets and the XAI algorithm LIME to pinpoint the most discriminative coefficients. The persistence of these traces is further evaluated under JPEG compression. β^AC coefficients effectively distinguish between real, GAN, and Diffusion Model generated images. Specific subsets of β^AC coefficients, particularly those identified by LIME, exhibit high discriminative power. The discriminative power of high-frequency β^AC coefficients diminishes with JPEG compression, while low-frequency coefficients retain some discriminative ability. The study primarily focuses on low-resolution images. Further investigation is needed to explore the persistence of low-frequency β^AC traces under stronger compression and other image manipulations. deepfakes, multimedia forensics, synthetic traces, discrete cosine transform, explainable ai
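A hedged sketch of the pipeline, assuming the per-frequency AC statistic is estimated as the mean absolute DCT coefficient over 8x8 blocks (a common Laplacian-scale proxy; the paper's exact estimator may differ) and that a standard classifier is trained on a chosen subset of frequencies:

```python
# DCT-trace sketch: block DCT -> per-frequency AC statistics -> classifier on
# a selected coefficient subset (e.g., one suggested by LIME).
import numpy as np
from scipy.fft import dctn
from sklearn.linear_model import LogisticRegression

def beta_ac(gray_image):
    """gray_image: (H, W) array; returns 63 AC statistics (DC term dropped)."""
    H, W = gray_image.shape
    blocks = [dctn(gray_image[i:i + 8, j:j + 8], norm="ortho")
              for i in range(0, H - 7, 8) for j in range(0, W - 7, 8)]
    coeffs = np.stack(blocks).reshape(len(blocks), 64)[:, 1:]
    return np.abs(coeffs).mean(axis=0)          # one statistic per AC frequency

def train_detector(real_imgs, fake_imgs, selected=None):
    """real_imgs / fake_imgs: lists of (H, W) grayscale arrays."""
    X = np.stack([beta_ac(im) for im in real_imgs + fake_imgs])
    y = np.array([0] * len(real_imgs) + [1] * len(fake_imgs))
    if selected is not None:                    # restrict to chosen frequencies
        X = X[:, selected]
    return LogisticRegression(max_iter=1000).fit(X, y)
```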
2402.01950 Report ConRF: Zero-shot Stylization of 3D Scenes with Conditioned Radiation Fields Xingyu Miao, Yang Bai, Haoran Duan, Fan Wan, Yawen Huang, Yang Long, Yefeng Zheng Most of the existing works on arbitrary 3D NeRF style transfer required retraining on each single style condition. This work aims to achieve zero-shot controlled stylization in 3D scenes utilizing text or visual input as conditioning factors. We introduce ConRF, a novel method of zero-shot stylization. Specifically, due to the ambiguity of CLIP features, we employ a conversion process that maps the CLIP feature space to the style space of a pre-trained VGG network and then refine the CLIP multi-modal knowledge into a style transfer neural radiation field. Additionally, we use a 3D volumetric representation to perform local style transfer. By combining these operations, ConRF offers the capability to utilize either text or images as references, resulting in the generation of sequences with novel views enhanced by global or local stylization. Our experiment demonstrates that ConRF outperforms other existing methods for 3D scene and single-text stylization in terms of visual quality. ConRF: a novel NeRF-based method for zero-shot 3D scene stylization using text or image as a single reference. Existing 3D scene stylization methods are limited to known styles or require retraining for new styles. ConRF offers flexibility and control by enabling zero-shot transfer with either text or image references. ConRF leverages a pre-trained CLIP encoder to extract features and maps them to the style space of a pre-trained VGG network using a mapping network. It employs a 3D selection volume for localized style manipulation based on text prompts. ConRF achieves zero-shot 3D style transfer using single-text or single-image references. It outperforms existing methods in terms of visual quality and consistency across multiple views. ConRF allows for localized style transfer based on text prompts, enabling fine-grained control over stylization. The method's performance is limited by the capabilities of the pre-trained CLIP model, which may not always accurately capture subtle style nuances. The local style transfer is currently limited to face-forwarding scenes and may require further development for broader applicability. Future work will explore incorporating generative models for enhanced creative capabilities. nerf, style transfer, zero-shot learning, clip, 3d scene stylization
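A minimal sketch of the CLIP-to-style mapping idea, assuming a small MLP that turns a CLIP embedding (from either a style image or a text prompt) into AdaIN-style statistics; the architecture and target statistics are assumptions rather than the paper's exact design:

```python
# CLIP-to-VGG-style mapping sketch: the same mapper serves image or text
# embeddings, which is what enables zero-shot style references.
import torch
import torch.nn as nn

class CLIPToStyle(nn.Module):
    def __init__(self, clip_dim=512, style_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(clip_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, 2 * style_dim))

    def forward(self, clip_embed):
        mean, std = self.net(clip_embed).chunk(2, dim=-1)
        return mean, std    # statistics that condition the stylized radiance field

# Training would regress these statistics toward VGG features of style images;
# at test time a CLIP text embedding can be mapped the same way.
```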
2402.01832 Report SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training? Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLM), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images. Our code, trained models, and generated data are released at https://github.com/hammoudhasan/SynthCLIP This paper presents SynthCLIP, a novel framework for training CLIP models using entirely synthetic text-image pairs generated through a pipeline leveraging text-to-image networks and large language models. Training CLIP traditionally relies on large, web-scraped datasets that suffer from noise, imbalanced representation, and potential safety concerns. SynthCLIP addresses these limitations by offering a scalable, controlled, and safe data generation process. SynthCLIP utilizes a four-step process: (1) Concept-based caption generation using an LLM, (2) Caption filtering for balanced concept distribution, (3) Image generation from captions using a text-to-image model (Stable Diffusion), and (4) Standard CLIP training on the synthetic pairs. SynthCLIP, trained on a large-scale synthetic dataset (SynthCI-30M), achieves performance comparable to CLIP models trained on real datasets like CC12M. Scaling the size of the synthetic dataset significantly improves performance across various vision and vision-language tasks. The quality of captions significantly impacts performance, with synthetic captions generated through a multi-step process, potentially augmented with captioning models, showing the most promise. The current generation pipeline, while scalable, requires significant computational resources. Future work will explore optimizing resource usage and further improving caption quality and alignment with generated images. clip, diffusion models, vision-language models, synthetic data, generative networks
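A skeleton of the four-step generation pipeline as described, with all external callables (the LLM caption generator, the text-to-image model, CLIP training) left as placeholders:

```python
# SynthCLIP-style data pipeline sketch: captions per concept, balance filter,
# image synthesis, then standard CLIP training on the resulting pairs.
import random
from collections import defaultdict

def build_synthetic_dataset(concepts, generate_captions, text_to_image,
                            captions_per_concept=100):
    per_concept = defaultdict(list)
    for c in concepts:                                    # step 1: caption generation
        per_concept[c].extend(generate_captions(concept=c))
    captions = []
    for c, caps in per_concept.items():                   # step 2: balance concepts
        random.shuffle(caps)
        captions.extend(caps[:captions_per_concept])
    return [(text_to_image(cap), cap) for cap in captions]  # step 3: image synthesis

# Step 4: train CLIP with the usual contrastive objective on the returned pairs.
```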
2402.01590 Report NeuroCine: Decoding Vivid Video Sequences from Human Brain Activities Jingyuan Sun, Mingxiao Li, Zijiao Chen, Marie-Francine Moens In the pursuit to understand the intricacies of the human brain's visual processing, reconstructing dynamic visual experiences from brain activities emerges as a challenging yet fascinating endeavor. While recent advancements have achieved success in reconstructing static images from non-invasive brain recordings, the domain of translating continuous brain activities into video format remains underexplored. In this work, we introduce NeuroCine, a novel dual-phase framework targeting the inherent challenges of decoding fMRI data, such as noise, spatial redundancy, and temporal lags. This framework proposes spatial masking and temporal interpolation-based augmentation for contrastive learning of fMRI representations and a diffusion model enhanced by dependent prior noise for video generation. Tested on a publicly available fMRI dataset, our method shows promising results, outperforming the previous state-of-the-art models by a notable margin of 20.97%, 31.00% and 12.30% respectively on decoding the brain activities of three subjects in the fMRI dataset, as measured by SSIM. Additionally, our attention analysis suggests that the model aligns with existing brain structures and functions, indicating its biological plausibility and interpretability. Introduces NeuroCine, a novel dual-phase framework for reconstructing high-resolution videos from fMRI data, addressing challenges like noise, spatial redundancy, and temporal lags. Decoding dynamic visual experiences from brain activity is crucial for understanding visual processing and developing technologies for sensory impairments. Employs spatial masking and temporal interpolation for contrastive learning of fMRI representations, and a diffusion model with dependent prior noise for generating videos. Significantly outperforms previous state-of-the-art models in decoding brain activities, as measured by SSIM (20.97%, 31.00%, and 12.30% improvements on three subjects). Generates videos with higher semantic accuracy compared to previous methods. Attention analysis suggests biological plausibility with alignments to visual cortex and higher cognitive networks. Limited fMRI-video paired datasets restrict training data size. Further research is needed to improve temporal coherence and clarity in generated videos. fmri decoding, video reconstruction, diffusion models, contrastive learning, brain-computer interface
2402.01566 Report Boximator: Generating Rich and Controllable Motions for Video Synthesis Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, Hang Li Generating rich and controllable motion is a pivotal challenge in video synthesis. We propose Boximator, a new approach for fine-grained motion control. Boximator introduces two constraint types: hard box and soft box. Users select objects in the conditional frame using hard boxes and then use either type of boxes to roughly or rigorously define the object's position, shape, or motion path in future frames. Boximator functions as a plug-in for existing video diffusion models. Its training process preserves the base model's knowledge by freezing the original weights and training only the control module. To address training challenges, we introduce a novel self-tracking technique that greatly simplifies the learning of box-object correlations. Empirically, Boximator achieves state-of-the-art video quality (FVD) scores, improving on two base models, and further enhanced after incorporating box constraints. Its robust motion controllability is validated by drastic increases in the bounding box alignment metric. Human evaluation also shows that users favor Boximator generation results over the base model. Introduces Boximator, a novel approach for fine-grained video motion control using hard and soft box constraints, functioning as a plug-in for existing video diffusion models. Addresses the limitations of existing video synthesis methods by enabling precise control over object motion, pose, and interactions using intuitive box-based constraints. Leverages a novel self-tracking technique during training to learn box-object correlations by generating colored bounding boxes, guiding object generation and motion. Achieves state-of-the-art video quality (FVD) scores, outperforming base models with and without box constraints. Demonstrates robust motion controllability with significant improvements in bounding box alignment metrics (AP) on MSR-VTT and ActivityNet. Receives strong preference in human evaluation for both video quality and motion control compared to base models. Reliance on automated bounding box annotations may introduce noise and limit control accuracy. Current implementation focuses on single-object control per box, requiring further exploration for multi-object interactions within a single box. video synthesis, motion control, diffusion models, self-tracking, box constraints
2402.01524 Report HyperPlanes: Hypernetwork Approach to Rapid NeRF Adaptation Paweł Batorski, Dawid Malarz, Marcin Przewięźlikowski, Marcin Mazur, Sławomir Tadeja, Przemysław Spurek Neural radiance fields (NeRFs) are a widely accepted standard for synthesizing new 3D object views from a small number of base images. However, NeRFs have limited generalization properties, which means that we need to use significant computational resources to train individual architectures for each item we want to represent. To address this issue, we propose a few-shot learning approach based on the hypernetwork paradigm that does not require gradient optimization during inference. The hypernetwork gathers information from the training data and generates an update for universal weights. As a result, we have developed an efficient method for generating a high-quality 3D object representation from a small number of images in a single step. This has been confirmed by direct comparison with the state-of-the-art solutions and a comprehensive ablation study. Presents HyperPlanes, a novel few-shot learning approach for NeRF-based 3D object representation, leveraging the hypernetwork paradigm to generate updates for universal weights in a single step, eliminating the need for gradient optimization during inference. Addresses limitations of traditional NeRF models, such as their inability to generalize to new data and the need for extensive training times, by enabling rapid adaptation to new objects with limited data. Employs a hypernetwork that takes a few support ImagePlanes (HyperPlanes) and the target network weights to generate updates for the target PointMultiPlaneNeRF model, enabling efficient adaptation to new object representations. Achieves superior results in reconstructing unseen objects compared to gradient-based few-shot learning methods like REPTILE, even without fine-tuning. Exhibits strong generalization capabilities across different object types, outperforming MultiPlaneNeRF in cross-class object rendering. Demonstrates significantly faster object reconstruction (up to 380 times) than vanilla NeRF trained for a large number of epochs. Potential limitation in achieving the same rendering quality as a vanilla NeRF with extensive training. Future work will focus on exploring techniques to further enhance the rendering quality of the generated 3D objects. neural radiance fields, nerf, few-shot learning, hypernetworks, 3d object representation
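The core HyperPlanes mechanism described above, a hypernetwork that emits an additive update to the universal weights of a target network in a single forward pass so no inference-time gradient steps are needed, can be sketched in PyTorch. This is a minimal illustration with made-up layer sizes and a stand-in support feature; it is not the authors' architecture, only the "predict a weight delta" idea.

```python
import torch
import torch.nn as nn

class TargetMLP(nn.Module):
    """Tiny stand-in for the universal target network (e.g. a NeRF-style MLP)."""
    def __init__(self, d_in=3, d_hidden=64, d_out=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_out))

    def forward(self, x):
        return self.net(x)

class HyperNet(nn.Module):
    """Maps pooled support-image features to an additive update over all
    target-network weights, so adaptation needs no gradient optimization."""
    def __init__(self, target, d_support=128):
        super().__init__()
        n_params = sum(p.numel() for p in target.parameters())
        self.mlp = nn.Sequential(nn.Linear(d_support, 256), nn.ReLU(),
                                 nn.Linear(256, n_params))

    def forward(self, support_feat, target):
        delta = self.mlp(support_feat)
        params, i = {}, 0
        for name, p in target.named_parameters():
            n = p.numel()
            params[name] = p + delta[i:i + n].view_as(p)   # universal weight + update
            i += n
        return params

target = TargetMLP()
hyper = HyperNet(target)
support_feat = torch.randn(128)                 # stand-in for pooled support features
updated = hyper(support_feat, target)
out = torch.func.functional_call(target, updated, (torch.randn(10, 3),))
```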
2402.01472 Report Synthetic Data for the Mitigation of Demographic Biases in Face Recognition Pietro Melzi, Christian Rathgeb, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Dominik Lawatsch, Florian Domin, Maxim Schaubert This study investigates the possibility of mitigating the demographic biases that affect face recognition technologies through the use of synthetic data. Demographic biases have the potential to impact individuals from specific demographic groups, and can be identified by observing disparate performance of face recognition systems across demographic groups. They primarily arise from the unequal representations of demographic groups in the training data. In recent times, synthetic data have emerged as a solution to some problems that affect face recognition systems. In particular, during the generation process it is possible to specify the desired demographic and facial attributes of images, in order to control the demographic distribution of the synthesized dataset, and fairly represent the different demographic groups. We propose to fine-tune with synthetic data existing face recognition systems that present some demographic biases. We use synthetic datasets generated with GANDiffFace, a novel framework able to synthesize datasets for face recognition with controllable demographic distribution and realistic intra-class variations. We consider multiple datasets representing different demographic groups for training and evaluation. Also, we fine-tune different face recognition systems, and evaluate their demographic fairness with different metrics. Our results support the proposed approach and the use of synthetic data to mitigate demographic biases in face recognition. This paper investigates the use of synthetic data, generated by their novel GANDiffFace framework, to mitigate demographic bias in face recognition systems. Face recognition systems often exhibit biases against certain demographic groups due to unequal representation in training datasets. This study explores a solution using synthetic data to improve fairness. The authors fine-tuned two popular face recognition systems (ArcFace and CosFace) with synthetic datasets generated by GANDiffFace. They used two different fine-tuning datasets: one specifically representing the biased demographic (Asian), and another with balanced demographic representation. The effectiveness of bias mitigation was assessed using fairness metrics (FDR, IR, and GARBE) on two real-world datasets (DiveFace and RFW). Fine-tuning ArcFace with the balanced synthetic dataset mitigated bias effectively, leading to improved fairness metrics. Fine-tuning ArcFace with the Asian-specific synthetic dataset negatively impacted fairness by reducing FMR for Asians to levels significantly lower than other groups. Fine-tuning CosFace with the Asian-specific synthetic dataset showed minor fairness improvements, while using the balanced synthetic dataset did not yield consistent positive results. The study is limited to addressing bias against one specific demographic (Asian). Further investigation with different synthetic datasets and a wider range of demographic groups is needed. face recognition, demographic bias, fairness, synthetic data, gandiffface
2402.01459 Report GaMeS: Mesh-Based Adapting and Modification of Gaussian Splatting Joanna Waczyńska, Piotr Borycki, Sławomir Tadeja, Jacek Tabor, Przemysław Spurek Recently, a range of neural network-based methods for image rendering have been introduced. One such widely-researched neural radiance field (NeRF) relies on a neural network to represent 3D scenes, allowing for realistic view synthesis from a small number of 2D images. However, most NeRF models are constrained by long training and inference times. In comparison, Gaussian Splatting (GS) is a novel, state-of-the-art technique for rendering points in a 3D scene by approximating their contribution to image pixels through Gaussian distributions, warranting fast training and swift, real-time rendering. A drawback of GS is the absence of a well-defined approach for its conditioning due to the necessity to condition several hundred thousand Gaussian components. To solve this, we introduce the Gaussian Mesh Splatting (GaMeS) model, which allows modification of Gaussian components in a similar way as meshes. We parameterize each Gaussian component by the vertices of the mesh face. Furthermore, our model needs mesh initialization on input or estimated mesh during training. We also define Gaussian splats solely based on their location on the mesh, allowing for automatic adjustments in position, scale, and rotation during animation. As a result, we obtain a real-time rendering of editable GS. This paper introduces Gaussian Mesh Splatting (GaMeS), a novel method for representing and rendering editable Gaussian Splatting (GS) models using meshes. GaMeS parameterizes Gaussian components on mesh faces, enabling automatic adaptation to mesh modifications and facilitating real-time animation. Efficiently conditioning GS models, which consist of hundreds of thousands of Gaussian components, is challenging. GaMeS addresses this by directly coupling Gaussian components with mesh structures, enabling real-time editing and animation while maintaining rendering quality comparable to GS. GaMeS represents 3D scenes using Gaussian components positioned on mesh faces. It either utilizes existing meshes or generates a simplified mesh (pseudo-mesh) directly from Gaussian components. Gaussian parameters (mean, covariance) are parameterized by mesh vertices, ensuring automatic adaptation to mesh transformations. GaMeS achieves comparable rendering quality to state-of-the-art methods on the NeRF-Synthetic and Mip-NeRF360 datasets. GaMeS allows for real-time editing and animation of 3D scenes by manipulating the underlying mesh. The method effectively handles scenarios with and without pre-existing meshes, demonstrating flexibility in diverse applications. GaMeS may exhibit artifacts during significant mesh modifications, particularly with large mesh faces. Future work involves exploring strategies to handle Gaussian component adaptation when mesh faces are split during modification. gaussian splatting, mesh representation, 3d scene editing, real-time rendering, neural rendering
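The key GaMeS idea above, each splat parameterized by the vertices of the mesh face it sits on so that editing the mesh automatically moves the splat, can be illustrated with a small numpy sketch. The exact face-to-Gaussian mapping below (centroid as mean, frame from the normal and first edge, scales from edge lengths, a thin axis along the normal) is a plausible stand-in, not the paper's precise formulation.

```python
import numpy as np

def gaussian_from_face(v0, v1, v2, thickness=1e-3):
    """Derive a Gaussian (mean, rotation, covariance) from one triangle face.
    Re-running this after the mesh is edited re-positions the splat automatically."""
    mean = (v0 + v1 + v2) / 3.0                      # face centroid
    e1, e2 = v1 - v0, v2 - v0
    n = np.cross(e1, e2)
    n /= np.linalg.norm(n)                           # face normal
    t1 = e1 / np.linalg.norm(e1)                     # tangent along the first edge
    t2 = np.cross(n, t1)                             # completes the orthonormal frame
    R = np.stack([t1, t2, n], axis=1)                # rotation, columns = axes
    scale = np.array([np.linalg.norm(e1), np.linalg.norm(e2), thickness])
    cov = R @ np.diag(scale**2) @ R.T                # anisotropic, flat along the normal
    return mean, R, cov

v0, v1, v2 = np.array([0.0, 0, 0]), np.array([1.0, 0, 0]), np.array([0.0, 1, 0])
mean, R, cov = gaussian_from_face(v0, v1, v2)
```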
2402.01368 Report LIR: A Lightweight Baseline for Image Restoration Dongqi Fan, Ting Yue, Xin Zhao, Liang Chang Recently, there have been significant advancements in image restoration based on CNNs and transformers. However, the inherent characteristics of the image restoration task are often overlooked. Many works instead focus only on basic block design and stack numerous such blocks into the model, leading to redundant parameters and unnecessary computations that hinder the efficiency of image restoration. In this paper, we propose a Lightweight Baseline for Image Restoration called LIR to efficiently reconstruct the image and remove degradations (blur, rain, noise, haze). First, LIR uses a simple structural design to address the degradations present in the local and global residual connections that are ignored by modern networks. Then, to keep the model lightweight, a Lightweight Adaptive Attention (LAA) Block, mainly composed of the proposed Adaptive Filters and Attention Blocks, is introduced based on the inherent characteristics of image restoration. LAA is capable of adaptively sharpening contours, removing degradation, and capturing global information in various image restoration scenes in a computation-friendly manner. Extensive experiments demonstrate that our LIR achieves comparable performance to state-of-the-art models with fewer parameters and computations in certain tasks. In addition, it is worth noting that our LIR produces visual results that are more in line with human aesthetics than those of state-of-the-art networks. This paper introduces LIR, a lightweight image restoration network that effectively removes degradations (blur, rain, noise, haze) from images. Many existing image restoration networks prioritize complex block designs and stacking, leading to excessive parameters and computations. LIR aims to address this by offering a more efficient and lightweight solution. LIR leverages a novel Lightweight Adaptive Attention (LAA) block composed of Adaptive Filters and Attention Blocks. This design enables adaptive sharpening of contours, degradation removal, and efficient global information capture. LIR achieves comparable performance to state-of-the-art models on Rain100L for deraining while using fewer parameters and computations. LIR surpasses MLP and Transformer-based methods on SOTS outdoor for dehazing, demonstrating strong performance with fewer computations. LIR shows comparable performance to state-of-the-art on denoising (CBSD68, Urban100) and deblurring (GoPro, HIDE) tasks with a lightweight design. LIR's performance on deblurring, while strong, is less prominent compared to deraining and denoising, possibly due to limitations in handling dynamic contours of fast-moving objects. Future work could explore enhancements to the Adaptive Filter to better address the challenges posed by dynamic contours in deblurring tasks. image restoration, lightweight, attention, cnn, adaptive filter
2402.01355 Report FindingEmo: An Image Dataset for Emotion Recognition in the Wild Laurent Mertens, Elahe' Yargholi, Hans Op de Beeck, Jan Van den Stock, Joost Vennekens We introduce FindingEmo, a new image dataset containing annotations for 25k images, specifically tailored to Emotion Recognition. Contrary to existing datasets, it focuses on complex scenes depicting multiple people in various naturalistic, social settings, with images being annotated as a whole, thereby going beyond the traditional focus on faces or single individuals. Annotated dimensions include Valence, Arousal and Emotion label, with annotations gathered using Prolific. Together with the annotations, we release the list of URLs pointing to the original images, as well as all associated source code. Introduces FindingEmo, a new image dataset for Emotion Recognition in the Wild, focusing on complex scenes with multiple people in naturalistic social settings, annotated for Valence, Arousal, and Emotion label. Addresses the lack of datasets for emotion recognition beyond facial expressions, emphasizing the importance of context and social dynamics in understanding emotions. Collected 25k images from the internet using a custom scraper and filtering process, followed by annotation using Plutchik's Wheel of Emotions via Prolific platform. Annotations show expected correlations between Valence, Arousal, and Emotion labels. Baseline models trained on ImageNet outperform Places365-trained models, suggesting natural object features are more salient for emotion recognition. Late fusion with facial emotion recognition features significantly improves performance, highlighting the importance of facial expressions in complex social scenes. Dataset exhibits imbalance in emotion label distribution, reflecting real-world prevalence but potentially impacting model training. Limited annotator diversity (primarily young adults) may introduce bias in annotations. computer vision, dataset, emotion recognition, affective computing, social cognition
2402.01345 Report Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, Mike Zheng Shou Recent advancements in large vision-language models (LVLMs) have demonstrated impressive capabilities in understanding visual information with human language. Despite these advances, LVLMs still face challenges with multimodal hallucination, such as generating text descriptions of objects that are not present in the visual information. However, the fundamental causes of multimodal hallucination remain poorly explored. In this paper, we propose a new perspective, suggesting that the inherent biases in LVLMs might be a key factor in hallucinations. Specifically, we systematically identify a semantic shift bias related to paragraph breaks (\n\n), where the content before and after '\n\n' in the training data frequently exhibits significant semantic changes. This pattern leads the model to infer that the content following '\n\n' should differ markedly from the preceding, less hallucinatory content, thereby increasing the probability of hallucinatory descriptions after the '\n\n'. We have validated this hypothesis on multiple publicly available LVLMs. We also find that deliberately inserting '\n\n' into the generated description can induce more hallucinations. A simple method is proposed that effectively mitigates LVLM hallucination by skipping the output of '\n'. This paper identifies a semantic shift bias in LVLMs triggered by paragraph breaks ('\n\n') that can induce hallucinations. Hallucinations in LVLMs, where models generate descriptions of objects not present in the visual input, limit their deployment in safety-critical applications. The authors analyze the impact of paragraph breaks on hallucination severity in six LVLMs using the CHAIR evaluation framework and propose two mitigation methods: modifying prompts (MiHI) and adjusting decoding strategies (MiHO) to avoid '\n\n'. Descriptions generated after '\n\n' exhibit significantly more hallucinations. Manually inserting '\n\n' in generated descriptions increases hallucination probability. Both MiHI and MiHO effectively reduce hallucinations across most LVLMs, especially with greedy decoding. MiHI effectiveness depends on the LVLMs' instruction fine-tuning, showing less improvement in models like Fuyu-8B. The influence of model scale on the '\n\n'-induced hallucination problem requires further investigation. multimodal hallucination, large vision-language models, semantic shift bias, hallucination mitigation, paragraph breaks
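The decoding-side mitigation (MiHO) amounts to steering generation away from paragraph-break tokens. A minimal sketch using a Hugging Face transformers LogitsProcessor is below; the vocabulary scan for newline-bearing tokens and the decision to ban all of them are assumptions for illustration, not the authors' exact implementation.

```python
from transformers import LogitsProcessor

class SkipNewlineProcessor(LogitsProcessor):
    """Suppress every token whose decoded text contains a newline, so the model
    cannot open a new paragraph, where hallucinated content tends to appear."""
    def __init__(self, tokenizer):
        # One-time scan over the vocabulary (can take a few seconds).
        self.banned_ids = [i for i in range(len(tokenizer))
                           if "\n" in tokenizer.decode([i])]

    def __call__(self, input_ids, scores):
        scores[:, self.banned_ids] = -float("inf")   # never sample a newline token
        return scores

# Usage with any HF causal LM / LVLM language head (hypothetical setup):
#   from transformers import LogitsProcessorList
#   out = model.generate(**inputs,
#                        logits_processor=LogitsProcessorList([SkipNewlineProcessor(tokenizer)]))
```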
2402.01239 Report PRIME: Protect Your Videos From Malicious Editing Guanlin Li, Shuai Yang, Jie Zhang, Tianwei Zhang With the development of generative models, the quality of generated content keeps increasing. Recently, open-source models have made it surprisingly easy to manipulate and edit photos and videos, with just a few simple prompts. While these cutting-edge technologies have gained popularity, they have also given rise to concerns regarding the privacy and portrait rights of individuals. Malicious users can exploit these tools for deceptive or illegal purposes. Although some previous works focus on protecting photos against generative models, we find there are still gaps between protecting videos and images in the aspects of efficiency and effectiveness. Therefore, we introduce our protection method, PRIME, to significantly reduce the time cost and improve the protection performance. Moreover, to evaluate our proposed protection method, we consider both objective metrics and human subjective metrics. Our evaluation results indicate that PRIME only costs 8.3% GPU hours of the cost of the previous state-of-the-art method and achieves better protection results on both human evaluation and objective metrics. Code can be found in https://github.com/GuanlinLee/prime. This paper introduces PRIME, a novel black-box video protection method designed to safeguard videos against malicious editing techniques that exploit Latent Diffusion Models (LDMs). The rise of advanced video editing tools powered by LDMs poses a significant threat to individuals' privacy and portrait rights, enabling malicious actors to create and spread harmful content. PRIME leverages the transferability of adversarial perturbations, incorporating them into every frame of the video while employing mechanisms like 'fast convergence searching' and 'early stage stopping' to reduce computation time. Additionally, it utilizes an 'anti-dynamic compression' method to maintain perturbation effectiveness even after video compression. PRIME significantly reduces the time needed for video protection, requiring only 8.3% of the time taken by the baseline method Photoguard. It effectively disrupts malicious editing attempts, resulting in lower quality edited videos with reduced prompt matching as per human evaluations. PRIME demonstrates better transferability across different LDM models and editing pipelines compared to previous methods. The evaluation of malicious video editing is limited by the absence of a standardized benchmark dataset and the reliance on subjective human evaluation. Future work could explore the development of a robust, publicly available dataset for malicious video editing and protection research. video protection, malicious editing, latent diffusion models, adversarial perturbations, privacy protection
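At its core, this kind of protection adds an imperceptible adversarial perturbation to every frame so that a diffusion model's encoder no longer represents it faithfully. The sketch below is a generic PGD-style per-frame step under an L-infinity budget; the `encoder`, the feature-distortion objective, and the budget values are placeholders, and PRIME's fast-convergence search, early stopping, and anti-compression mechanisms are not reproduced here.

```python
import torch

def pgd_perturb_frame(frame, encoder, eps=8 / 255, alpha=2 / 255, steps=10):
    """Maximize the encoder's feature distortion for one frame, keeping the
    perturbation within +/- eps per pixel and the result in [0, 1]."""
    delta = torch.zeros_like(frame, requires_grad=True)
    with torch.no_grad():
        clean_feat = encoder(frame)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(encoder(frame + delta), clean_feat)
        loss.backward()                               # gradient ascent on distortion
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)                   # stay within the L-inf budget
            delta.data = (frame + delta).clamp(0, 1) - frame
        delta.grad.zero_()
    return (frame + delta).detach()

# Toy usage with a placeholder encoder; a real attack would target an LDM encoder.
encoder = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.Flatten())
protected = pgd_perturb_frame(torch.rand(1, 3, 64, 64), encoder)
```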
2402.01217 Report Taming Uncertainty in Sparse-view Generalizable NeRF via Indirect Diffusion Guidance Yaokun Li, Chao Gou, Guang Tan Neural Radiance Fields (NeRF) have demonstrated effectiveness in synthesizing novel views. However, their reliance on dense inputs and scene-specific optimization has limited their broader applicability. Generalizable NeRFs (Gen-NeRF), while intended to address this, often produce blurring artifacts in unobserved regions with sparse inputs, which are full of uncertainty. In this paper, we aim to diminish the uncertainty in Gen-NeRF for plausible renderings. We assume that NeRF's inability to effectively mitigate this uncertainty stems from its inherent lack of generative capacity. Therefore, we innovatively propose an Indirect Diffusion-guided NeRF framework, termed ID-NeRF, to address this uncertainty from a generative perspective by leveraging a distilled diffusion prior as guidance. Specifically, to avoid model confusion caused by directly regularizing with inconsistent samplings as in previous methods, our approach introduces a strategy to indirectly inject the inherently missing imagination into the learned implicit function through a diffusion-guided latent space. Empirical evaluation across various benchmarks demonstrates the superior performance of our approach in handling uncertainty with sparse inputs. Presents ID-NeRF, a novel Gen-NeRF framework that addresses uncertainty in unobserved regions of sparse-view scenarios through indirect guidance from a pre-trained diffusion model. Existing Gen-NeRFs struggle with blurry artifacts in unobserved regions due to a lack of generative capacity to handle the uncertainty associated with sparse inputs. ID-NeRF leverages score-based distillation to inject generative knowledge into a latent space, which then guides the refinement of reprojected visual features extracted from sparse views. This indirect guidance avoids model confusion caused by inconsistent direct supervision. ID-NeRF outperforms SOTA Gen-NeRFs on DTU, Blender, and RFF datasets, especially in challenging 2-input view settings. The method demonstrates superior performance in handling uncertainty, achieving better results as input sparsity increases. Ablation studies confirm the effectiveness of the latent space, attention-based guidance, and indirect supervision strategy. There's room for improvement in image fidelity, particularly in terms of SSIM and LPIPS metrics. Future work could explore faster inference and reduced model size for practical deployment. generative neural radiance fields, sparse-view reconstruction, uncertainty mitigation, indirect diffusion guidance, score-based distillation
2402.01162 Report 2AFC Prompting of Large Multimodal Models for Image Quality Assessment Hanwei Zhu, Xiangjie Sui, Baoliang Chen, Xuelin Liu, Peilin Chen, Yuming Fang, Shiqi Wang While abundant research has been conducted on improving high-level visual understanding and reasoning capabilities of large multimodal models (LMMs), their visual quality assessment (IQA) ability has been relatively under-explored. Here we take initial steps towards this goal by employing two-alternative forced choice (2AFC) prompting, as 2AFC is widely regarded as the most reliable way of collecting human opinions of visual quality. Subsequently, the global quality score of each image estimated by a particular LMM can be efficiently aggregated using maximum a posteriori estimation. Meanwhile, we introduce three evaluation criteria: consistency, accuracy, and correlation, to provide comprehensive quantifications and deeper insights into the IQA capability of five LMMs. Extensive experiments show that existing LMMs exhibit remarkable IQA ability on coarse-grained quality comparison, but there is room for improvement on fine-grained quality discrimination. The proposed dataset sheds light on the future development of IQA models based on LMMs. The codes will be made publicly available at https://github.com/h4nwei/2AFC-LMMs. This paper proposes a framework to evaluate the Image Quality Assessment (IQA) capability of Large Multimodal Models (LMMs) using a 2AFC prompting approach and three evaluation metrics. This is important because while LMMs' high-level visual understanding has been studied, their low-level visual processing abilities, such as IQA, remain largely unexplored. The study uses coarse-to-fine pairing rules for image comparison and employs maximum a posteriori (MAP) estimation to aggregate pairwise preferences into global quality rankings. Three evaluation criteria: consistency, accuracy, and correlation are introduced to quantify LMMs' IQA performance. Open-source LMMs show poor consistency and potential biases in IQA tasks. GPT-4V exhibits superior IQA ability compared to other LMMs, especially on realistically distorted images. Existing LMMs, including GPT-4V, struggle with fine-grained IQA, indicating areas for improvement. Limited number of open-source LMMs are evaluated due to the requirement of accepting multiple images as input. Future work includes extending the evaluation to more LMMs and exploring advanced prompting techniques to further enhance their IQA capabilities. large multimodal models, image quality assessment, two-alternative forced choice, map estimation, fine-grained iqa
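Aggregating the LMM's pairwise 2AFC answers into a global quality score per image is a standard paired-comparison scaling problem. The paper uses maximum a posteriori estimation; the sketch below uses the closely related Bradley-Terry maximum-likelihood iteration as an illustration, on a made-up win-count matrix.

```python
import numpy as np

def bradley_terry(wins, iters=200, eps=1e-9):
    """wins[i, j] = number of times image i was preferred over image j.
    Returns one quality score per image (higher = better)."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):                            # Zermelo / MM fixed-point updates
        new_p = np.empty(n)
        for i in range(n):
            num = wins[i].sum()                       # total wins of image i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            new_p[i] = num / (den + eps)
        p = new_p / new_p.sum()                       # normalize each round
    return np.log(p + eps)                            # log-scores are convenient to rank

wins = np.array([[0, 8, 9],                           # toy 3-image example:
                 [2, 0, 6],                           # image 0 wins most comparisons
                 [1, 4, 0]])
print(bradley_terry(wins))
```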
2402.01123 Report A Single Simple Patch is All You Need for AI-generated Image Detection Jiaxuan Chen, Jieteng Yao, Li Niu The recent development of generative models unleashes the potential of generating hyper-realistic fake images. To prevent the malicious usage of fake images, AI-generated image detection aims to distinguish fake images from real images. However, existing methods suffer from a severe performance drop when detecting images generated by unseen generators. We find that generative models tend to focus on generating patches with rich textures to make the images more realistic, while neglecting the hidden noise caused by camera capture that is present in simple patches. In this paper, we propose to exploit the noise pattern of a single simple patch to identify fake images. Furthermore, to address the performance decline when handling low-quality generated images, we introduce an enhancement module and a perception module to remove the interfering information. Extensive experiments demonstrate that our method can achieve state-of-the-art performance on public benchmarks. This paper proposes a novel AI-generated image detection method called Single Simple Patch (SSP) network that leverages noise patterns in simple image patches to distinguish between real and fake images. AI-generated image detection is crucial to prevent the malicious use of hyper-realistic fake images, but existing methods struggle with generalization across different generators and image quality degradation. The method extracts the simplest patch from an image, analyzes its noise fingerprints using SRM filters, and employs a ResNet50 classifier. An enhanced version incorporates an enhancement module and a perception module to mitigate blur and compression artifacts. SSP network effectively distinguishes real and fake images by focusing on noise patterns in simple patches, outperforming existing methods on cross-generator settings. The enhanced SSP network demonstrates robustness to image quality degradation like blur and compression, achieving improved accuracy on low-quality images. Experimental results on GenImage and ForenSynths datasets show superior performance compared to state-of-the-art methods, highlighting the effectiveness of the proposed approach. The method's performance may be limited when dealing with extremely low-quality images, particularly those with compression quality lower than 90. Future work could explore integrating information from simple patches with other aspects of the original image to further enhance robustness. ai-generated image detection, generative models, noise pattern analysis, simple patch, image forensics
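The two concrete steps in the pipeline above, picking the "simplest" (least textured) patch and extracting its high-frequency noise residual, can be sketched with numpy/scipy. The variance criterion and the single 3x3 high-pass kernel are illustrative stand-ins for the paper's patch-selection rule and SRM filter bank.

```python
import numpy as np
from scipy.signal import convolve2d

HIGH_PASS = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=np.float32) / 8.0   # crude SRM-like filter

def simplest_patch(gray, patch=64, stride=32):
    """Return the patch with the lowest pixel variance (least texture)."""
    best, best_var = None, np.inf
    h, w = gray.shape
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            p = gray[y:y + patch, x:x + patch]
            if p.var() < best_var:
                best, best_var = p, p.var()
    return best

def noise_residual(gray):
    """High-pass residual of the simplest patch; this is what the classifier would see."""
    return convolve2d(simplest_patch(gray), HIGH_PASS, mode="same", boundary="symm")

gray = np.random.rand(256, 256).astype(np.float32)    # stand-in for a grayscale image
residual = noise_residual(gray)
```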
2402.00909 Report Generalizing GradCAM for Embedding Networks Mudit Bachhawat Visualizing CNN is an important part in building trust and explaining model's prediction. Methods like CAM and GradCAM have been really successful in localizing area of the image responsible for the output but are only limited to classification models. In this paper, we present a new method EmbeddingCAM, which generalizes the Grad-CAM for embedding networks. We show that for classification networks, EmbeddingCAM reduces to GradCAM. We show the effectiveness of our method on CUB-200-2011 dataset and also present quantitative and qualitative analysis on the dataset. This paper presents EmbeddingCAM, a novel method for generating GradCAM-style heatmaps to explain predictions from any visual embedding network. Visualizing and explaining predictions of embedding networks, increasingly used in applications like open-set classification, is crucial for building trust and understanding model behavior. EmbeddingCAM uses class proxies as substitutes for class labels and defines a custom loss based on the agreement between the model output and the proxy. This loss is then backpropagated to generate a heatmap, similar to GradCAM for classification networks. EmbeddingCAM successfully generates heatmaps highlighting relevant image regions for embedding networks. It outperforms or achieves comparable performance to previous methods on the CUB-200-2011 dataset for mean heatmap ratio and weakly supervised localization accuracy. Unlike prior methods, EmbeddingCAM does not require multiple image sampling or test-time indexing, making it more efficient and generalizable. The paper primarily evaluates EmbeddingCAM on the CUB-200-2011 dataset, focusing on fine-grained classification; further exploration on diverse datasets and tasks is needed. Future work could explore generating heatmaps at the input image scale, potentially revealing finer-grained insights. explainable ai, visual embedding networks, gradcam, heatmap visualization, metric learning
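The mechanism summarized above, replacing the classification logit in Grad-CAM with a similarity between the network's output and a class proxy, is compact enough to sketch in PyTorch. The choice of layer, the cosine-similarity loss, and the use of a ResNet backbone whose output vector stands in for the embedding are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(act=o))

def embedding_cam(image, proxy):
    """image: (1, 3, H, W); proxy: (D,) class proxy acting as the 'label'."""
    emb = model(image)                                    # output vector as the embedding
    loss = F.cosine_similarity(emb, proxy[None], dim=1).sum()
    grads = torch.autograd.grad(loss, feats["act"])[0]    # d(similarity)/d(activations)
    weights = grads.mean(dim=(2, 3), keepdim=True)        # global-average-pool the grads
    cam = F.relu((weights * feats["act"]).sum(dim=1))     # weighted sum over channels
    return (cam / (cam.max() + 1e-8))[0]                  # (h, w) heatmap in [0, 1]

heatmap = embedding_cam(torch.randn(1, 3, 224, 224), torch.randn(1000))
```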
2402.00867 Report AToM: Amortized Text-to-Mesh using 2D Diffusion Guocheng Qian, Junli Cao, Aliaksandr Siarohin, Yash Kant, Chaoyang Wang, Michael Vasilkovsky, Hsin-Ying Lee, Yuwei Fang, Ivan Skorokhodov, Peiye Zhuang, Igor Gilitschenski, Jian Ren, Bernard Ghanem, Kfir Aberman, Sergey Tulyakov We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh framework optimized across multiple text prompts simultaneously. In contrast to existing text-to-3D methods that often entail time-consuming per-prompt optimization and commonly output representations other than polygonal meshes, AToM directly generates high-quality textured meshes in less than 1 second with around 10 times reduction in the training cost, and generalizes to unseen prompts. Our key idea is a novel triplane-based text-to-mesh architecture with a two-stage amortized optimization strategy that ensures stable training and enables scalability. Through extensive experiments on various prompt benchmarks, AToM significantly outperforms state-of-the-art amortized approaches with over 4 times higher accuracy (in DF415 dataset) and produces more distinguishable and higher-quality 3D outputs. AToM demonstrates strong generalizability, offering finegrained 3D assets for unseen interpolated prompts without further optimization during inference, unlike per-prompt solutions. This paper introduces AToM, the first amortized text-to-mesh model that directly generates textured meshes from text prompts. AToM addresses the limitations of existing text-to-3D methods that are either time-consuming per-prompt optimizations or limited to representations other than polygonal meshes. AToM employs a novel triplane-based text-to-mesh architecture and a two-stage amortized optimization strategy. It first trains with low-resolution volumetric rendering and then refines with high-resolution mesh rasterization. AToM generates high-quality textured meshes in under one second from a text prompt. AToM generalizes to unseen text prompts without requiring further optimization, unlike per-prompt methods. AToM outperforms state-of-the-art amortized approaches like ATT3D with significantly higher accuracy and quality, especially in large-scale datasets. The quality of AToM is currently limited by the resolution of the text-to-image diffusion prior used. The DMTet mesh representation used in AToM cannot model surfaces with nonzero genus. text-to-mesh, amortized optimization, 3d generation, generative ai, diffusion models
2402.00864 Report ViCA-NeRF: View-Consistency-Aware 3D Editing of Neural Radiance Fields Jiahua Dong, Yu-Xiong Wang We introduce ViCA-NeRF, the first view-consistency-aware method for 3D editing with text instructions. In addition to the implicit neural radiance field (NeRF) modeling, our key insight is to exploit two sources of regularization that explicitly propagate the editing information across different views, thus ensuring multi-view consistency. For geometric regularization, we leverage the depth information derived from NeRF to establish image correspondences between different views. For learned regularization, we align the latent codes in the 2D diffusion model between edited and unedited images, enabling us to edit key views and propagate the update throughout the entire scene. Incorporating these two strategies, our ViCA-NeRF operates in two stages. In the initial stage, we blend edits from different views to create a preliminary 3D edit. This is followed by a second stage of NeRF training, dedicated to further refining the scene's appearance. Experimental results demonstrate that ViCA-NeRF provides more flexible, efficient (3 times faster) editing with higher levels of consistency and details, compared with the state of the art. Our code is publicly available. Introduces ViCA-NeRF, the first view-consistency-aware method for editing 3D scenes via text instructions, improving upon existing methods by enhancing multi-view consistency and editing efficiency. Addresses the limitations of existing NeRF editing methods which lack explicit 3D structure and are computationally expensive, leading to inconsistencies and inefficiencies in editing. Leverages two sources of regularization: geometric regularization through depth-guided image correspondence for preliminary edits and learned regularization via a blending refinement model (modified Instruct-Pix2Pix) to align latent codes across views, ensuring consistency. Achieves multi-view consistent 3D editing with text instructions across diverse scenes. Offers controllability by allowing edits in key views to propagate throughout the 3D scene. Significantly faster than previous methods (3 times faster than Instruct-NeRF2NeRF). Effectiveness depends on the accuracy of depth maps generated by NeRF. Edited outputs may exhibit increased blurriness compared to the original NeRF. nerf, 3d editing, text-guided editing, view consistency, diffusion models
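The geometric regularization above rests on a standard operation: use the NeRF-estimated depth to lift a pixel from one view into 3D and reproject it into another view, so an edit made in a key view knows where to land elsewhere. A minimal pinhole-camera sketch follows; the camera-to-world convention and variable names are assumptions for illustration.

```python
import numpy as np

def reproject(uv, depth, K_a, c2w_a, K_b, c2w_b):
    """Map pixel `uv` in view A, with depth along A's optical axis, to view B."""
    u, v = uv
    ray_cam = np.linalg.inv(K_a) @ np.array([u, v, 1.0])     # back-project in camera A
    p_cam_a = depth * ray_cam
    p_world = c2w_a[:3, :3] @ p_cam_a + c2w_a[:3, 3]         # camera A -> world
    w2c_b = np.linalg.inv(c2w_b)
    p_cam_b = w2c_b[:3, :3] @ p_world + w2c_b[:3, 3]         # world -> camera B
    uvw = K_b @ p_cam_b
    return uvw[:2] / uvw[2]                                  # perspective divide

K = np.array([[500.0, 0, 128], [0, 500.0, 128], [0, 0, 1]])
c2w_a, c2w_b = np.eye(4), np.eye(4)
c2w_b[0, 3] = 0.2                                            # view B shifted along x
print(reproject((100, 120), depth=2.0, K_a=K, c2w_a=c2w_a, K_b=K, c2w_b=c2w_b))
```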
2402.00863 Report Geometry Transfer for Stylizing Radiance Fields Hyunyoung Jung, Seonghyeon Nam, Nikolaos Sarafianos, Sungjoo Yoo, Alexander Sorkine-Hornung, Rakesh Ranjan Shape and geometric patterns are essential in defining stylistic identity. However, current 3D style transfer methods predominantly focus on transferring colors and textures, often overlooking geometric aspects. In this paper, we introduce Geometry Transfer, a novel method that leverages geometric deformation for 3D style transfer. This technique employs depth maps to extract a style guide, subsequently applied to stylize the geometry of radiance fields. Moreover, we propose new techniques that utilize geometric cues from the 3D scene, thereby enhancing aesthetic expressiveness and more accurately reflecting intended styles. Our extensive experiments show that Geometry Transfer enables a broader and more expressive range of stylizations, thereby significantly expanding the scope of 3D style transfer. Introduces "Geometry Transfer," a novel method that leverages geometric deformation for 3D style transfer using depth maps to extract style guides, thereby stylizing the geometry of radiance fields. Existing 3D style transfer techniques primarily focus on transferring color and texture while neglecting geometry, which is crucial for defining stylistic identity. Utilizes depth maps as style guides, introduces a deformation network for synchronized shape and appearance modification, and proposes RGB-D stylization techniques like geometry-aware matching and perspective style augmentation. Enables coherent stylization of both shape and appearance in 3D scenes. Demonstrates superior performance compared to existing 3D style transfer methods in quantitative metrics and user studies. Seamlessly integrates with Panoptic Lifting for partial stylization of 3D scenes. Limited to non-360° scenes due to the reliance on TensoRF representation. Stylizing 360° scenes with a single style image is ill-posed; exploring multi-view or 3D style guides could be beneficial. 3d style transfer, geometry stylization, radiance fields, depth maps, deformation fields
2402.00769 Report AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi, Weikang Bian, Guanglu Song, Yu Liu, Hongsheng Li Video diffusion models have been gaining increasing attention for their ability to produce videos that are both coherent and of high fidelity. However, the iterative denoising process makes them computationally intensive and time-consuming, thus limiting their applications. Inspired by the Consistency Model (CM), which distills pretrained image diffusion models to accelerate sampling with minimal steps, and its successful extension, the Latent Consistency Model (LCM), for conditional image generation, we propose AnimateLCM, allowing for high-fidelity video generation within minimal steps. Instead of directly conducting consistency learning on the raw video dataset, we propose a decoupled consistency learning strategy that decouples the distillation of image generation priors and motion generation priors, which improves training efficiency and enhances visual quality. Additionally, to enable the combination of plug-and-play adapters from the Stable Diffusion community for various functions (e.g., ControlNet for controllable generation), we propose an efficient strategy to adapt existing adapters to our distilled text-conditioned video consistency model or train adapters from scratch without harming the sampling speed. We validate the proposed strategy on image-conditioned video generation and layout-conditioned video generation, all achieving top-performing results. Experimental results validate the effectiveness of our proposed method. Code and weights will be made public. More details are available at https://github.com/G-U-N/AnimateLCM. Presents AnimateLCM, a novel approach for fast and high-fidelity video generation within minimal steps by adapting Stable Diffusion-based video models to follow the self-consistency property. Addresses the computational intensity and slow generation of video diffusion models, aiming for high-quality video generation with significantly reduced steps. Employs a decoupled consistency learning strategy, separating image generation and motion priors distillation, and a teacher-free adaptation strategy for integrating or training adapters without sacrificing speed. Achieves state-of-the-art results on UCF-101 in terms of FVD and CLIPSIM metrics, outperforming baseline diffusion models, especially in low step regimes. Demonstrates good compatibility with personalized image diffusion models, allowing diverse and high-quality video generation in various styles. Enables fast and high-quality image-to-video and controllable video generation with minimal steps through the teacher-free adaptation strategy. Limited performance for one-step sample generation, potentially resulting in blurry or artifact-ridden outputs. Future work includes exploring more sophisticated ODE solvers and alternative score estimation strategies for improved one-step generation quality. video generation, diffusion models, consistency models, stable diffusion, video acceleration
2402.00752 Report On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy Letian Huang, Jiayang Bai, Jie Guo, Yuanqi Li, Yanwen Guo 3D Gaussian Splatting has garnered extensive attention and application in real-time neural rendering. Concurrently, concerns have been raised about the limitations of this technology in aspects such as point cloud storage, performance, and robustness in sparse viewpoints, leading to various improvements. However, there has been a notable lack of attention to the fundamental problem of projection errors introduced by the local affine approximation inherent in the splatting itself, and the consequential impact of these errors on the quality of photo-realistic rendering. This paper addresses the projection error function of 3D Gaussian Splatting, commencing with the residual error from the first-order Taylor expansion of the projection function. The analysis establishes a correlation between the error and the Gaussian mean position. Subsequently, leveraging function optimization theory, this paper analyzes the function's minima to provide an optimal projection strategy for Gaussian Splatting referred to Optimal Gaussian Splatting, which can accommodate a variety of camera models. Experimental validation further confirms that this projection methodology reduces artifacts, resulting in a more convincingly realistic rendering. This paper presents Optimal Gaussian Splatting, a novel projection method for 3D Gaussian Splatting (3D-GS) that minimizes projection errors to enhance rendering quality. Existing 3D-GS techniques suffer from projection errors due to local affine approximations, leading to artifacts in rendered images, especially with wide-angle lenses. This work addresses this by minimizing these errors to improve rendering realism. The authors analyze the error function of the 3D-GS projection, identifying its correlation with Gaussian mean position. They then derive an optimal projection strategy that minimizes this error by projecting each Gaussian onto a tangent plane based on its mean and the camera center. Optimal Gaussian Splatting reduces artifacts and enhances rendering quality compared to the original 3D-GS. The method demonstrates robustness against increasing field of view and decreasing focal length, outperforming 3D-GS in wide-angle settings. It is easily adaptable to various camera models like fisheye and panorama with simple modifications. Training time slightly increases due to the additional transformation from tangent plane to image plane. Future work could explore optimizing Gaussian covariance's influence on projection and further enhance Gaussian Splatting as a scene representation technique. 3d gaussian splatting, novel view synthesis, error analysis, optimal projection, real-time rendering
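The remedy described above replaces the single shared image plane of the local affine approximation with a per-Gaussian tangent plane through the direction of its mean. A small numpy sketch of that projection idea follows (a gnomonic projection onto the plane tangent to a sphere at the mean's unit direction); the scale factor and conventions are illustrative, not the paper's exact derivation.

```python
import numpy as np

def project_to_tangent_plane(points_cam, mu_cam, focal=1.0):
    """Project camera-space `points_cam` (N, 3) onto the plane tangent at the unit
    direction of the Gaussian mean `mu_cam`, at distance `focal` from the camera.
    Each Gaussian gets its own plane, which is what reduces the projection error
    for splats far from the image center."""
    d = mu_cam / np.linalg.norm(mu_cam)          # unit direction to the Gaussian mean
    t = focal / (points_cam @ d)                 # ray-plane intersection parameter
    return points_cam * t[:, None]               # points now lie on the tangent plane

mu = np.array([0.5, 0.2, 2.0])
samples = mu + 0.05 * np.random.randn(100, 3)    # points near the Gaussian
on_plane = project_to_tangent_plane(samples, mu)
print(np.allclose(on_plane @ (mu / np.linalg.norm(mu)), 1.0))  # all satisfy x . d = focal
```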
2402.00631 Report Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation Yang Li, Songlin Yang, Wei Wang, Jing Dong Advanced diffusion-based Text-to-Image (T2I) models, such as the Stable Diffusion Model, have made significant progress in generating diverse and high-quality images using text prompts alone. However, when non-famous users require personalized image generation for their identities (IDs), the T2I models fail to accurately generate their ID-related images. The main problem is that pre-trained T2I models do not learn the mapping between the new ID prompts and their corresponding visual content. The previous methods either failed to accurately fit the face region or lost the interactive generative ability with other existing concepts in T2I models. In other words, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (``Eiffel Tower''), actions (``holding a basketball''), and facial attributes (``eyes closed''). In this paper, we focus on inserting accurate and interactive ID embedding into the Stable Diffusion Model for semantic-fidelity personalized generation. We address this challenge from two perspectives: face-wise region fitting and semantic-fidelity token optimization. Specifically, we first visualize the attention overfit problem and propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background. This key trick significantly enhances the ID accuracy and interactive generative ability with other existing concepts. Then, we optimize one ID representation as multiple per-stage tokens where each token contains two disentangled features. This expansion of the textual conditioning space improves semantic-fidelity control. Extensive experiments validate that our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods. This paper introduces a novel method for inserting accurate and interactive identity embeddings into pre-trained Text-to-Image diffusion models for personalized image generation. Existing methods for personalized generation often struggle with attention overfitting (embedding ID-unrelated information) and limited semantic fidelity, leading to inaccurate and inflexible image generation. The proposed method utilizes a face-wise attention loss to focus on ID-related face regions and neglect background information, and optimizes ID representation as multiple per-stage tokens with disentangled features for enhanced semantic control. The method achieves higher accuracy in identity embedding compared to previous approaches. It exhibits superior interactive generative ability, enabling control over scenes, facial attributes, and actions. The approach requires minimal training time and introduces fewer parameters compared to some existing techniques. The manipulation capacity for diverse and high-fidelity image generation can be further improved. The study primarily focuses on face embedding and can be extended to encompass a wider range of object categories. text-to-image generation, diffusion models, personalized generation, semantic fidelity, attention overfitting
2402.00627 Report CapHuman: Capture Your Moments in Parallel Universes Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang We concentrate on a novel human-centric image synthesis task, that is, given only one reference facial photograph, it is expected to generate specific individual images with diverse head positions, poses, facial expressions, and illuminations in different contexts. To accomplish this goal, we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. As a basis, we aim to unleash the above two capabilities of the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the "encode then learn to align" paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman. This paper presents CapHuman, a novel framework for human-centric image synthesis that allows for the generation of photorealistic portraits of specific individuals with controllable head positions, poses, facial expressions, and illuminations in different contexts. This work addresses the limitations of existing text-to-image models that struggle with identity preservation and fine-grained control over human head features, particularly in one-shot settings. CapHuman leverages the pretrained Stable Diffusion model and incorporates two key components: 1) an "encode then learn to align" paradigm for identity preservation using global and local features and 2) a 3D facial prior (FLAME) for flexible and 3D-consistent head control. CapHuman generates high-quality, identity-preserved portraits with diverse head renditions, outperforming previous state-of-the-art methods. The model demonstrates generalizable identity preservation capabilities, eliminating the need for cumbersome fine-tuning for each new individual. Quantitative and qualitative analysis on the proposed HumanIPHC benchmark confirm the effectiveness of CapHuman in identity preservation, text-to-image alignment, and head control precision. The model's generative capabilities are limited by the pre-training dataset and may not generalize well to scenarios outside its distribution. The accuracy of 3D facial reconstruction relies on the estimation accuracy of DECA, which can be limited for extreme poses and expressions, leading to potential misalignment. image synthesis, identity preservation, head control, diffusion models, 3d facial prior
2402.00626 Report Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer Typographic Attacks, which involve pasting misleading text onto an image, were noted to harm the performance of Vision-Language Models like CLIP. However, the susceptibility of recent Large Vision-Language Models to these attacks remains understudied. Furthermore, prior work's Typographic attacks against CLIP randomly sample a misleading class from a predefined set of categories. However, this simple strategy misses more effective attacks that exploit LVLMs' stronger language skills. To address these issues, we first introduce a benchmark for testing Typographic attacks against LVLMs. Moreover, we introduce two novel and more effective Self-Generated attacks which prompt the LVLM to generate an attack against itself: 1) Class Based Attack where the LVLM (e.g. LLaVA) is asked which deceiving class is most similar to the target class and 2) Descriptive Attacks where a more advanced LVLM (e.g. GPT4-V) is asked to recommend a Typographic attack that includes both a deceiving class and description. Using our benchmark, we uncover that Self-Generated attacks pose a significant threat, reducing LVLMs' classification performance by up to 33%. We also uncover that attacks generated by one model (e.g. GPT-4V or LLaVA) are effective against the model itself and other models like InstructBLIP and MiniGPT4. Code: https://github.com/mqraitem/Self-Gen-Typo-Attack The paper introduces a new benchmark for evaluating typographic attacks against Large Vision Language Models (LVLMs) and proposes novel self-generated attacks that leverage the LVLMs themselves to devise more effective attacks. This work addresses the urgent threat of typographic attacks that can mislead LVLMs by exploiting their reliance on textual cues for image interpretation, especially given the increasing sophistication and accessibility of these models. The authors develop a benchmark using five diverse classification datasets and evaluate four recent LVLMs (GPT-4V, LLaVA 1.5, MiniGPT4, and InstructBLIP) against three types of attacks: random class, class-based (using LVLMs to identify similar classes), and descriptive (using LVLMs to generate deceiving descriptions). Self-generated attacks, particularly descriptive attacks, significantly reduce LVLMs' classification accuracy (up to 33%). Descriptive attacks with relevant descriptions are more effective than those with random or no descriptions, highlighting LVLMs' language understanding capabilities. While prompting LVLMs to ignore text shows some improvement, it doesn't fully mitigate the impact of typographic attacks. The study is limited by the computational cost of evaluating GPT-4V, restricting the number of test samples. Future work should explore defenses against these attacks and investigate their generalization to other LVLM tasks beyond classification. typographic attacks, large vision language models, self-generated attacks, benchmarking, vision and language
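Mounting the attack itself is as simple as the name suggests: render the deceiving class (plus, for the descriptive variant, a short justification) onto the image before showing it to the LVLM. A minimal PIL sketch follows; the file paths and the attack string, assumed to have come from prompting an LVLM as described above, are hypothetical.

```python
from PIL import Image, ImageDraw

def typographic_attack(img_path, attack_text, out_path, xy=(10, 10)):
    """Paste misleading text onto an image (the core of a typographic attack)."""
    img = Image.open(img_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    # Default bitmap font; a larger TrueType font via ImageFont.truetype is also common.
    draw.text(xy, attack_text, fill=(255, 255, 255))
    img.save(out_path)

# Hypothetical self-generated class-based attack string for a retriever photo:
typographic_attack("dog.jpg", "Nova Scotia duck tolling retriever", "dog_attacked.jpg")
```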
2402.00606 Report Dynamic Texture Transfer using PatchMatch and Transformers Guo Pu, Shiyao Xu, Xixin Cao, Zhouhui Lian How to automatically transfer the dynamic texture of a given video to the target still image is a challenging and ongoing problem. In this paper, we propose to handle this task via a simple yet effective model that utilizes both PatchMatch and Transformers. The key idea is to decompose the task of dynamic texture transfer into two stages, where the start frame of the target video with the desired dynamic texture is synthesized in the first stage via a distance map guided texture transfer module based on the PatchMatch algorithm. Then, in the second stage, the synthesized image is decomposed into structure-agnostic patches, according to which their corresponding subsequent patches can be predicted by exploiting the powerful capability of Transformers equipped with VQ-VAE for processing long discrete sequences. After getting all those patches, we apply a Gaussian weighted average merging strategy to smoothly assemble them into each frame of the target stylized video. Experimental results demonstrate the effectiveness and superiority of the proposed method in dynamic texture transfer compared to the state of the art. Proposes DynTexture, a novel neural-based approach to automatically transfer dynamic texture effects from a source video to a target image, enabling one-shot dynamic texture transfer. Automates the laborious process of designing dynamic textures, such as those used in films, digital posters, and online media, thereby improving efficiency and enabling more creative applications. Uses a two-stage architecture: 1) a distance map guided texture transfer module (based on PatchMatch) to synthesize the initial stylized frame, and 2) a deep sequence forecasting module (based on Transformers and VQ-VAE) to predict and synthesize subsequent stylized frames. Achieves superior performance in dynamic text effects transfer with various font styles and glyphs, accurately transferring complex dynamic effects like burning flames and flowing water. Outperforms state-of-the-art methods in qualitative and quantitative comparisons, demonstrating better texture quality, spatial and temporal consistency, and the ability to handle moving dynamic effects. Demonstrates versatility in other applications like image animation, modifying image layouts, and animating them according to driving videos. The choice of patch size is crucial and requires careful consideration for optimal performance. Quantitative evaluation of one-shot learning tasks remains challenging due to the lack of ground truth data. texture transfer, video synthesis, image generation, patchmatch, transformers
2402.00525 Report StopThePop: Sorted Gaussian Splatting for View-Consistent Real-time Rendering Lukas Radl, Michael Steiner, Mathias Parger, Alexander Weinrauch, Bernhard Kerbl, Markus Steinberger Gaussian Splatting has emerged as a prominent model for constructing 3D representations from images across diverse domains. However, the efficiency of the 3D Gaussian Splatting rendering pipeline relies on several simplifications. Notably, reducing Gaussian to 2D splats with a single view-space depth introduces popping and blending artifacts during view rotation. Addressing this issue requires accurate per-pixel depth computation, yet a full per-pixel sort proves excessively costly compared to a global sort operation. In this paper, we present a novel hierarchical rasterization approach that systematically resorts and culls splats with minimal processing overhead. Our software rasterizer effectively eliminates popping artifacts and view inconsistencies, as demonstrated through both quantitative and qualitative measurements. Simultaneously, our method mitigates the potential for cheating view-dependent effects with popping, ensuring a more authentic representation. Despite the elimination of cheating, our approach achieves comparable quantitative results for test images, while increasing the consistency for novel view synthesis in motion. Due to its design, our hierarchical approach is only 4% slower on average than the original Gaussian Splatting. Notably, enforcing consistency enables a reduction in the number of Gaussians by approximately half with nearly identical quality and view-consistency. Consequently, rendering performance is nearly doubled, making our approach 1.6x faster than the original Gaussian Splatting, with a 50% reduction in memory requirements. This paper presents a novel hierarchical rasterization approach for 3D Gaussian Splatting that addresses popping artifacts by performing per-pixel sorting of splats, while maintaining real-time performance. Popping artifacts are a common issue in Gaussian Splatting, particularly noticeable during camera rotations, which detracts from the realism and quality of rendered scenes. The proposed method utilizes a hierarchical rendering pipeline that exploits coherence among neighboring view rays on multiple hierarchy levels, interleaving culling, depth evaluation, and resorting operations. The hierarchical renderer effectively eliminates popping artifacts and view inconsistencies. It achieves comparable quantitative image quality metrics to the original Gaussian Splatting method. The hierarchical approach adds an overhead of only 4% compared to the original Gaussian Splatting, while enabling a 2x reduction in memory and 1.6x faster rendering by using Opacity Decay during training. The hierarchical resorting may not guarantee perfect blend order in all cases, potentially leading to residual artifacts. The method still approximates true 3D Gaussian rendering, ignoring potential overlaps between Gaussians along a view ray. Future work could investigate fully correct volume rendering of Gaussians for further quality improvements. gaussian splatting, neural rendering, view consistency, hierarchical rasterization, real-time rendering
2402.00351 Report Machine Unlearning for Image-to-Image Generative Models Guihong Li, Hsiang Hsu, Chun-Fu Chen, Radu Marculescu Machine unlearning has emerged as a new paradigm to deliberately forget data samples from a given model in order to adhere to stringent regulations. However, existing machine unlearning methods have been primarily focused on classification models, leaving the landscape of unlearning for generative models relatively unexplored. This paper serves as a bridge, addressing the gap by providing a unifying framework of machine unlearning for image-to-image generative models. Within this framework, we propose a computationally-efficient algorithm, underpinned by rigorous theoretical analysis, that demonstrates negligible performance degradation on the retain samples, while effectively removing the information from the forget samples. Empirical studies on two large-scale datasets, ImageNet-1K and Places-365, further show that our algorithm does not rely on the availability of the retain samples, which further complies with data retention policy. To our best knowledge, this work is the first that represents systemic, theoretical, empirical explorations of machine unlearning specifically tailored for image-to-image generative models. Our code is available at https://github.com/jpmorganchase/l2l-generator-unlearning. This paper presents the first systematic exploration of machine unlearning for image-to-image (I2I) generative models, proposing a novel framework and an efficient algorithm. Machine unlearning for generative models is crucial due to their superior data memorization capability and the increasing demand for data privacy and copyright protection. The authors formulate the problem as maximizing the KL-divergence between the distributions of generated images from the forget set and a Gaussian distribution. They then derive a tractable lower bound based on mutual information and minimize the L2 distance between encoder outputs. The proposed approach effectively eliminates information from the forget set while maintaining near-identical performance on the retain set, as demonstrated on ImageNet-1K and Places-365. The framework is generally applicable to various I2I models, including diffusion models, VQ-GAN, and MAE. The method exhibits robustness to limited or unavailable retain samples, offering flexibility in practical applications. The method is primarily evaluated on I2I generative models and requires access to original forget samples. Future work includes extending the approach to other modalities (text, text-to-image) and exploring practical scenarios for content control and privacy protection. machine unlearning, generative models, image-to-image, data privacy, copyright protection
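To make the "minimize the L2 distance between encoder outputs" step concrete, a hedged sketch of such an unlearning objective is given below; the pairing of forget-sample embeddings with Gaussian-noise embeddings, the frozen-copy targets for retain samples, and the weighting term are assumptions for illustration, not the authors' exact formulation.

```python
# Hedged sketch of the unlearning objective described above: drive the encoder's
# embeddings of forget samples toward embeddings of Gaussian noise (information
# removal), while keeping retain-sample embeddings close to those of a frozen
# copy of the original encoder. Shapes, targets, and the weighting are assumptions.
import torch

def unlearning_loss(encoder, frozen_encoder, x_forget, x_retain, alpha=0.5):
    noise = torch.randn_like(x_forget)
    with torch.no_grad():
        target_forget = frozen_encoder(noise)      # embeddings of pure noise
        target_retain = frozen_encoder(x_retain)   # embeddings of the original model
    # Forget term: match noise embeddings so the forget content is removed.
    loss_forget = torch.mean((encoder(x_forget) - target_forget) ** 2)
    # Retain term: stay close to the original model on samples to be kept.
    loss_retain = torch.mean((encoder(x_retain) - target_retain) ** 2)
    return loss_forget + alpha * loss_retain
```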
2402.00240 Report Spectral Norm of Convolutional Layers with Circular and Zero Paddings Blaise Delattre, Quentin Barthélemy, Alexandre Allauzen This paper leverages the use of \emph{Gram iteration}, an efficient, deterministic, and differentiable method for computing spectral norm with an upper bound guarantee. Designed for circular convolutional layers, we generalize the use of the Gram iteration to zero padding convolutional layers and prove its quadratic convergence. We also provide theorems for bridging the gap between circular and zero padding convolution's spectral norm. We design a \emph{spectral rescaling} that can be used as a competitive $1$-Lipschitz layer that enhances network robustness. Demonstrated through experiments, our method outperforms state-of-the-art techniques in precision, computational cost, and scalability. The code of experiments is available at https://github.com/blaisedelattre/lip4conv. This paper generalizes Gram iteration, an efficient, deterministic, and differentiable method for computing the spectral norm with an upper bound guarantee, from circular to zero-padding convolutional layers, outperforming state-of-the-art techniques in precision, computational cost, and scalability. Spectral norm regularization in CNNs is crucial for enhancing generalization, stabilizing training, and bolstering robustness against adversarial attacks. The authors leverage Gelfand's formula to generalize Gram iteration for any matrix norm, proving its quadratic convergence. They extend its application to zero padding convolutions and establish theoretical bounds bridging circular and zero padding spectral norms, as well as linking input size to the bound. Gram iteration provides a guaranteed upper bound on the spectral norm of convolutional layers, unlike power iteration methods. The proposed method achieves superior accuracy in spectral norm estimation compared to existing techniques while maintaining computational efficiency. Spectral Rescaling (SR), a novel 1-Lipschitz layer derived from Gram iteration, demonstrably enhances the robustness of CNNs against adversarial attacks. Future work includes exploring the adaptability of Gram iteration for computing multiple singular values. Further investigation into the trade-off between the tightness of the spectral norm bound and computational cost is warranted. convolutional neural networks, spectral norm, adversarial robustness, gram iteration, lipschitz layers
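For intuition, the sketch below shows the basic Gram-iteration bound on a plain matrix in NumPy: iterating G_k = G_{k-1}^T G_{k-1} and taking ||G_k||_F^(1/2^k) gives an upper bound on the spectral norm that tightens quadratically. The iteration count and the log-space rescaling are implementation choices for this sketch; the authors' code (linked above) handles the circular- and zero-padding convolution cases.

```python
# Minimal NumPy sketch of the Gram-iteration upper bound on the spectral norm of
# a plain matrix W (the paper applies the same idea to convolutional layers).
# After k Gram steps G_k = G_{k-1}^T G_{k-1}, we have ||G_k||_F ** (1 / 2**k)
# >= sigma_max(W), and the bound converges quadratically. Rescaling is tracked
# in log space to avoid overflow; this is a sketch, not the authors' implementation.
import numpy as np

def gram_iteration_bound(W: np.ndarray, n_iters: int = 6) -> float:
    A = W.astype(np.float64)
    log_scale = 0.0                       # log of the factor pulled out so far
    for _ in range(n_iters):
        norm = np.linalg.norm(A)          # Frobenius norm
        A = A / norm
        log_scale = 2.0 * (log_scale + np.log(norm))
        A = A.T @ A                       # Gram step
    log_bound = (log_scale + np.log(np.linalg.norm(A))) / (2.0 ** n_iters)
    return float(np.exp(log_bound))

W = np.random.randn(64, 128)
print(gram_iteration_bound(W), np.linalg.norm(W, 2))  # bound >= true spectral norm
```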
2402.00225 Report Geometry aware 3D generation from in-the-wild images in ImageNet Qijia Shen, Guangrun Wang Generating accurate 3D models is a challenging problem that traditionally requires explicit learning from 3D datasets using supervised learning. Although recent advances have shown promise in learning 3D models from 2D images, these methods often rely on well-structured datasets with multi-view images of each instance or camera pose information. Furthermore, these datasets usually contain clean backgrounds with simple shapes, making them expensive to acquire and hard to generalize, which limits the applicability of these methods. To overcome these limitations, we propose a method for reconstructing 3D geometry from the diverse and unstructured Imagenet dataset without camera pose information. We use an efficient triplane representation to learn 3D models from 2D images and modify the architecture of the generator backbone based on StyleGAN2 to adapt to the highly diverse dataset. To prevent mode collapse and improve the training stability on diverse data, we propose to use multi-view discrimination. The trained generator can produce class-conditional 3D models as well as renderings from arbitrary viewpoints. The class-conditional generation results demonstrate significant improvement over the current state-of-the-art method. Additionally, using PTI, we can efficiently reconstruct the whole 3D geometry from single-view images. This paper introduces a novel method for generating 3D models from diverse and unstructured 2D image datasets, like ImageNet, without relying on camera pose information. This approach is significant because it allows for learning 3D representations from widely available 2D data, overcoming limitations of previous methods reliant on structured datasets with multi-view or camera pose data. The method leverages a triplane representation for 3D modeling and modifies the StyleGAN2 generator architecture for improved learning from diverse datasets. Additionally, it employs a multi-view discrimination technique to enhance training stability and prevent mode collapse. The model successfully generates class-conditional 3D models and renderings from arbitrary viewpoints, demonstrating significant improvement over existing methods. Training on diverse datasets enables the model to infer plausible 3D shapes even for objects with limited viewpoints in the training data. The method allows for efficient single-view 3D reconstruction using pivotal tuning inversion. The issue of unknown camera poses for in-the-wild images is not fully addressed, posing a limitation to be explored in future work. Future work could investigate incorporating depth information during training as additional supervision for geometry and explore alternative architectures beyond StyleGAN2. 3d generation, imagenet, triplane representation, multi-view discrimination, single-view reconstruction
2402.00033 Report LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition Youbing Hu, Yun Cheng, Anqi Lu, Zhiqiang Cao, Dawei Wei, Jie Liu, Zhijun Li The Vision Transformer (ViT) excels in accuracy when handling high-resolution images, yet it confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements. To address this, we present the Localization and Focus Vision Transformer (LF-ViT). This model operates by strategically curtailing computational demands without impinging on performance. In the Localization phase, a reduced-resolution image is processed; if a definitive prediction remains elusive, our pioneering Neighborhood Global Class Attention (NGCA) mechanism is triggered, effectively identifying and spotlighting class-discriminative regions based on initial findings. Subsequently, in the Focus phase, this designated region is used from the original image to enhance recognition. Uniquely, LF-ViT employs consistent parameters across both phases, ensuring seamless end-to-end optimization. Our empirical tests affirm LF-ViT's prowess: it remarkably decreases Deit-S's FLOPs by 63\% and concurrently amplifies throughput twofold. Code of this project is at https://github.com/edgeai1/LF-ViT.git. This paper presents LF-ViT, a novel two-stage Vision Transformer framework designed to optimize computational efficiency for high-resolution image recognition by minimizing spatial redundancy. ViT excels in accuracy but suffers from high computational costs, especially with increasing image resolutions, hindering deployment on resource-limited devices. This paper addresses this by focusing computation on minimal class-discriminative image regions. LF-ViT employs a two-stage approach: (1) Localization: a down-sampled image is processed, and if a confident prediction isn't reached, a novel Neighborhood Global Class Attention (NGCA) mechanism identifies class-discriminative regions. (2) Focus: these regions are processed from the original image for enhanced recognition, employing feature reuse and fusion mechanisms for further optimization. LF-ViT significantly reduces Deit-S's FLOPs by 63% while maintaining accuracy, resulting in a 2.03x throughput improvement on an A100 GPU. The NGCA mechanism effectively identifies class-discriminative regions, outperforming other region selection alternatives. LF-ViT consistently surpasses state-of-the-art ViT optimization models in both accuracy and computational efficiency, demonstrating a superior balance between performance and resource usage. The current implementation of LF-ViT is limited to image classification tasks. Further research is needed to explore the integration of LF-ViT with token pruning methods for enhanced efficiency. vision transformer, adaptive inference, spatial redundancy, class-discriminative regions, computational efficiency
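A hedged sketch of the two-phase inference control flow is given below; `vit`, `locate_region`, the input resolution, and the 0.5 confidence threshold are placeholders (the actual model reuses one set of parameters across both phases and picks the crop with the proposed NGCA mechanism).

```python
# Hedged sketch of LF-ViT-style two-phase inference: classify a down-sampled
# copy, exit early if confident, otherwise re-run only the class-discriminative
# crop of the original image. Assumes a single image and that `vit` also returns
# attention maps; both are simplifying assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def two_phase_infer(vit, locate_region, image, low_res=112, threshold=0.5):
    # Localization phase on the down-sampled image.
    small = F.interpolate(image, size=(low_res, low_res), mode="bilinear",
                          align_corners=False)
    logits, attn = vit(small)                    # assumed (logits, attention) output
    conf, pred = logits.softmax(dim=-1).max(dim=-1)
    if conf.item() >= threshold:                 # confident -> early exit
        return pred
    # Focus phase: crop the discriminative region from the original image.
    y0, y1, x0, x1 = locate_region(attn, image.shape[-2:])   # e.g. NGCA-style box
    crop = F.interpolate(image[..., y0:y1, x0:x1], size=(low_res, low_res),
                         mode="bilinear", align_corners=False)
    logits_focus, _ = vit(crop)
    return logits_focus.argmax(dim=-1)
```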
2401.17992 Report Multilinear Operator Networks Yixin Cheng, Grigorios G. Chrysos, Markos Georgopoulos, Volkan Cevher Despite the remarkable capabilities of deep neural networks in image recognition, the dependence on activation functions remains a largely unexplored area and has yet to be eliminated. On the other hand, Polynomial Networks are a class of models that do not require activation functions, but they have yet to perform on par with modern architectures. In this work, we aim to close this gap and propose MONet, which relies solely on multilinear operators. The core layer of MONet, called Mu-Layer, captures multiplicative interactions of the elements of the input token. MONet captures high-degree interactions of the input elements and we demonstrate the efficacy of our approach on a series of image recognition and scientific computing benchmarks. The proposed model outperforms prior polynomial networks and performs on par with modern architectures. We believe that MONet can inspire further research on models that use entirely multilinear operations. Introduces the Multilinear Operator Network (MONet), a Polynomial Network (PN) based solely on multilinear operations, avoiding activation functions while achieving performance competitive with modern architectures. Addresses the limitations of deep neural networks' reliance on activation functions, which hinders their application in privacy-preserving settings like Fully Homomorphic Encryption. Proposes a core layer, the Mu-Layer, that uses multilinear operations to capture multiplicative interactions within input tokens. The architecture stacks these layers to capture high-degree interactions, enabling a polynomial expansion of the input data. MONet significantly outperforms prior PNs on ImageNet, achieving over 10% improvement over the previous state-of-the-art. Achieves competitive performance with modern architectures like transformers and MLP-based models on ImageNet and other image recognition benchmarks. Demonstrates strong robustness to image corruptions on ImageNet-C and shows promise in scientific computing by accurately recovering formulas in a polynomial neural ODE solver experiment. Theoretical characterization of polynomial expansions achievable with MONet remains to be explored. Future work includes further theoretical analysis of the model's inductive bias and exploration of its potential beyond image recognition. polynomial networks, activation functions, multilinear operations, image recognition, privacy-preserving machine learning
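As a generic illustration of activation-free multiplicative interactions (not the exact Mu-Layer from the paper), the layer below combines two linear branches with a Hadamard product, so each layer is a degree-2 polynomial of its input and stacking L such layers yields a polynomial of degree 2^L.

```python
# Illustrative multiplicative-interaction layer in the spirit of MONet's Mu-Layer:
# no activation function is used; expressivity comes from the elementwise product
# of two linear projections. The exact branch structure of the published layer
# may differ from this sketch.
import torch
import torch.nn as nn

class MultiplicativeLayer(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.branch_a = nn.Linear(dim, hidden)
        self.branch_b = nn.Linear(dim, hidden)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hadamard product of the two projections captures pairwise
        # multiplicative interactions between input elements; the skip
        # connection lets stacked layers grow the polynomial degree.
        return x + self.proj(self.branch_a(x) * self.branch_b(x))

layer = MultiplicativeLayer(dim=64, hidden=128)
tokens = torch.randn(2, 196, 64)          # [batch, tokens, channels]
print(layer(tokens).shape)                # torch.Size([2, 196, 64])
```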
2401.17948 Report HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction Harvie Zhang The self-attention mechanism utilizes large implicit weight matrices, programmed through dot product-based activations with very few trainable parameters, to enable long sequence modeling. In this paper, we investigate the possibility of discarding residual learning by employing large implicit kernels to achieve full context interaction at each layer of the network. To accomplish it, we introduce coordinate-based implicit MLPs as a slow network to generate hyper-kernels for another fast convolutional network. To get context-varying weights for fast dynamic encoding, we propose a $\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator that connects hyper-kernels ($\mathcal{W}$) and hidden activations ($\mathcal{Z}$) through simple elementwise multiplication, followed by convolution of $\mathcal{Z}$ using the context-dependent $\mathcal{W}$. Based on this design, we present a novel Terminator architecture that integrates hyper-kernels of different sizes to produce multi-branch hidden representations for enhancing the feature extraction capability of each layer. Additionally, a bottleneck layer is employed to compress the concatenated channels, allowing only valuable information to propagate to the subsequent layers. Notably, our model incorporates several innovative components and exhibits excellent properties, such as introducing local feedback error for updating the slow network, stable zero-mean features, faster training convergence, and fewer model parameters. Extensive experimental results on pixel-level 1D and 2D image classification benchmarks demonstrate the superior performance of our architecture. This paper introduces Terminator, a novel neural network architecture that eliminates the need for residual learning by employing large implicit convolution kernels generated by coordinate-based implicit MLPs. Residual learning, while effective, poses challenges for interpretability and efficient training. Terminator addresses these limitations by enabling full context interaction at each layer through large kernels, enhancing feature extraction capabilities. The Terminator architecture leverages a novel Slow-Fast Neural Encoding (SFNE) block. This block uses a slow network (coordinate-based MLP) to generate hyper-kernels, and a fast network that interacts with the context via a proposed HyperZZW operator, which efficiently creates context-dependent weights using elementwise multiplication. Terminator achieves state-of-the-art performance on pixel-level 1D and 2D image classification benchmarks, surpassing residual networks and transformers. The architecture exhibits faster training convergence due to stable zero-mean features. Terminator requires fewer model parameters compared to other state-of-the-art architectures. The paper acknowledges limitations in evaluating Terminator on larger datasets like ImageNet due to computational constraints. Future work will focus on exploring more effective slow neural loss functions to further improve the accuracy of pixel-level scores. residual learning, implicit kernels, slow-fast networks, context-dependent weights, image classification
2401.17895 Report ReplaceAnything3D: Text-Guided 3D Scene Editing with Compositional Neural Radiance Fields Edward Bartrum, Thu Nguyen-Phuoc, Chris Xie, Zhengqin Li, Numair Khan, Armen Avetisyan, Douglas Lanman, Lei Xiao We introduce ReplaceAnything3D model (RAM3D), a novel text-guided 3D scene editing method that enables the replacement of specific objects within a scene. Given multi-view images of a scene, a text prompt describing the object to replace, and a text prompt describing the new object, our Erase-and-Replace approach can effectively swap objects in the scene with newly generated content while maintaining 3D consistency across multiple viewpoints. We demonstrate the versatility of ReplaceAnything3D by applying it to various realistic 3D scenes, showcasing results of modified foreground objects that are well-integrated with the rest of the scene without affecting its overall integrity. Introduces the ReplaceAnything3D model (RAM3D), a text-guided 3D scene editing method using an Erase-and-Replace approach for multi-view consistent object replacement in neural radiance fields. Addresses the growing demand for efficient 3D content creation and editing tools, particularly for tasks like object replacement in VR/MR, gaming, and film production. Employs a two-stage Erase-and-Replace approach: 1) Erases target objects and inpaints the background using a text-guided 3D inpainting technique and a Bubble-NeRF representation. 2) Replaces the erased object with a new object generated using a text-guided 3D inpainting technique, ensuring seamless blending and multi-view consistency. Achieves high-quality object replacement in various 3D scenes, including forward-facing and 360° scenes. Demonstrates superior performance compared to existing methods like Instruct-NeRF2NeRF and Blended-NeRF, particularly in preserving scene structure and generating realistic object details. Extends beyond object replacement to enable object removal and addition with realistic lighting and multi-view consistency. May remove important structural information from the original objects due to the Erase-and-Replace approach, making it unsuitable for editing tasks that require preserving the original geometry. Suffers from artifacts common to text-to-image model distillation techniques, such as the Janus multi-face problem. 3d scene editing, neural radiance fields, text-guided image inpainting, object replacement, multi-view consistency
2401.17879 Report AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error Jonas Ricker, Denis Lukovnikov, Asja Fischer With recent text-to-image models, anyone can generate deceptively realistic images with arbitrary contents, fueling the growing threat of visual disinformation. A key enabler for generating high-resolution images with low computational cost has been the development of latent diffusion models (LDMs). In contrast to conventional diffusion models, LDMs perform the denoising process in the low-dimensional latent space of a pre-trained autoencoder (AE) instead of the high-dimensional image space. Despite their relevance, the forensic analysis of LDMs is still in its infancy. In this work we propose AEROBLADE, a novel detection method which exploits an inherent component of LDMs: the AE used to transform images between image and latent space. We find that generated images can be more accurately reconstructed by the AE than real images, allowing for a simple detection approach based on the reconstruction error. Most importantly, our method is easy to implement and does not require any training, yet nearly matches the performance of detectors that rely on extensive training. We empirically demonstrate that AEROBLADE is effective against state-of-the-art LDMs, including Stable Diffusion and Midjourney. Beyond detection, our approach allows for the qualitative analysis of images, which can be leveraged for identifying inpainted regions. We release our code and data at https://github.com/jonasricker/aeroblade . AEROBLADE, a training-free method for detecting images generated by Latent Diffusion Models (LDMs) by exploiting the reconstruction error of the model's autoencoder (AE). The proliferation of LDMs enables easy creation of hyperrealistic images, posing a significant threat of visual disinformation and necessitating effective detection methods. AEROBLADE leverages the observation that LDM AEs reconstruct generated images more accurately than real images. It computes the reconstruction error using LPIPS distance between an image and its reconstruction by the AE. AEROBLADE effectively distinguishes real images from images generated by seven state-of-the-art LDMs, achieving a mean Average Precision (AP) of 0.992. The method doesn't require training, yet performs comparably to extensively trained classifiers. AEROBLADE provides qualitative information about image regions and their reconstructability, enabling identification of inpainted areas. Achieving optimal performance requires access to the AE of the specific LDM used for generation. Generated images with low complexity (e.g., logos) are more challenging to detect. image forensics, disinformation detection, latent diffusion models, autoencoder reconstruction error, generative ai
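A hedged sketch of the detection score is given below, using the `diffusers` and `lpips` packages; the specific VAE checkpoint, the 512-pixel preprocessing, and any decision threshold are illustrative assumptions. As noted above, optimal performance requires the autoencoder of the LDM that actually produced the image.

```python
# Hedged sketch of the AEROBLADE idea: reconstruct an image with a latent
# diffusion model's autoencoder and use the LPIPS reconstruction error as the
# detection score (lower error -> more likely generated). The checkpoint below
# is one public Stable Diffusion VAE, shown only as an example.
import torch
import lpips
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device).eval()
lpips_fn = lpips.LPIPS(net="vgg").to(device)

to_tensor = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),                       # [0, 1]
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # [-1, 1], as the VAE expects
])

@torch.no_grad()
def reconstruction_error(path: str) -> float:
    x = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    latents = vae.encode(x).latent_dist.mode()
    recon = vae.decode(latents).sample.clamp(-1, 1)
    return lpips_fn(x, recon).item()

# Images with a reconstruction error below a calibrated threshold are flagged
# as likely LDM-generated.
```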
2401.17868 Report Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model Zihan Zhong, Zhiqiang Tang, Tong He, Haoyang Fang, Chun Yuan The Segment Anything Model (SAM) stands as a foundational framework for image segmentation. While it exhibits remarkable zero-shot generalization in typical scenarios, its advantage diminishes when applied to specialized domains like medical imagery and remote sensing. To address this limitation, this paper introduces Conv-LoRA, a simple yet effective parameter-efficient fine-tuning approach. By integrating ultra-lightweight convolutional parameters into Low-Rank Adaptation (LoRA), Conv-LoRA can inject image-related inductive biases into the plain ViT encoder, further reinforcing SAM's local prior assumption. Notably, Conv-LoRA not only preserves SAM's extensive segmentation knowledge but also revives its capacity of learning high-level image semantics, which is constrained by SAM's foreground-background segmentation pretraining. Comprehensive experimentation across diverse benchmarks spanning multiple domains underscores Conv-LoRA's superiority in adapting SAM to real-world semantic segmentation tasks. The paper proposes Conv-LoRA, a novel parameter-efficient fine-tuning (PEFT) approach for adapting the Segment Anything Model (SAM) to downstream semantic segmentation tasks. While SAM excels in zero-shot generalization for generic object segmentation, its performance degrades in specialized domains like medical imaging and remote sensing. This work addresses these limitations by improving SAM's ability to capture local image priors and high-level semantic information. Conv-LoRA integrates lightweight convolutional layers within the Low-Rank Adaptation (LoRA) framework. It uses a Mixture-of-Experts (MoE) approach to dynamically inject local priors at appropriate feature scales. Additionally, the authors modify SAM's decoder to enable end-to-end multi-class segmentation. Conv-LoRA consistently outperforms other PEFT methods across diverse datasets spanning multiple domains (natural images, medical, agriculture, remote sensing). Analysis reveals that SAM's pretraining, focused on foreground-background separation, hinders its ability to learn high-level semantics crucial for multi-class segmentation. LoRA helps recover this capability. MoE proves effective in dynamically selecting the proper scale for local prior injection, leading to both performance gains and reduced computational cost compared to a multi-scale approach. While demonstrating strong general performance, Conv-LoRA may not consistently surpass domain-specific state-of-the-art models. Further tailoring of the mask decoder and prompt encoder might be needed for specific domains. Conv-LoRA introduces a slight computational overhead compared to other PEFT methods due to the upscaling/downscaling operations within MoE. Exploring alternative local prior injection methods without explicit scaling could be beneficial. semantic segmentation, segment anything model (sam), parameter-efficient fine-tuning (peft), low-rank adaptation (lora), mixture-of-experts (moe)
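A hedged, single-expert sketch of the idea is shown below: a small convolution inside the LoRA bottleneck injects a local spatial prior into a frozen linear projection of the ViT. The rank, kernel size, token-to-grid reshaping, and the omission of the mixture-of-experts routing over feature scales are simplifications relative to the published method.

```python
# Hedged sketch of a Conv-LoRA-style adapter: LoRA down-projection, a lightweight
# 3x3 convolution on the token grid (the local image prior), then the LoRA
# up-projection, added to a frozen pretrained linear layer.
import math
import torch
import torch.nn as nn

class ConvLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base                        # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.conv = nn.Conv2d(rank, rank, kernel_size=3, padding=1)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # standard LoRA init: start as a no-op
        self.scale = scale

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, N, C] patch tokens with N a perfect square (no CLS token here).
        b, n, _ = tokens.shape
        h = w = int(math.isqrt(n))
        z = self.down(tokens)                           # [B, N, r]
        z = z.transpose(1, 2).reshape(b, -1, h, w)      # [B, r, H, W]
        z = self.conv(z)                                # inject the local prior
        z = z.flatten(2).transpose(1, 2)                # back to [B, N, r]
        return self.base(tokens) + self.scale * self.up(z)
```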
2401.17857 Report SAGD: Boundary-Enhanced Segment Anything in 3D Gaussian via Gaussian Decomposition Xu Hu, Yuxi Wang, Lue Fan, Junsong Fan, Junran Peng, Zhen Lei, Qing Li, Zhaoxiang Zhang 3D Gaussian Splatting has emerged as an alternative 3D representation for novel view synthesis, benefiting from its high-quality rendering results and real-time rendering speed. However, the 3D Gaussians learned by 3D-GS have ambiguous structures without any geometry constraints. This inherent issue in 3D-GS leads to a rough boundary when segmenting individual objects. To remedy these problems, we propose SAGD, a conceptually simple yet effective boundary-enhanced segmentation pipeline for 3D-GS to improve segmentation accuracy while preserving segmentation speed. Specifically, we introduce a Gaussian Decomposition scheme, which ingeniously utilizes the special structure of 3D Gaussian, finds out, and then decomposes the boundary Gaussians. Moreover, to achieve fast interactive 3D segmentation, we introduce a novel training-free pipeline by lifting a 2D foundation model to 3D-GS. Extensive experiments demonstrate that our approach achieves high-quality 3D segmentation without rough boundary issues, which can be easily applied to other scene editing tasks. This paper proposes SAGD, a training-free pipeline for interactive and effective segmentation of 3D Gaussian Splatting (3D-GS), addressing the rough boundary issue inherent in existing methods. Accurate and efficient 3D segmentation in 3D-GS is crucial for scene understanding and editing applications, but existing methods suffer from rough boundaries due to the ambiguous nature of learned Gaussians. The method leverages a 2D foundation model (SAM) to generate multi-view masks from user prompts and introduces a Gaussian Decomposition scheme to decompose boundary Gaussians, thus refining segmentation boundaries. A voting strategy then determines the final 3D segmentation. Achieves high-quality 3D segmentation with smoother boundaries compared to previous methods (SA3D, SAGA). Demonstrates efficiency with significantly less or no training time compared to learning-based approaches. Shows strong performance on various datasets (SPIn-NeRF, LERF, Mip-NeRF 360) and applicability to scene editing and collision detection tasks. Performance degrades with a sparse 3D Gaussian distribution, suggesting future work on structured 3D-GS representation. The confidence score threshold requires manual adjustment depending on scene complexity and view quality. 3d gaussian splatting, 3d segmentation, boundary enhancement, gaussian decomposition, scene editing
2401.17807 Report Advances in 3D Generation: A Survey Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, Ying Shan Generating 3D models lies at the core of computer graphics and has been the focus of decades of research. With the emergence of advanced neural representations and generative models, the field of 3D content generation is developing rapidly, enabling the creation of increasingly high-quality and diverse 3D models. The rapid growth of this field makes it difficult to stay abreast of all recent developments. In this survey, we aim to introduce the fundamental methodologies of 3D generation methods and establish a structured roadmap, encompassing 3D representation, generation methods, datasets, and corresponding applications. Specifically, we introduce the 3D representations that serve as the backbone for 3D generation. Furthermore, we provide a comprehensive overview of the rapidly growing literature on generation methods, categorized by the type of algorithmic paradigms, including feedforward generation, optimization-based generation, procedural generation, and generative novel view synthesis. Lastly, we discuss available datasets, applications, and open challenges. We hope this survey will help readers explore this exciting topic and foster further advancements in the field of 3D content generation. This paper presents a comprehensive survey of 3D generation methods, encompassing 3D representations, generation techniques, datasets, and applications. 3D content generation is crucial for various applications like video games, movies, and immersive experiences, and has seen rapid advancements due to neural representations and generative models. The paper categorizes generation methods into four paradigms: feedforward, optimization-based, procedural, and generative novel view synthesis, analyzing each with representative examples. The survey provides a structured roadmap of 3D generation methodologies, highlighting advancements in generative models, 3D representations, and algorithmic paradigms. It discusses commonly used datasets for 3D generation, categorized by 3D data, multi-view images, and single-view images. The paper explores applications like 3D human, face, and general scene generation, and discusses 3D editing techniques. A key limitation is the lack of objective metrics to comprehensively evaluate the quality and diversity of generated 3D models. The field still needs large-scale, high-quality 3D datasets, and better utilization of existing 2D data for 3D generation. 3d generation, neural rendering, generative models, scene representations, 3d deep learning
2401.17629 Report Spatial-and-Frequency-aware Restoration method for Images based on Diffusion Models Kyungsung Lee, Donggyu Lee, Myungjoo Kang Diffusion models have recently emerged as a promising framework for Image Restoration (IR), owing to their ability to produce high-quality reconstructions and their compatibility with established methods. Existing methods for solving noisy inverse problems in IR, considers the pixel-wise data-fidelity. In this paper, we propose SaFaRI, a spatial-and-frequency-aware diffusion model for IR with Gaussian noise. Our model encourages images to preserve data-fidelity in both the spatial and frequency domains, resulting in enhanced reconstruction quality. We comprehensively evaluate the performance of our model on a variety of noisy inverse problems, including inpainting, denoising, and super-resolution. Our thorough evaluation demonstrates that SaFaRI achieves state-of-the-art performance on both the ImageNet datasets and FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS and FID metrics. This paper proposes SaFaRI, a novel diffusion model-based image restoration approach that incorporates spatial and frequency information into the data fidelity term for enhanced restoration performance. Existing methods for solving noisy inverse problems in image restoration typically rely on pixel-wise data fidelity, which does not fully capture perceptual features important for high-quality image reconstruction. SaFaRI modifies the data fidelity term using bicubic upsampling for spatial context and Fourier transformation for frequency domain representation, allowing for a more comprehensive representation of perceptual attributes. The method iteratively refines the generated image by minimizing the weighted sum of spatial and frequency-aware data fidelity terms. SaFaRI achieves state-of-the-art performance on ImageNet and FFHQ datasets, outperforming existing zero-shot image restoration methods in terms of LPIPS and FID metrics. The method effectively restores images across various tasks, including inpainting, denoising, and super-resolution. The use of either spatial or frequency information alone in SaFaRI is sufficient to outperform existing methods, demonstrating the effectiveness of the proposed approach. The transformation applied to the data fidelity term may introduce perturbations to the feasible solutions due to the influence of the prior term. Future work could involve a comprehensive analysis of these solution perturbations to strengthen the theoretical foundation of the methodology. image restoration, diffusion models, data fidelity, perceptual quality, spatial and frequency information
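A minimal sketch of such a spatial-and-frequency-aware fidelity term is given below for a generic differentiable degradation operator `A`; the bicubic upsampling factor, the FFT-based frequency term, and the weighting are illustrative placeholders rather than the paper's exact construction.

```python
# Hedged sketch of a spatial-and-frequency-aware data-fidelity term: compare the
# degraded estimate and the measurement both after bicubic upsampling (spatial
# context) and in the Fourier domain (frequency content). Weights and the
# upsampling factor are placeholders.
import torch
import torch.nn.functional as F

def safa_fidelity(x0_hat, y, A, up_factor=2, freq_weight=1.0):
    pred, obs = A(x0_hat), y                    # both [B, C, H, W]

    def up(t):
        return F.interpolate(t, scale_factor=up_factor, mode="bicubic",
                             align_corners=False)

    # Spatial term: bicubic upsampling spreads each residual over a neighbourhood.
    spatial = ((up(pred) - up(obs)) ** 2).mean()
    # Frequency term: compare Fourier coefficients of the measurements.
    freq = (torch.fft.fft2(pred) - torch.fft.fft2(obs)).abs().pow(2).mean()
    return spatial + freq_weight * freq

# During sampling, the gradient of this fidelity with respect to the current
# estimate would guide each reverse-diffusion step, as in standard
# posterior-sampling image restoration.
```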
2401.17509 Report Anything in Any Scene: Photorealistic Video Object Insertion Chen Bai, Zeman Shao, Guoxiang Zhang, Di Liang, Jie Yang, Zhuorui Zhang, Yujian Guo, Chengzhang Zhong, Yiqiao Qiu, Zhendong Wang, Yichen Guan, Xiaoyin Zheng, Tao Wang, Cheng Lu Realistic video simulation has shown significant potential across diverse applications, from virtual reality to film production. This is particularly true for scenarios where capturing videos in real-world settings is either impractical or expensive. Existing approaches in video simulation often fail to accurately model the lighting environment, represent the object geometry, or achieve high levels of photorealism. In this paper, we propose Anything in Any Scene, a novel and generic framework for realistic video simulation that seamlessly inserts any object into an existing dynamic video with a strong emphasis on physical realism. Our proposed general framework encompasses three key processes: 1) integrating a realistic object into a given scene video with proper placement to ensure geometric realism; 2) estimating the sky and environmental lighting distribution and simulating realistic shadows to enhance the light realism; 3) employing a style transfer network that refines the final video output to maximize photorealism. We experimentally demonstrate that Anything in Any Scene framework produces simulated videos of great geometric realism, lighting realism, and photorealism. By significantly mitigating the challenges associated with video data generation, our framework offers an efficient and cost-effective solution for acquiring high-quality videos. Furthermore, its applications extend well beyond video data augmentation, showing promising potential in virtual reality, video editing, and various other video-centric applications. Please check our project website https://anythinginanyscene.github.io for access to our project code and more high-resolution video results. This paper introduces "Anything in Any Scene", a novel framework for realistic video simulation that seamlessly inserts any object into existing dynamic videos with a focus on physical realism. Existing video simulation methods often struggle to accurately model lighting, object geometry, and photorealism, limiting their application in fields like autonomous driving and robotics. The framework employs a three-step process: 1) object integration into the scene video with proper placement, 2) sky and environmental lighting estimation and realistic shadow simulation, 3) style transfer network refinement for enhanced photorealism. The proposed framework generates simulated videos with high geometric, lighting, and photorealism, outperforming other methods. Human studies and Frechet Inception Distance (FID) scores demonstrate the effectiveness of the framework. The framework proves valuable for data augmentation in perception algorithms, improving object detection performance on rare classes. The placement of objects in constrained indoor scenes can be challenging due to limited space. Future work includes incorporating improved 3D mesh reconstruction methods and exploring new applications beyond data augmentation. video simulation, photorealism, object insertion, lighting estimation, style transfer
2401.17270 Report YOLO-World: Real-Time Open-Vocabulary Object Detection Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation. Introduces YOLO-World, an efficient open-vocabulary object detector that enhances traditional YOLO with open-vocabulary capabilities via vision-language modeling and large-scale pre-training. Addresses the limitation of traditional object detectors, like YOLO, being restricted to a fixed set of object categories. Leverages pre-trained CLIP text encoder and introduces a novel Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). Pre-trains the model on large-scale detection, grounding, and image-text datasets using a region-text contrastive learning scheme. Achieves state-of-the-art zero-shot performance on the LVIS dataset with 35.4 AP at 52.0 FPS on a V100 GPU. Demonstrates strong generalization capabilities, effectively transferring to downstream tasks like open-vocabulary instance segmentation and referring object detection. Proves the effectiveness of vision-language pre-training for smaller models, allowing for efficient deployment. Fine-tuning on limited datasets can degrade the generalization ability gained from pre-training. Using excessive amounts of pseudo-labeled data for pre-training can negatively impact smaller models. open-vocabulary object detection, vision-language pre-training, yolo, region-text contrastive learning, real-time object detection
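A hedged sketch of a region-text matching head of this general kind is given below: detector region embeddings are compared with text embeddings of the vocabulary via normalized dot products, and a cross-entropy loss aligns matched region-text pairs. The names and temperature value are placeholders, and the RepVL-PAN fusion itself is not shown.

```python
# Hedged sketch of region-text matching: score each region embedding against
# CLIP-style text embeddings of the vocabulary and train with a contrastive
# (cross-entropy) objective over the matched text index.
import torch
import torch.nn.functional as F

def region_text_logits(region_emb, text_emb, temperature=0.01):
    # region_emb: [num_regions, D]; text_emb: [num_classes, D] (e.g. from CLIP).
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return region_emb @ text_emb.t() / temperature    # similarity logits

def region_text_contrastive_loss(region_emb, text_emb, labels):
    # labels: [num_regions] index of the matched text for each region.
    return F.cross_entropy(region_text_logits(region_emb, text_emb), labels)

regions = torch.randn(8, 512)
texts = torch.randn(80, 512)
labels = torch.randint(0, 80, (8,))
print(region_text_contrastive_loss(regions, texts, labels))
```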
2401.17258 Report You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos In this paper, we introduce YONOS-SR, a novel stable diffusion-based approach for image super-resolution that yields state-of-the-art results using only a single DDIM step. We propose a novel scale distillation approach to train our SR model. Instead of directly training our SR model on the scale factor of interest, we start by training a teacher model on a smaller magnification scale, thereby making the SR problem simpler for the teacher. We then train a student model for a higher magnification scale, using the predictions of the teacher as a target during the training. This process is repeated iteratively until we reach the target scale factor of the final model. The rationale behind our scale distillation is that the teacher aids the student diffusion model training by i) providing a target adapted to the current noise level rather than using the same target coming from ground truth data for all noise levels and ii) providing an accurate target as the teacher has a simpler task to solve. We empirically show that the distilled model significantly outperforms the model trained for high scales directly, specifically with few steps during inference. Having a strong diffusion model that requires only one step allows us to freeze the U-Net and fine-tune the decoder on top of it. We show that the combination of spatially distilled U-Net and fine-tuned decoder outperforms state-of-the-art methods requiring 200 steps with only one single step. This paper presents YONOS-SR, a novel stable diffusion-based image super-resolution approach that achieves state-of-the-art results using only a single DDIM step. Diffusion models are computationally expensive for image super-resolution due to the large number of denoising steps required. YONOS-SR addresses this by enabling high-quality super-resolution with just one step, making it significantly faster. The paper introduces 'scale distillation', a novel training strategy where a 'student' model learns from a 'teacher' model trained on a smaller magnification scale. This simplifies the super-resolution task, allowing the student to achieve good results with fewer steps. Additionally, the decoder is fine-tuned on top of the frozen one-step diffusion model to further improve quality. YONOS-SR outperforms state-of-the-art diffusion-based SR methods that require 200 steps, using only one step. Scale distillation significantly improves performance, especially with few steps, by providing a more accurate and noise-adaptive target for training. Fine-tuning the decoder on top of the frozen one-step diffusion model further enhances results. The model's performance with extremely low-resolution images can be further improved. Exploring the application of scale distillation to other inverse imaging problems, such as image inpainting, is a promising future direction. image super-resolution, diffusion models, stable diffusion, scale distillation, fast inference
2401.17181 Report Transfer Learning for Text Diffusion Models Kehang Han, Kathleen Kenealy, Aditya Barua, Noah Fiedel, Noah Constant In this report, we explore the potential for text diffusion to replace autoregressive (AR) decoding for the training and deployment of large language models (LLMs). We are particularly interested to see whether pretrained AR models can be transformed into text diffusion models through a lightweight adaptation procedure we call ``AR2Diff''. We begin by establishing a strong baseline setup for training text diffusion models. Comparing across multiple architectures and pretraining objectives, we find that training a decoder-only model with a prefix LM objective is best or near-best across several tasks. Building on this finding, we test various transfer learning setups for text diffusion models. On machine translation, we find that text diffusion underperforms the standard AR approach. However, on code synthesis and extractive QA, we find diffusion models trained from scratch outperform AR models in many cases. We also observe quality gains from AR2Diff -- adapting AR models to use diffusion decoding. These results are promising given that text diffusion is relatively underexplored and can be significantly faster than AR decoding for long text generation. This paper investigates the potential of adapting pretrained autoregressive language models (LLMs) for non-autoregressive text generation using text diffusion, a method called "AR2Diff". This work aims to address the limitations of autoregressive decoding in LLMs, particularly its inefficiency in long text generation, by exploring the feasibility of text diffusion as a faster alternative. The authors compare different model architectures, pretraining objectives, and transfer learning strategies for text diffusion. They also introduce AR2Diff, a method to adapt pretrained AR models for diffusion, and evaluate its performance against AR and diffusion baselines on machine translation, question answering, and code synthesis tasks. Decoder-only models pretrained with a prefix language modeling objective are found to be most suitable for text diffusion. Text diffusion models can achieve competitive performance with autoregressive models on code synthesis and question answering tasks, but not on machine translation. AR2Diff, especially with longer adaptation stages, can further improve the performance of diffusion models, often surpassing pure diffusion baselines and sometimes approaching autoregressive baselines. The study primarily focuses on a limited set of tasks and datasets. Further research is needed to explore the full potential of caching and other optimization techniques to enhance the inference speed of text diffusion. text generation, diffusion models, non-autoregressive models, large language models, transfer learning
2401.17053 Report BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation Zhennan Wu, Yang Li, Han Yan, Taizhang Shang, Weixuan Sun, Senbo Wang, Ruikai Cui, Weizhe Liu, Hiroyuki Sato, Hongdong Li, Pan Ji We present BlockFusion, a diffusion-based model that generates 3D scenes as unit blocks and seamlessly incorporates new blocks to extend the scene. BlockFusion is trained using datasets of 3D blocks that are randomly cropped from complete 3D scene meshes. Through per-block fitting, all training blocks are converted into the hybrid neural fields: with a tri-plane containing the geometry features, followed by a Multi-layer Perceptron (MLP) for decoding the signed distance values. A variational auto-encoder is employed to compress the tri-planes into the latent tri-plane space, on which the denoising diffusion process is performed. Diffusion applied to the latent representations allows for high-quality and diverse 3D scene generation. To expand a scene during generation, one needs only to append empty blocks to overlap with the current scene and extrapolate existing latent tri-planes to populate new blocks. The extrapolation is done by conditioning the generation process with the feature samples from the overlapping tri-planes during the denoising iterations. Latent tri-plane extrapolation produces semantically and geometrically meaningful transitions that harmoniously blend with the existing scene. A 2D layout conditioning mechanism is used to control the placement and arrangement of scene elements. Experimental results indicate that BlockFusion is capable of generating diverse, geometrically consistent and unbounded large 3D scenes with unprecedented high-quality shapes in both indoor and outdoor scenarios. Presents BlockFusion, a novel method for generating expansive 3D scenes using latent tri-plane representation and diffusion models. It extrapolates new latent codes for unseen regions, enabling the generation of out-of-bound content. Existing 3D scene generation methods struggle to generate coherent and expansive scenes due to limited capacity and lack of extrapolation capabilities. 1. Encode 3D scene into latent tri-plane features using a variational autoencoder (VAE). 2. Train a diffusion model on these latent codes. 3. Extrapolate new latent codes for unseen regions by sampling from the diffusion model. 4. Decode the extrapolated latent codes to generate new 3D content. Generates coherent and expansive 3D scenes with diverse layouts and styles. Outperforms existing methods in terms of scene consistency, diversity, and visual quality. Demonstrates strong generalization ability, enabling the generation of novel content beyond the training set. Limited control over the generated content. Computational cost for large scenes. 3d scene generation, diffusion model, latent representation, tri-plane, extrapolation
2401.16861 Report Repositioning the Subject within Image Yikai Wang, Chenjie Cao, Ke Fan, Qiaole Dong, Yifan Li, Xiangyang Xue, Yanwei Fu Current image manipulation primarily centers on static manipulation, such as replacing specific regions within an image or altering its overall style. In this paper, we introduce an innovative dynamic manipulation task, subject repositioning. This task involves relocating a user-specified subject to a desired position while preserving the image's fidelity. Our research reveals that the fundamental sub-tasks of subject repositioning, which include filling the void left by the repositioned subject, reconstructing obscured portions of the subject and blending the subject to be consistent with surrounding areas, can be effectively reformulated as a unified, prompt-guided inpainting task. Consequently, we can employ a single diffusion generative model to address these sub-tasks using various task prompts learned through our proposed task inversion technique. Additionally, we integrate pre-processing and post-processing techniques to further enhance the quality of subject repositioning. These elements together form our SEgment-gEnerate-and-bLEnd (SEELE) framework. To assess SEELE's effectiveness in subject repositioning, we assemble a real-world subject repositioning dataset called ReS. Results of SEELE on ReS demonstrate its efficacy. This paper introduces the novel task of subject repositioning in images and proposes the SEELE framework to address it. Subject repositioning enables dynamic object manipulation within images, pushing beyond static editing techniques. SEELE employs a single diffusion model guided by learned task prompts (task inversion) to tackle sub-tasks like subject removal, completion, and harmonization. SEELE effectively repositions subjects in diverse scenes, outperforming Stable Diffusion variants on the ReS dataset. Task inversion proves valuable for adapting a single diffusion model to multiple sub-tasks, improving consistency and quality. SEELE's modular design allows for the incorporation of components like depth estimation and matting, enhancing realism. SEELE's performance depends on the accuracy of individual components, requiring manual intervention in case of errors. Developing models for open-vocabulary amodal mask generation is crucial for improved subject completion with occlusions. subject repositioning, image manipulation, diffusion models, task inversion, image inpainting
2401.16764 Report BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion Yonghao Yu, Shunan Zhu, Huai Qin, Haorui Li Witnessing the evolution of text-to-image diffusion models, significant strides have been made in text-to-3D generation. Currently, two primary paradigms dominate the field of text-to-3D: the feed-forward generation solutions, capable of swiftly producing 3D assets but often yielding coarse results, and the Score Distillation Sampling (SDS) based solutions, known for generating high-fidelity 3D assets albeit at a slower pace. The synergistic integration of these methods holds substantial promise for advancing 3D generation techniques. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality. The BoostDream framework comprises three distinct processes: (1) We introduce 3D model distillation that fits differentiable representations from the 3D assets obtained through feed-forward generation. (2) A novel multi-view SDS loss is designed, which utilizes a multi-view aware 2D diffusion model to refine the 3D assets. (3) We propose to use prompt and multi-view consistent normal maps as guidance in refinement.Our extensive experiment is conducted on different differentiable 3D representations, revealing that BoostDream excels in generating high-quality 3D assets rapidly, overcoming the Janus problem compared to conventional SDS-based methods. This breakthrough signifies a substantial advancement in both the efficiency and quality of 3D generation processes. This paper introduces BoostDream, a highly efficient plug-and-play 3D refining method for transforming coarse 3D assets into high-quality ones by combining advantages of feed-forward and SDS-based methods. Current text-to-3D generation methods suffer from either coarse results (feed-forward methods) or slow generation speed (SDS-based methods). BoostDream aims to address this trade-off by enabling efficient generation of high-quality 3D assets. BoostDream consists of three stages: (1) 3D model distillation for initializing differentiable 3D representations from coarse assets; (2) Multi-view SDS loss utilizing a multi-view aware 2D diffusion model for refinement; and (3) Refinement guided by prompt and multi-view consistent normal maps. BoostDream excels in generating high-quality 3D assets rapidly. It effectively mitigates the Janus problem encountered in conventional SDS-based methods. It demonstrates strong generalizability by being applicable to various 3D differentiable representations. The current implementation relies on existing 2D diffusion models, inheriting their limitations and biases. Future work could explore optimizing the multi-view rendering system for improved efficiency and exploring alternative control conditions beyond normal maps. 3d generation, text-to-3d, diffusion models, differentiable rendering, multi-view synthesis
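For reference, a hedged sketch of a vanilla SDS step, the building block that BoostDream extends with a multi-view-aware diffusion model and normal-map guidance, is shown below; `unet`, `scheduler`, `text_emb`, the timestep range, and the weighting follow common latent-diffusion conventions and are not taken from the paper's code.

```python
# Hedged sketch of a single Score Distillation Sampling (SDS) loss evaluation on
# rendered latents, written against a diffusers-style UNet and noise scheduler.
import torch
import torch.nn.functional as F

def sds_loss(latents, unet, scheduler, text_emb, weight=1.0):
    # Sample a timestep and noise the rendered latents.
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    # Predict the noise with the (frozen) text-conditioned diffusion model.
    noise_pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    grad = weight * (noise_pred - noise)          # SDS gradient w.r.t. the latents
    target = (latents - grad).detach()
    # MSE against the shifted target reproduces d(loss)/d(latents) = grad,
    # which backpropagates into the differentiable 3D representation.
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```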
2401.16762 Report Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization Henglei Lv, Jiayu Xiao, Liang Li, Qingming Huang Diffusion-based text-to-image personalization has achieved great success in generating subjects specified by users among various contexts. Even so, existing finetuning-based methods still suffer from model overfitting, which greatly harms the generative diversity, especially when given subject images are few. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. As for the former, we construct an appearance palette with visual features from the reference image, where we pick local patterns for generating the specified subject with consistent identity. As for layout drawing, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit the strong image prior to synthesize diverse contexts according to different text conditions. The proposed approach can be applied to any personalized diffusion model and requires as few as a single reference image. Qualitative and quantitative experiments show that Pick-and-Draw consistently improves identity consistency and generative diversity, pushing the trade-off between subject fidelity and image-text fidelity to a new Pareto frontier. Proposes Pick-and-Draw, a training-free semantic guidance approach to enhance identity consistency and generative diversity for text-to-image personalization models. Existing finetuning-based personalization methods suffer from model overfitting, harming generative diversity, especially with limited subject images. Uses appearance picking guidance (extracts visual features from reference image to guide subject generation) and layout drawing guidance (utilizes subject's contour from original diffusion model as a template for diverse context synthesis). Pick-and-Draw consistently improves identity consistency and generative diversity across different personalization methods. Quantitative evaluation on DreamBench dataset shows significant improvement in subject fidelity and image-text alignment. Directly applying Pick-and-Draw to vanilla Stable Diffusion yields surprisingly favorable outcomes, showing potential for training-free single-image personalization. Pick-and-Draw may fail when the Stable Diffusion template provides incorrect layout priors. Incomplete appearance transfer may occur if the generated subject significantly differs from the reference. text-to-image personalization, diffusion models, semantic guidance, appearance transfer, layout drawing
2401.16741 Report MESA: Matching Everything by Segmenting Anything Yesheng Zhang, Xu Zhao Feature matching is a crucial task in the field of computer vision, which involves finding correspondences between images. Previous studies achieve remarkable performance using learning-based feature comparison. However, the pervasive presence of matching redundancy between images gives rise to unnecessary and error-prone computations in these methods, imposing limitations on their accuracy. To address this issue, we propose MESA, a novel approach to establish precise area (or region) matches for efficient matching redundancy reduction. MESA first leverages the advanced image understanding capability of SAM, a state-of-the-art foundation model for image segmentation, to obtain image areas with implicit semantic. Then, a multi-relational graph is proposed to model the spatial structure of these areas and construct their scale hierarchy. Based on graphical models derived from the graph, the area matching is reformulated as an energy minimization task and effectively resolved. Extensive experiments demonstrate that MESA yields substantial precision improvement for multiple point matchers in indoor and outdoor downstream tasks, e.g. +13.61% for DKM in indoor pose estimation. MESA, a novel method for precise area matching based on the Segment Anything Model (SAM), is proposed to effectively reduce matching redundancy in feature matching and promote accurate point matching. Feature matching suffers from matching redundancy, limiting the accuracy of existing methods. Although matching redundancy can be reduced by high-level image understanding, existing methods are either computationally expensive or rely on impractical semantic segmentation. MESA constructs a multi-relational Area Graph (AG) to model spatial structures and scale hierarchy of image areas segmented by SAM. Leveraging AG, MESA formulates area matching as an energy minimization problem within a Markov Random Field framework and solves it efficiently using Graph Cut and a learned area similarity model. A global matching energy refinement is further introduced to enhance area matching accuracy by considering the AG structures of both input images. MESA significantly outperforms the previous semantic segmentation-based area matching method (SGAM) on the ScanNet1500 benchmark. MESA remarkably boosts the accuracy of both semi-dense and dense point matchers for indoor and outdoor relative pose estimation, achieving state-of-the-art results on ScanNet1500 and MegaDepth1500 benchmarks. MESA effectively improves the performance of various point matchers in visual odometry tasks on the KITTI360 dataset. MESA does not fully exploit the potential of SAM features for area matching. The speed of MESA can be further improved for latency-sensitive applications. feature matching, matching redundancy, area matching, segment anything model (sam), graphical model
2401.16663 Report VR-GS: A Physical Dynamics-Aware Interactive Gaussian Splatting System in Virtual Reality Ying Jiang, Chang Yu, Tianyi Xie, Xuan Li, Yutao Feng, Huamin Wang, Minchen Li, Henry Lau, Feng Gao, Yin Yang, Chenfanfu Jiang As consumer Virtual Reality (VR) and Mixed Reality (MR) technologies gain momentum, there's a growing focus on the development of engagements with 3D virtual content. Unfortunately, traditional techniques for content creation, editing, and interaction within these virtual spaces are fraught with difficulties. They tend to be not only engineering-intensive but also require extensive expertise, which adds to the frustration and inefficiency in virtual object manipulation. Our proposed VR-GS system represents a leap forward in human-centered 3D content interaction, offering a seamless and intuitive user experience. By developing a physical dynamics-aware interactive Gaussian Splatting in a Virtual Reality setting, and constructing a highly efficient two-level embedding strategy alongside deformable body simulations, VR-GS ensures real-time execution with highly realistic dynamic responses. The components of our Virtual Reality system are designed for high efficiency and effectiveness, starting from detailed scene reconstruction and object segmentation, advancing through multi-view image in-painting, and extending to interactive physics-based editing. The system also incorporates real-time deformation embedding and dynamic shadow casting, ensuring a comprehensive and engaging virtual experience. Our project page is available at: https://yingjiang96.github.io/VR-GS/. Presents VR-GS, a novel system for real-time, physics-based interaction with 3D scenes represented by Gaussian Splatting. Traditional 3D content creation is complex and not user-friendly. VR-GS offers an intuitive and accessible way to interact with and edit high-fidelity 3D scenes in real-time. Combines 3D Gaussian Splatting with eXtended Position Based Dynamics (XPBD) using a novel two-level embedding strategy. This allows for real-time simulation of deformable objects represented by Gaussians, enhanced by segmentation, inpainting, and dynamic shadow mapping. Achieves real-time performance while maintaining high visual fidelity in physics-based interactions. The two-level embedding strategy effectively mitigates spiky artifacts common in deformed Gaussian Splatting. User study confirms significant improvements in immersion and realism compared to traditional transform-based interactions. Rendering high-fidelity Gaussian Splatting in VR at high resolutions can cause latency issues. Physical parameters are currently manually defined, limiting accessibility for non-expert users. gaussian splatting, neural radiance fields, virtual reality, physics-based simulation, real-time interaction
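The deformable-body side of this pipeline rests on XPBD constraint projection. The sketch below is a generic XPBD distance-constraint solve for two particles, not the paper's two-level Gaussian embedding; all names are illustrative.

```python
# Minimal XPBD distance-constraint iteration of the kind an embedded simulation
# mesh would run each substep (generic, not VR-GS-specific).
import numpy as np

def xpbd_distance_solve(x1, x2, w1, w2, rest_len, lam, compliance, dt):
    """x1, x2: particle positions (3,); w1, w2: inverse masses;
    lam: accumulated Lagrange multiplier for this constraint."""
    d = x1 - x2
    length = np.linalg.norm(d) + 1e-9
    c = length - rest_len                    # constraint value C(x)
    n = d / length                           # constraint gradient direction
    alpha_tilde = compliance / (dt * dt)     # time-scaled compliance
    dlam = (-c - alpha_tilde * lam) / (w1 + w2 + alpha_tilde)
    x1 = x1 + w1 * dlam * n
    x2 = x2 - w2 * dlam * n
    return x1, x2, lam + dlam
```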
2401.16575 Report Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking Ivana Beňová, Jana Košecká, Michal Gregor, Martin Tamajka, Marcel Veselý, Marián Šimko The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy. We focus on studying multimodal models that consider regions of interest (ROI) features obtained by object detectors as input tokens. We probe the understanding of verbs using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict the correct verb with high accuracy. This contrasts with previous conclusions drawn from image-text matching probing techniques that frequently fail in situations requiring verb understanding. The code for all experiments will be publicly available https://github.com/ivana-13/guided_masking. This paper introduces 'guided masking', a novel probing technique for evaluating multimodal vision-language transformer models. This method involves masking specific tokens in captions, particularly verbs, and assessing the model's ability to predict them accurately, offering a more nuanced understanding of the model's reasoning compared to traditional image-text matching. Understanding the fine-grained capabilities of multimodal transformers, especially their grasp of linguistic nuances like verb understanding, is crucial for advancing their interpretability and performance. Existing methods like image-text matching have limitations in providing such insights, necessitating alternative probing techniques like guided masking. The authors employ 'guided masking' by masking verbs in image captions and evaluating the model's prediction accuracy for the masked word. Additionally, they ablate visual tokens representing the action's subject to assess the model's grounding of verbs in visual features. Guided masking reveals that the studied models (ViLBERT, LXMERT, UNITER, VisualBERT) achieve over 75% accuracy in predicting masked verbs, indicating a better understanding of verbs than previously suggested by image-text matching methods. Ablating visual tokens associated with the verb's subject leads to a performance drop, highlighting the models' grounding of verbs in visual information. The study demonstrates the limitations of image-text matching for probing, showing instances where models correctly classify mismatched pairs based on object recognition rather than verb understanding. The study primarily focuses on models utilizing ROI features from object detectors, limiting generalizability to models employing different visual feature representations. Future work can extend guided masking to probe other linguistic aspects, such as objects, attributes, or counting, providing a comprehensive understanding of multimodal transformer capabilities. multimodal transformers, probing techniques, verb understanding, vision-language models, guided masking
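The probing protocol itself is simple to illustrate: mask the verb in a caption and rank the model's fillers. The sketch below uses a text-only masked LM as a stand-in, since the paper's targets (ViLBERT, LXMERT, UNITER, VisualBERT) additionally consume ROI features from an object detector, which this example omits.

```python
# Illustrative masked-verb probe with a text-only masked LM (assumption: the
# multimodal setup with ROI features is not reproduced here).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_verb_topk(caption, verb, k=5):
    """Mask `verb` in `caption` and return the model's top-k candidate fillers."""
    masked = caption.replace(verb, tok.mask_token, 1)
    inputs = tok(masked, return_tensors="pt")
    pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, pos[0]]
    return tok.convert_ids_to_tokens(logits.topk(k).indices.tolist())

print(masked_verb_topk("a man jumps over a puddle", "jumps"))
```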
2401.16468 Report InstructIR: High-Quality Image Restoration Following Human Instructions Marcos V. Conde, Gregor Geigle, Radu Timofte Image restoration is a fundamental problem that involves recovering a high-quality clean image from its degraded observation. All-In-One image restoration models can effectively restore images from various types and levels of degradation using degradation-specific information as prompts to guide the restoration model. In this work, we present the first approach that uses human-written instructions to guide the image restoration model. Given natural language prompts, our model can recover high-quality images from their degraded counterparts, considering multiple degradation types. Our method, InstructIR, achieves state-of-the-art results on several restoration tasks including image denoising, deraining, deblurring, dehazing, and (low-light) image enhancement. InstructIR improves +1dB over previous all-in-one restoration methods. Moreover, our dataset and results represent a novel benchmark for new research on text-guided image restoration and enhancement. Our code, datasets and models are available at: https://github.com/mv-lab/InstructIR This paper introduces InstructIR, the first text-guided deep learning model for blind image restoration using human-written instructions to guide restoration. Existing all-in-one restoration models, while effective, rely on image-based degradation classification and don't leverage users' understanding of what needs fixing. This work leverages the potential of text guidance for improved image restoration. The authors train InstructIR on a dataset of over 10,000 GPT4-generated prompts paired with degraded/clean images. A text encoder (sentence transformer) maps instructions to embeddings, guiding a NAFNet-based image model enhanced with a novel Instruction Condition Block (ICB) for task-specific feature adaptation. InstructIR achieves state-of-the-art results on five restoration tasks, outperforming previous all-in-one models by +1dB PSNR. The model generalizes well to various human-written instructions, demonstrating robustness to different language styles and levels of detail. The integration of instructions allows for selective restoration, enabling users to target specific degradations in an image. InstructIR's current implementation might not achieve the same level of perceptual quality as diffusion-based restoration models. The model struggles with images containing multiple real-world degradations and is limited to in-distribution degradation types. image restoration, text-guided image editing, instruction following, blind image restoration, all-in-one restoration
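The core conditioning idea, modulating restoration features with a sentence embedding of the instruction, can be sketched as below. This is a hedged approximation of the Instruction Condition Block; the exact block design and dimensions follow the paper's code.

```python
# Hedged sketch of instruction-conditioned feature modulation in the spirit of
# the Instruction Condition Block (ICB); layer choices here are assumptions.
import torch
import torch.nn as nn

class InstructionConditionBlock(nn.Module):
    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        # map the sentence embedding to a per-channel gate
        self.to_scale = nn.Sequential(nn.Linear(text_dim, channels), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) restoration features; text_emb: (B, text_dim)
        scale = self.to_scale(text_emb)[:, :, None, None]
        return feat + feat * scale          # residual, task-adapted features

block = InstructionConditionBlock(text_dim=384, channels=64)
feat = torch.randn(2, 64, 128, 128)
text_emb = torch.randn(2, 384)              # e.g. from a sentence transformer
print(block(feat, text_emb).shape)          # torch.Size([2, 64, 128, 128])
```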
2401.16456 Report SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design Seokju Yun, Youngmin Ro Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by parallelly combining global and local information. Building upon our solutions, we introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device, respectively, while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using Mask-RCNN head, our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone latency on GPU and mobile device, respectively. This paper proposes SHViT, a Single-Head Vision Transformer, that achieves state-of-the-art speed-accuracy tradeoff by addressing computational redundancy in both macro and micro architectural design. Efficient Vision Transformers are crucial for resource-constrained devices, and existing methods suffer from redundancies in architectural design, limiting their efficiency. The authors analyze spatial and channel redundancy in ViT architectures. They propose a larger-stride patchify stem and a novel Single-Head Self-Attention (SHSA) module to mitigate these redundancies. SHViT achieves state-of-the-art speed and accuracy on ImageNet-1k classification, outperforming models like EfficientNet-B0 and MobileViTv2. It demonstrates superior performance on object detection and instance segmentation tasks using RetinaNet and Mask-RCNN, surpassing EfficientViT and PoolFormer. SHViT exhibits consistent performance across diverse platforms, including GPUs, CPUs, and mobile devices, with notable speed improvements in ONNX runtime. While effective, the architecture's macro design might limit its ability to capture fine-grained features. Future work includes exploring cost-effective ways to utilize fine-grained features and integrating the single-head design into existing sophisticated attention mechanisms. vision transformer, efficient architecture, single-head attention, resource-constrained devices, computer vision
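A rough picture of the single-head attention idea is to run attention on only a slice of the channels and pass the rest through untouched, then fuse. The sketch below is an illustrative approximation, not the published SHSA module; channel splits and projections are assumptions.

```python
# Hedged sketch of single-head attention over a channel subset (SHSA-style);
# exact ratios, norms, and projections in the paper may differ.
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    def __init__(self, dim: int, attn_ratio: float = 0.25, qk_dim: int = 16):
        super().__init__()
        self.pdim = int(dim * attn_ratio)          # channels routed through attention
        self.qk_dim = qk_dim
        self.scale = qk_dim ** -0.5
        self.qkv = nn.Conv2d(self.pdim, qk_dim * 2 + self.pdim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_attn, x_pass = torch.split(x, [self.pdim, c - self.pdim], dim=1)
        q, k, v = torch.split(self.qkv(x_attn), [self.qk_dim, self.qk_dim, self.pdim], dim=1)
        q, k, v = (t.flatten(2) for t in (q, k, v))            # (B, d, N)
        attn = (q.transpose(1, 2) @ k) * self.scale            # (B, N, N), single head
        out = v @ attn.softmax(dim=-1).transpose(1, 2)         # weighted sum over keys
        out = out.reshape(b, self.pdim, h, w)
        return self.proj(torch.cat([out, x_pass], dim=1))      # fuse with untouched channels

print(SingleHeadAttention(64)(torch.randn(1, 64, 14, 14)).shape)
```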
2401.16420 Report InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer. InternLM-XComposer2 is a cutting-edge vision-language model excelling in free-form text-image composition and comprehension, surpassing its predecessor and even competing with GPT-4V and Gemini Pro. This model enables highly customizable content creation by crafting interleaved text-image content from diverse inputs like outlines, textual specifications, and reference images. The model leverages a Partial LoRA (PLoRA) approach, applying additional LoRA parameters solely to image tokens, and benefits from a high-quality and diverse dataset for training. InternLM-XComposer2 based on InternLM2-7B outperforms existing open-source MLLMs by a significant margin. It matches or surpasses GPT-4V and Gemini Pro in several benchmarks. It excels in creating high-quality long-text multimodal content and exhibits exceptional vision-language understanding. The model's performance on college-level benchmarks, while impressive, still has room for improvement. Future work could explore the impact of higher-resolution image inputs on text-image composition tasks. vision-language model, multimodal understanding, text-image composition, large language model, partial lora
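The Partial LoRA idea, adding a low-rank update only at image-token positions while text tokens stay on the frozen weight path, can be sketched as follows. This is an assumed minimal reading of PLoRA, not the released implementation.

```python
# Hedged sketch of Partial LoRA: the low-rank delta is masked to image tokens.
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                       # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a no-op

    def forward(self, x, image_mask):
        # x: (B, L, in_features); image_mask: (B, L) bool, True at image tokens
        out = self.base(x)
        delta = self.lora_b(self.lora_a(x)) * image_mask.unsqueeze(-1).to(x.dtype)
        return out + delta

layer = PartialLoRALinear(nn.Linear(512, 512))
x = torch.randn(2, 10, 512)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :4] = True                             # first 4 positions = image tokens
print(layer(x, mask).shape)
```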
2401.16157 Report Spatial-Aware Latent Initialization for Controllable Image Generation Wenqiang Sun, Teng Li, Zehong Lin, Jun Zhang Recently, text-to-image diffusion models have demonstrated impressive ability to generate high-quality images conditioned on the textual input. However, these models struggle to accurately adhere to textual instructions regarding spatial layout information. While previous research has primarily focused on aligning cross-attention maps with layout conditions, they overlook the impact of the initialization noise on the layout guidance. To achieve better layout control, we propose leveraging a spatial-aware initialization noise during the denoising process. Specifically, we find that the inverted reference image with finite inversion steps contains valuable spatial awareness regarding the object's position, resulting in similar layouts in the generated images. Based on this observation, we develop an open-vocabulary framework to customize a spatial-aware initialization noise for each layout condition. Without modifying other modules except the initialization noise, our approach can be seamlessly integrated as a plug-and-play module within other training-free layout guidance frameworks. We evaluate our approach quantitatively and qualitatively on the available Stable Diffusion model and COCO dataset. Equipped with the spatial-aware latent initialization, our method significantly improves the effectiveness of layout guidance while preserving high-quality content. This paper introduces a novel approach for enhancing layout control in text-to-image generation by leveraging a spatial-aware initialization noise during the denoising process of diffusion models. Existing text-to-image diffusion models struggle to accurately adhere to textual instructions regarding spatial layout, limiting their ability to generate images that precisely match user specifications. The method utilizes the DDIM inversion latent, which retains spatial information from a reference image, as the initialization noise for the image generation process. This spatial-aware latent guides the model to generate objects at the desired positions. An additional attention guidance process further refines the layout during sampling. The proposed method significantly improves layout accuracy as measured by IoU and mAP@0.5, outperforming state-of-the-art zero-shot layout guidance methods. It maintains competitive image quality as assessed by CLIP score. The approach is efficient, achieving better layout control in fewer optimization steps compared to previous methods. The method might experience challenges in maintaining prompt alignment due to the focus on spatial guidance, potentially leading to a slight decrease in CLIP score. The choice of background significantly influences generation quality, with pure white backgrounds posing challenges. text-to-image generation, diffusion models, layout control, ddim inversion, spatial-aware latent
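The spatial-aware initialization comes from running the reference latent through a finite number of deterministic DDIM inversion steps. A minimal sketch of that loop, with `eps_model` and `alphas_cumprod` as stand-ins for the pretrained diffusion model's noise predictor and schedule:

```python
# Hedged sketch of finite-step DDIM inversion used to obtain a spatial-aware
# initial latent; model components here are hypothetical stand-ins.
import torch

@torch.no_grad()
def ddim_invert(latent, eps_model, cond, alphas_cumprod, timesteps):
    """Push `latent` forward along the deterministic DDIM inversion trajectory."""
    x = latent
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]        # increasing noise levels
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t, cond)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x   # spatial-aware initialization noise for the layout-guided generation
```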
2401.16144 Report Divide and Conquer: Rethinking the Training Paradigm of Neural Radiance Fields Rongkai Ma, Leo Lebrat, Rodrigo Santa Cruz, Gil Avraham, Yan Zuo, Clinton Fookes, Olivier Salvado Neural radiance fields (NeRFs) have exhibited potential in synthesizing high-fidelity views of 3D scenes but the standard training paradigm of NeRF presupposes an equal importance for each image in the training set. This assumption poses a significant challenge for rendering specific views presenting intricate geometries, thereby resulting in suboptimal performance. In this paper, we take a closer look at the implications of the current training paradigm and redesign this for more superior rendering quality by NeRFs. Dividing input views into multiple groups based on their visual similarities and training individual models on each of these groups enables each model to specialize on specific regions without sacrificing speed or efficiency. Subsequently, the knowledge of these specialized models is aggregated into a single entity via a teacher-student distillation paradigm, enabling spatial efficiency for online rendering. Empirically, we evaluate our novel training framework on two publicly available datasets, namely NeRF synthetic and Tanks&Temples. Our evaluation demonstrates that our DaC training pipeline enhances the rendering quality of a state-of-the-art baseline model while exhibiting convergence to a superior minimum. This paper introduces DaC, a novel training pipeline for Neural Radiance Fields (NeRFs) that leverages a divide and conquer strategy to improve rendering quality, especially for scenes with intricate geometries. Standard NeRF training treats all views equally, limiting the rendering quality for complex scenes. DaC aims to overcome this limitation by enabling specialized learning of different scene regions. DaC divides input views into groups based on visual similarity and trains expert NeRF models on each group. Subsequently, it aggregates the knowledge from these experts into a single model via teacher-student distillation for efficient rendering. DaC consistently outperforms the standard NeRF training pipeline on both synthetic and real-world benchmark datasets. Dividing scenes into 4 partitions strikes a good balance between performance and efficiency. A balanced number of iterations for distillation and fine-tuning stages yields optimal results. The current implementation primarily focuses on static scenes and might require adaptations for dynamic scenarios. Future work will explore extending DaC to dynamic scenes and continual learning setups. neural radiance fields, nerf, novel view synthesis, divide and conquer, knowledge distillation
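The "divide" step amounts to clustering the training views by appearance so each expert NeRF sees a coherent subset. A hedged sketch of one way to do this, where `embed_image` is a hypothetical global image descriptor (e.g. a pretrained CNN/ViT feature extractor):

```python
# Hedged sketch of grouping training views by visual similarity; the paper's
# exact similarity measure and partitioning may differ.
import numpy as np
from sklearn.cluster import KMeans

def group_views(images, embed_image, n_groups=4):
    feats = np.stack([embed_image(img) for img in images])       # (N, D) descriptors
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8  # cosine-style geometry
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(feats)
    return [np.where(labels == g)[0] for g in range(n_groups)]   # view indices per expert
```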
2401.16087 Report High Resolution Image Quality Database Huang Huang, Qiang Wan, Jari Korhonen With technology for digital photography and high resolution displays rapidly evolving and gaining popularity, there is a growing demand for blind image quality assessment (BIQA) models for high resolution images. Unfortunately, the publicly available large scale image quality databases used for training BIQA models contain mostly low or general resolution images. Since image resizing affects image quality, we assume that the accuracy of BIQA models trained on low resolution images would not be optimal for high resolution images. Therefore, we created a new high resolution image quality database (HRIQ), consisting of 1120 images with resolution of 2880x2160 pixels. We conducted a subjective study to collect the subjective quality ratings for HRIQ in a controlled laboratory setting, resulting in accurate MOS at high resolution. To demonstrate the importance of a high resolution image quality database for training BIQA models to predict mean opinion scores (MOS) of high resolution images accurately, we trained and tested several traditional and deep learning based BIQA methods on different resolution versions of our database. The database is publicly available in https://github.com/jarikorhonen/hriq. This paper introduces HRIQ, a new high-resolution image quality database containing 1120 images with authentic distortions, rated by 175 users in a controlled lab environment. Existing large-scale image quality databases primarily contain low-resolution images, limiting the development of BIQA models for high-resolution displays where subtle distortions are more perceptible. Researchers collected high-resolution images, conducted a subjective quality assessment study in a lab setting, analyzed data for outliers, and evaluated various traditional and deep learning-based BIQA models on different resolution versions of the database. Traditional BIQA methods perform poorly on HRIQ across all resolutions. Deep learning BIQA models show better performance, but their accuracy varies with resolution. The proposed HR-BIQA, specifically designed for high-resolution images, achieves state-of-the-art performance on the full-resolution database. Limited diversity in test user demographics (primarily college students). HR-BIQA, while effective for high-resolution, exhibits lower performance on low-resolution images due to its patch-based approach. Future work can explore alternative BIQA architectures optimized for both high and low-resolution images. image quality assessment, high resolution images, image database, subjective quality assessment, biqa
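Aggregating the lab ratings into per-image MOS typically involves screening unreliable subjects before averaging. The sketch below shows one simple z-score-based variant; the paper's exact screening protocol may differ.

```python
# Hedged sketch of MOS aggregation with per-subject outlier screening
# (assumption: a simple z-score criterion, not necessarily the paper's).
import numpy as np

def compute_mos(ratings, z_thresh=2.0):
    """ratings: (num_subjects, num_images) raw opinion scores; returns per-image MOS."""
    ratings = np.asarray(ratings, dtype=float)
    img_mean = np.nanmean(ratings, axis=0)
    img_std = np.nanstd(ratings, axis=0) + 1e-8
    # flag subjects whose scores deviate strongly from the crowd on average
    subj_z = np.nanmean(np.abs(ratings - img_mean) / img_std, axis=1)
    keep = subj_z < z_thresh
    return np.nanmean(ratings[keep], axis=0)
```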
2401.15977 Report Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate reference image's feature to synthesized frames with the guidance of predicted trajectories from the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even at the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V can support users to precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more controllability of the I2V process than solely relying on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/. Presents Motion-I2V, a novel two-stage framework for consistent and controllable image-to-video generation with explicit motion modeling. Existing I2V methods struggle to maintain temporal consistency and offer limited controllability. Motion-I2V addresses these limitations. A diffusion-based motion field predictor (stage 1) deduces pixel trajectories. A motion-augmented temporal attention mechanism (stage 2) enhances video generation using predicted motions. Generates temporally consistent videos even with large motions, outperforming state-of-the-art methods. Offers fine-grained control over motion using sparse trajectory guidance and region-specific animation. Enables zero-shot video-to-video translation by leveraging motion from source videos. Generated videos exhibit medium brightness due to limitations in noise scheduling. Future work includes exploring improved noise scheduling and further enhancing controllability. image-to-video generation, diffusion models, motion modeling, controllable generation, video-to-video translation
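Propagating reference-image features along predicted per-pixel trajectories is the kind of operation the motion-augmented temporal attention relies on. A hedged sketch of backward warping features with a flow field (illustrative only, not the paper's attention module):

```python
# Hedged sketch of warping reference features along a predicted motion field.
import torch
import torch.nn.functional as F

def warp_features(ref_feat, flow):
    """ref_feat: (B, C, H, W); flow: (B, 2, H, W) displacement in pixels (x, y)."""
    b, _, h, w = ref_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(ref_feat)      # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                              # sampling positions
    coords[:, 0] = coords[:, 0] / (w - 1) * 2 - 1                  # normalize to [-1, 1]
    coords[:, 1] = coords[:, 1] / (h - 1) * 2 - 1
    grid = coords.permute(0, 2, 3, 1)                              # (B, H, W, 2)
    return F.grid_sample(ref_feat, grid, align_corners=True)

feat = torch.randn(1, 8, 32, 32)
flow = torch.zeros(1, 2, 32, 32)                # zero motion reproduces the input
print(torch.allclose(warp_features(feat, flow), feat, atol=1e-5))
```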
2401.15975 Report StableIdentity: Inserting Anybody into Anywhere at First Sight Qinghe Wang, Xu Jia, Xiaomin Li, Taiqing Li, Liqian Ma, Yunzhi Zhuge, Huchuan Lu Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity is still an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even with several images for each subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation into a space with an editable prior, which is constructed from celeb names. By incorporating identity prior and editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate our method outperforms previous customization methods. In addition, the learned identity can be flexibly combined with the off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step to unify image, video, and 3D customized generation models. This paper presents StableIdentity, a novel framework that allows for identity-consistent customization of human subjects in text-to-image generation using only a single face image. Existing methods for customizing the identity of human subjects in generated images struggle with maintaining consistent identity and flexibility across different contexts, especially when trained on limited data. StableIdentity leverages an encoder pretrained on face recognition for identity prior, and constructs an editable embedding space from celebrity names for editability prior. It also employs a masked two-phase diffusion loss to enhance identity preservation and generation diversity. StableIdentity outperforms state-of-the-art methods in identity preservation, text-image consistency, and generation quality. The learned identity can be seamlessly integrated with other image manipulation modules like ControlNet. StableIdentity demonstrates impressive generalization ability by successfully injecting learned identities into video and 3D generation models without finetuning. The method inherits limitations of the base Stable Diffusion model, such as potential hand anomalies. The performance of video customization is limited by the capabilities of current text-to-video generation models. text-to-image generation, diffusion models, identity customization, one-shot learning, video and 3d generation
2401.15947 Report MoE-LLaVA: Mixture of Experts for Large Vision-Language Models Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Munan Ning, Li Yuan Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performances. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inferring costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to the LLaVA-1.5-7B on various visual understanding datasets and even surpasses the LLaVA-1.5-13B in object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA. This paper proposes MoE-LLaVA, a novel sparse Large Vision-Language Model (LVLM) architecture based on Mixture of Experts (MoE), and MoE-Tuning, a three-stage training strategy to address the performance degradation issue in multi-modal sparsity learning. Scaling LVLMs improves performance but incurs high computational costs. MoE-LLaVA aims to achieve comparable performance with significantly fewer activated parameters, thus reducing computational overhead. MoE-LLaVA uses a three-stage training approach: 1) MLP training for visual token adaptation, 2) LLM training for multi-modal understanding, and 3) MoE layer training with FFN-initialized experts. This facilitates gradual transition to a sparse model. During inference, only the top-k experts are activated by a router. MoE-LLaVA achieves comparable performance to state-of-the-art LVLMs on visual understanding benchmarks with only ~3B sparsely activated parameters, significantly fewer than dense models. It outperforms LLaVA-1.5-13B on object hallucination benchmark (POPE) with only 2.2B activated parameters. Analysis reveals that MoE-LLaVA learns specific patterns in expert activation and modality preferences, demonstrating effective sparse multi-modal learning. Training stability, particularly with 16-bit precision, poses a challenge. Limited multi-modal instruction tuning data hinders exploration of larger MoE-LLaVA models (e.g., 10B+ parameters). large vision-language models, mixture of experts, sparse models, multi-modal learning, object hallucination
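The "activate only top-k experts per token" mechanism can be illustrated with a small sparse MoE layer over FFN experts. The sketch below is a generic router, not MoE-LLaVA's implementation; expert count, k, and the FFN shape are assumptions.

```python
# Hedged sketch of top-k sparse MoE routing over FFN experts.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x):                       # x: (num_tokens, dim)
        logits = self.router(x)                 # (T, E) routing scores
        weights, idx = logits.softmax(-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e         # tokens routed to expert e in this slot
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out

print(SparseMoE(64)(torch.randn(16, 64)).shape)    # torch.Size([16, 64])
```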
2401.15914 Report Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, after long enough finetuning but without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. Then we propose a novel approach OGEN to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. Such synthesized features will provide useful knowledge about unknowns and help regularize the decision boundary between ID and OOD data when optimized jointly. Equally important is our adaptive self-distillation mechanism to regularize our feature generation model during joint optimization, i.e., adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings. Code: https://github.com/apple/ml-ogen. This paper addresses the overfitting issue in finetuned vision-language models for improved out-of-distribution (OOD) generalization. Existing vision-language models, while demonstrating strong generalization capabilities, often overfit to known classes during finetuning, hindering their performance on novel, unseen classes, which is crucial for real-world applications and safety. The paper proposes OGEN, a novel approach that: 1) Introduces a class-conditional feature generator to synthesize image features for unknown classes based solely on their names, leveraging the aligned image-text feature spaces of models like CLIP. 2) Employs an adaptive self-distillation mechanism during training, utilizing past model checkpoints as teachers to guide the current model and prevent overfitting on known classes while improving generalization to unknown ones. OGEN consistently improves new class accuracy across various prompt learning baselines, significantly boosting performance on datasets with substantial inter-class variations. The approach maintains or enhances base class accuracy, demonstrating its ability to balance performance on both known and unknown classes. Ablation studies validate the contribution of both the feature generator and the adaptive self-distillation mechanism to OGEN’s effectiveness. The paper primarily focuses on prompt learning methods for finetuning, future work could explore its applicability to other finetuning techniques like adaptor tuning. While OGEN shows promise in improving OOD generalization, exploring its capabilities in quantifying uncertainty and evaluating on established OOD detection benchmarks is a potential future direction. out-of-distribution generalization, vision-language models, prompt learning, feature synthesis, self-distillation
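The self-distillation piece can be summarized as a temperature-smoothed KL term between an earlier "teacher" model state and the current student. The sketch below is a generic distillation loss; OGEN's adaptive selection of teacher checkpoints is not reproduced here.

```python
# Hedged sketch of a self-distillation loss from a past-checkpoint teacher;
# the adaptive teacher-selection strategy is an omitted assumption.
import torch
import torch.nn.functional as F

def self_distill_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL(teacher || student) with temperature smoothing, averaged over the batch."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```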
2401.15885 Report Rectify the Regression Bias in Long-Tailed Object Detection Ke Zhu, Minghao Fu, Jie Shao, Tianyu Liu, Jianxin Wu Long-tailed object detection faces great challenges because of its extremely imbalanced class distribution. Recent methods mainly focus on the classification bias and its loss function design, while ignoring the subtle influence of the regression branch. This paper shows that the regression bias exists and does adversely and seriously impact the detection accuracy. While existing methods fail to handle the regression bias, the class-specific regression head for rare classes is hypothesized to be the main cause of it in this paper. As a result, three kinds of viable solutions to cater for the rare categories are proposed, including adding a class-agnostic branch, clustering heads and merging heads. The proposed methods bring consistent and significant improvements over existing long-tailed detection methods, especially in rare and common classes. The proposed method achieves state-of-the-art performance in the large vocabulary LVIS dataset with different backbones and architectures. It generalizes well to more difficult evaluation metrics, relatively balanced datasets, and the mask branch. This is the first attempt to reveal and rectify the regression bias in long-tailed object detection. This paper reveals the detrimental impact of regression bias in long-tailed object detection and introduces three novel methods to mitigate it by enhancing regression for rare categories. Existing long-tailed object detection methods primarily address classification bias, neglecting the significant impact of the regression branch on detection accuracy, especially for rare categories. The authors leverage the observation that class-agnostic regression heads benefit rare categories and propose three solutions: 1) adding a class-agnostic branch alongside class-specific ones, 2) clustering similar regression heads based on object scale, and 3) merging heads of specific categories. Rectifying regression bias consistently improves performance across various existing long-tailed detection methods, particularly for rare categories. The proposed method achieves state-of-the-art results on the LVIS dataset with different backbones and architectures, demonstrating its effectiveness. The method generalizes well to various evaluation metrics, relatively balanced datasets (COCO, COCO-LT), and even to the mask prediction branch in instance segmentation. The performance improvement from mitigating regression bias is less pronounced when applied to stronger baselines, suggesting potential limitations in backbone model capacity. Adapting the proposed regression methods to one-stage object detectors, which typically employ class-agnostic regression heads, requires further exploration. long-tailed learning, object detection, regression bias, class-agnostic regression, lvis dataset
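The first of the three solutions, a class-agnostic regression branch alongside the class-specific one, can be sketched as a small box head. The fusion used here (simple averaging) is an assumption for illustration; the paper's exact combination rule may differ.

```python
# Hedged sketch of a box head with both class-specific and class-agnostic
# regression branches, so rare classes can lean on the shared deltas.
import torch
import torch.nn as nn

class BoxHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls_specific = nn.Linear(feat_dim, num_classes * 4)  # per-class box deltas
        self.cls_agnostic = nn.Linear(feat_dim, 4)                # shared box deltas

    def forward(self, roi_feat, labels):
        # roi_feat: (N, feat_dim); labels: (N,) assigned class ids
        n = roi_feat.size(0)
        specific = self.cls_specific(roi_feat).view(n, -1, 4)
        specific = specific[torch.arange(n), labels]              # pick each RoI's own class row
        agnostic = self.cls_agnostic(roi_feat)
        return 0.5 * (specific + agnostic)                        # illustrative fusion

head = BoxHead(256, num_classes=1203)                             # LVIS-sized vocabulary
print(head(torch.randn(8, 256), torch.randint(0, 1203, (8,))).shape)   # (8, 4)
```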
2401.15859 Report Diffusion Facial Forgery Detection Harry Cheng, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli Detecting diffusion-generated images has recently grown into an emerging research area. Existing diffusion-based datasets predominantly focus on general image generation. However, facial forgeries, which pose a more severe social risk, have remained less explored thus far. To address this gap, this paper introduces DiFF, a comprehensive dataset dedicated to face-focused diffusion-generated images. DiFF comprises over 500,000 images that are synthesized using thirteen distinct generation methods under four conditions. In particular, this dataset leverages 30,000 carefully collected textual and visual prompts, ensuring the synthesis of images with both high fidelity and semantic consistency. We conduct extensive experiments on the DiFF dataset via a human test and several representative forgery detection methods. The results demonstrate that the binary detection accuracy of both human observers and automated detectors often falls below 30%, shedding light on the challenges in detecting diffusion-generated facial forgeries. Furthermore, we propose an edge graph regularization approach to effectively enhance the generalization capability of existing detectors. This paper introduces DiFF, the first large-scale dataset for diffusion-based facial forgery detection, containing over 500,000 images synthesized using 13 methods under 4 conditions (Text-to-Image, Image-to-Image, Face Swapping, Face Editing). Existing diffusion-based datasets focus on general image generation and lack the scale and diversity needed to train robust facial forgery detectors. Researchers collected pristine celebrity images, generated diverse textual and visual prompts, and synthesized forgeries using various diffusion models while maintaining semantic consistency. Human observers and automated detectors struggle to identify diffusion-generated facial forgeries, often falling below 30% accuracy. Detectors exhibit significant performance drops in cross-domain settings, highlighting the challenge of generalizing across forgery types. The proposed Edge Graph Regularization (EGR) method, incorporating edge graphs into image processing, significantly improves detector generalizability, achieving up to 10% AUC improvement. DiFF currently focuses on facial forgeries, limiting its generalizability to other domains. Future work includes expanding DiFF with more methods and conditions, and exploring new tasks like traceability and retrieval of diffusion-generated images. diffusion models, facial forgery detection, dataset, edge graph regularization, deepfakes
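An edge-based regularizer needs an edge map to work with. The sketch below extracts a gradient-magnitude edge map with fixed Sobel filters, the kind of auxiliary signal such a regularizer can consume alongside the RGB input; it is not the paper's exact edge-graph construction.

```python
# Hedged sketch of Sobel edge-map extraction as an auxiliary signal.
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """img: (B, 3, H, W) in [0, 1]; returns (B, 1, H, W) gradient magnitude."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                       # y-derivative kernel
    gx = F.conv2d(gray, kx.to(img), padding=1)
    gy = F.conv2d(gray, ky.to(img), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

print(sobel_edges(torch.rand(2, 3, 224, 224)).shape)   # torch.Size([2, 1, 224, 224])
```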
2401.15841 Report 2L3: Lifting Imperfect Generated 2D Images into Accurate 3D Yizheng Chen, Rengan Xie, Qi Ye, Sen Yang, Zixuan Xie, Tianxiao Chen, Rong Li, Yuchi Huo Reconstructing 3D objects from a single image is an intriguing but challenging problem. One promising solution is to utilize multi-view (MV) 3D reconstruction to fuse generated MV images into consistent 3D objects. However, the generated images usually suffer from inconsistent lighting, misaligned geometry, and sparse views, leading to poor reconstruction quality. To cope with these problems, we present a novel 3D reconstruction framework that leverages intrinsic decomposition guidance, transient-mono prior guidance, and view augmentation to cope with the three issues, respectively. Specifically, we first leverage intrinsic decomposition to decouple the shading information from the generated images to reduce the impact of inconsistent lighting; then, we introduce mono prior with view-dependent transient encoding to enhance the reconstructed normal; and finally, we design a view augmentation fusion strategy that minimizes pixel-level loss in generated sparse views and semantic loss in augmented random views, resulting in view-consistent geometry and detailed textures. Our approach, therefore, enables the integration of a pre-trained MV image generator and a neural network-based volumetric signed distance function (SDF) representation for a single image to 3D object reconstruction. We evaluate our framework on various datasets and demonstrate its superior performance in both quantitative and qualitative assessments, signifying a significant advancement in 3D object reconstruction. Compared with the latest state-of-the-art method SyncDreamer, we reduce the Chamfer Distance error by about 36% and improve PSNR by about 30%. This paper introduces a novel multi-view 3D reconstruction method specifically designed for imperfect, 'dreamed' images generated by off-the-shelf models. Existing 3D reconstruction methods struggle with the inconsistencies (lighting, geometry, view sparsity) present in images generated by current multi-view generation models. This work aims to bridge this gap and enable high-quality 3D reconstruction from such imperfect data. The framework employs a two-stage reconstruction process. Stage 1 reconstructs geometry and albedo using monocular normal priors, per-frame normal encoding, intrinsic decomposition guidance, and view augmentation. Stage 2 reconstructs shaded texture using per-frame color encoding and the geometry from Stage 1. Significantly improved 3D reconstruction quality (up to 36% lower CD error and 30% higher PSNR) compared to using basic Neus reconstruction with state-of-the-art generation models. Effective handling of inconsistent lighting, misaligned geometry, and view sparsity issues common in generated images. Generalizability and robustness demonstrated through successful application to various multi-view generation models and out-of-domain images. Reliance on pre-trained models for normal estimation and decomposition, which might introduce limitations depending on their performance. Further research on reducing reliance on pre-trained models and exploring end-to-end training for improved performance. 3d reconstruction, multi-view synthesis, neural rendering, image generation, intrinsic image decomposition
2401.15708 Report Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding Jianxiang Lu, Cong Xie, Hui Guo As large-scale text-to-image generation models have made remarkable progress in the field of text-to-image generation, many fine-tuning methods have been proposed. However, these models often struggle with novel objects, especially with one-shot scenarios. Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way, using only a single input image and the object-specific regions of interest. To improve generalizability and mitigate overfitting, in our paradigm, a prototypical embedding is initialized based on the object's appearance and its class, before fine-tuning the diffusion model. And during fine-tuning, we propose a class-characterizing regularization to preserve prior knowledge of object classes. To further improve fidelity, we introduce object-specific loss, which can also be used to implant multiple objects. Overall, our proposed object-driven method for implanting new objects can integrate seamlessly with existing concepts as well as with high fidelity and generalization. Our method outperforms several existing works. The code will be released. This paper presents a novel object-driven one-shot fine-tuning method for text-to-image diffusion models using prototypical embedding, aiming to improve generalizability and fidelity in generating images of user-specified objects. Existing fine-tuning methods struggle with novel objects in one-shot scenarios, often leading to overfitting or low fidelity in generated images. This method addresses these challenges by enabling the accurate implantation of user-specified objects into a generative model using only a single image while maintaining the model's generalization ability. The method utilizes prototypical embedding initialized based on the object's appearance and class to improve generalizability. It employs class-characterizing regularization during fine-tuning to preserve prior knowledge of object classes. Additionally, it introduces an object-specific loss function supervised by the object in the input image to enhance fidelity. The method effectively mitigates overfitting and enables the generation of images that accurately reflect the user-specified object. It preserves the prior knowledge of object classes, leading to improved diversity and naturalness in the synthesized images. The object-specific loss function enhances fidelity by focusing on the object region during training and supports the implantation of multiple objects. The method may exhibit limitations in handling objects with complex edges, leading to potential degradation in the quality of generated image edges. Fidelity might be slightly compromised when implanting smaller objects. Future work will focus on improving mask acquisition and incorporating a multi-scale perception mechanism. object-driven, one-shot, diffusion model, prototypical embedding, text-to-image synthesis
2401.15688 Report Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, Zhenguo Li Despite significant advancements in text-to-image models for generating high-quality images, these methods still struggle to ensure the controllability of text prompts over images in the context of complex text prompts, especially when it comes to retaining object attributes and relationships. In this paper, we propose CompAgent, a training-free approach for compositional text-to-image generation, with a large language model (LLM) agent as its core. The fundamental idea underlying CompAgent is premised on a divide-and-conquer methodology. Given a complex text prompt containing multiple concepts including objects, attributes, and relationships, the LLM agent initially decomposes it, which entails the extraction of individual objects, their associated attributes, and the prediction of a coherent scene layout. These individual objects can then be independently conquered. Subsequently, the agent performs reasoning by analyzing the text, plans and employs the tools to compose these isolated objects. The verification and human feedback mechanism is finally incorporated into our agent to further correct the potential attribute errors and refine the generated images. Guided by the LLM agent, we propose a tuning-free multi-concept customization model and a layout-to-image generation model as the tools for concept composition, and a local image editing method as the tool to interact with the agent for verification. The scene layout controls the image generation process among these tools to prevent confusion among multiple objects. Extensive experiments demonstrate the superiority of our approach for compositional text-to-image generation: CompAgent achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation. The extension to various related tasks also illustrates the flexibility of our CompAgent for potential applications. This paper proposes CompAgent, a training-free approach for compositional text-to-image generation using an LLM agent for divide-and-conquer image synthesis based on complex text prompts. Existing text-to-image models struggle to accurately represent object attributes and relationships within complex scenes, limiting their controllability. An LLM agent decomposes complex text prompts into individual objects and scene layouts, then leverages a toolkit including multi-concept customization, layout-to-image generation, and local image editing tools to compose the final image. A verification and feedback mechanism further enhances accuracy. CompAgent shows significant improvement in compositional text-to-image generation, achieving over 10% improvement on the T2I-CompBench benchmark. The LLM agent effectively plans and selects appropriate tools based on text prompt analysis, improving object attribute binding and relationship representation. The method exhibits flexibility for extension to tasks like multi-concept customization, image editing, and object placement. The reliance on multiple tools and models could increase computational cost. Further exploration of LLM agents with enhanced reasoning and planning capabilities could lead to improved performance. text-to-image generation, compositional generation, llm agent, image editing, layout-to-image
2401.15687 Report Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and of abundant, well-annotated multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This presents the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidance from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation. This paper proposes Media2Face, a diffusion-based model that generates realistic and expressive 3D facial animations from diverse media inputs, including audio, text, and images. Existing methods for synthesizing 3D facial animations from speech often lack realism and flexible conditioning due to limited training data and control mechanisms. This work aims to overcome these limitations and generate more compelling and controllable animations. The authors introduce a new neural representation called GNPFA to capture fine-grained facial expressions and head poses. They use GNPFA to build M2F-D, a large and diverse 4D facial animation dataset. Then, they train Media2Face, a latent diffusion model, on M2F-D to generate animations conditioned on audio, text, and image inputs using a multi-classifier-free guidance approach. Media2Face achieves state-of-the-art performance in lip synchronization accuracy, facial expression stylization, and rhythmic head movement synthesis. The model allows for keyframe editing and CLIP-guided style editing, enabling fine-grained control over the generated animations. User studies confirm that Media2Face generates more realistic and expressive animations than existing methods. The current implementation of real-time generation is limited to 30fps. The model might struggle with generating animations for unseen languages or highly exaggerated expressions. facial animation, diffusion models, speech synthesis, multi-modal learning, computer graphics
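Media2Face conditions the diffusion model on audio, text, and image via a "multi-classifier-free guidance" approach. The summary does not spell out the exact formulation, so below is a hedged sketch of the common recipe for combining several conditions around one unconditional prediction; `denoiser`, the dict layout, and the per-modality weights are placeholders of mine.

```python
import torch

def multi_cfg_eps(denoiser, x_t, t, conds, weights, null_conds):
    """Multi-condition classifier-free guidance (illustrative, not the
    paper's exact scheme).

    denoiser(x_t, t, cond_dict) -> predicted noise.
    conds / null_conds: dicts like {"audio": ..., "text": ..., "image": ...}
    holding real and null embeddings; weights maps modality -> guidance scale.
    """
    eps_uncond = denoiser(x_t, t, null_conds)
    eps = eps_uncond.clone()
    for name, w in weights.items():
        single = dict(null_conds)
        single[name] = conds[name]          # switch on one modality at a time
        eps = eps + w * (denoiser(x_t, t, single) - eps_uncond)
    return eps
```

Each modality then contributes its own guidance direction with its own strength, which is what makes per-modality control (e.g. stronger text style, weaker image style) possible.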
2401.15652 Report Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach Shaofeng Zhang, Jinfa Huang, Qiang Zhou, Zhibin Wang, Fan Wang, Jiebo Luo, Junchi Yan Image outpainting aims to generate the content of an input sub-image beyond its original boundaries. It is an important task in content generation yet remains an open problem for generative models. This paper pushes the technical frontier of image outpainting in two directions that have not been resolved in the literature: 1) outpainting with arbitrary and continuous multiples (without restriction), and 2) outpainting in a single step (even for large expansion multiples). Moreover, we develop a method that does not depend on a pre-trained backbone network, which, in contrast, is commonly required by previous SOTA outpainting methods. The arbitrary multiple outpainting is achieved by utilizing randomly cropped views from the same image during training to capture arbitrary relative positional information. Specifically, by feeding one view and positional embeddings as queries, we can reconstruct another view. At inference, we generate images with arbitrary expansion multiples by inputting an anchor image and its corresponding positional embeddings. The one-step outpainting ability here is particularly noteworthy in contrast to previous methods that need to be performed N times to obtain a final multiple that is N times their basic, fixed multiple. We evaluate the proposed approach (called PQDiff as we adopt a diffusion-based generator as our embodiment, under our proposed Positional Query scheme) on public benchmarks, demonstrating its superior performance over state-of-the-art approaches. Specifically, PQDiff achieves state-of-the-art FID scores on the Scenery (21.512), Building Facades (25.310), and WikiArts (36.212) datasets. Furthermore, under the 2.25x, 5x and 11.7x outpainting settings, PQDiff only takes 40.6%, 20.3% and 10.2% of the time of the benchmark state-of-the-art (SOTA) method. This paper proposes PQDiff, a novel image outpainting method that utilizes relative positional queries and a diffusion-based generator to achieve outpainting with arbitrary, continuous multiples in a single step. Image outpainting, while important for content generation, is limited by existing methods that require multiple steps for large expansions and lack flexibility in specifying expansion multiples. PQDiff addresses these limitations with improved efficiency and controllability. PQDiff leverages a positional query scheme, randomly cropping training images to create anchor and target views. This allows the model to learn arbitrary relative positional information and generate images with continuous expansion multiples in one step. PQDiff achieves state-of-the-art FID scores on Scenery, Building Facades, and WikiArts datasets for 11.7x outpainting. Significantly faster generation speed compared to previous methods, requiring only 10.2% of the time for 11.7x outpainting. Demonstrates the ability to outpaint at arbitrary positions within the image, not just surrounding regions. The performance of PQDiff can be influenced by the random crop ratio used during training. Further exploration of integrating pre-trained models into the PQDiff framework for enhanced consistency is a potential avenue for future work. image outpainting, diffusion models, positional embeddings, generative models, content generation
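PQDiff trains on pairs of random crops from the same image and feeds the relative position of one crop with respect to the other as a positional query. A sketch of one plausible encoding of that relative position: express the target crop's corners in the anchor crop's normalized frame and lift them to sinusoidal features. The exact embedding used by the paper is not given here, so the frequencies and layout are assumptions.

```python
import math
import torch

def relative_crop_position(anchor_box, target_box, dim=64):
    """Encode where target_box lies relative to anchor_box.

    Boxes are (x0, y0, x1, y1) in absolute pixel coordinates. The target's
    corners are normalized by the anchor's size and lifted to a sinusoidal
    embedding of length 4 * dim.
    """
    ax0, ay0, ax1, ay1 = anchor_box
    aw, ah = ax1 - ax0, ay1 - ay0
    tx0, ty0, tx1, ty1 = target_box
    rel = torch.tensor([(tx0 - ax0) / aw, (ty0 - ay0) / ah,
                        (tx1 - ax0) / aw, (ty1 - ay0) / ah])
    freqs = torch.exp(torch.arange(0, dim, 2) * (-math.log(1e4) / dim))
    angles = rel[:, None] * freqs[None, :]            # (4, dim/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten()

# Example: a target crop shifted to the right of a 256x256 anchor.
emb = relative_crop_position((0, 0, 256, 256), (128, 0, 384, 256))
```

At inference, sliding the target box arbitrarily far outside the anchor is what yields continuous, one-step expansion multiples.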
2401.15636 Report FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models Feihong He, Gang Li, Mengyuan Zhang, Leilei Yan, Lingyu Si, Fanzhang Li The rapid development of generative diffusion models has significantly advanced the field of style transfer. However, most current style transfer methods based on diffusion models typically involve a slow iterative optimization process, e.g., model fine-tuning and textual inversion of style concept. In this paper, we introduce FreeStyle, an innovative style transfer method built upon a pre-trained large diffusion model, requiring no further optimization. Besides, our method enables style transfer only through a text description of the desired style, eliminating the necessity of style images. Specifically, we propose a dual-stream encoder and single-stream decoder architecture, replacing the conventional U-Net in diffusion models. In the dual-stream encoder, two distinct branches take the content image and style text prompt as inputs, achieving content and style decoupling. In the decoder, we further modulate features from the dual streams based on a given content image and the corresponding style text prompt for precise style transfer. Our experimental results demonstrate high-quality synthesis and fidelity of our method across various content images and style text prompts. The code and more results are available at our project website:https://freestylefreelunch.github.io/. This paper introduces FreeStyle, a novel text-guided style transfer method that leverages pre-trained large text-guided diffusion models to perform style transfer without any optimization or the need for reference style images. Existing style transfer methods based on diffusion models rely on time-consuming optimization processes or require reference style images, limiting their practicality. FreeStyle addresses these limitations by directly utilizing the style generation capabilities of pre-trained diffusion models. FreeStyle employs a dual-stream encoder and a single-stream decoder architecture. The dual-stream encoder separately processes the content image and style text prompt, while the single-stream decoder modulates and fuses the extracted features for style transfer. FreeStyle generates high-quality stylized images with accurate style expression and content preservation across diverse content images and style text prompts. Qualitative comparisons demonstrate that FreeStyle outperforms existing methods in terms of visual quality, artistic consistency, and robustness. Quantitative evaluations using CLIP Score and human preference studies further validate FreeStyle's superiority over state-of-the-art methods. FreeStyle's performance is influenced by the quality and diversity of the pre-trained diffusion model used. Fine-grained control over specific style elements within the image might require further exploration. style transfer, diffusion models, text-guided synthesis, training-free, feature modulation
2401.15318 Report Gaussian Splashing: Dynamic Fluid Synthesis with Gaussian Splatting Yutao Feng, Xiang Feng, Yintong Shang, Ying Jiang, Chang Yu, Zeshun Zong, Tianjia Shao, Hongzhi Wu, Kun Zhou, Chenfanfu Jiang, Yin Yang We demonstrate the feasibility of integrating physics-based animations of solids and fluids with 3D Gaussian Splatting (3DGS) to create novel effects in virtual scenes reconstructed using 3DGS. Leveraging the coherence of Gaussian splatting and position-based dynamics (PBD) in the underlying representation, we manage rendering, view synthesis, and the dynamics of solids and fluids in a cohesive manner. Similar to Gaussian Shader, we enhance each Gaussian kernel with an added normal, aligning the kernel's orientation with the surface normal to refine the PBD simulation. This approach effectively eliminates spiky noise that arises from rotational deformation in solids. It also allows us to integrate physically based rendering to augment the dynamic surface reflections on fluids. Consequently, our framework is capable of realistically reproducing surface highlights on dynamic fluids and facilitating interactions between scene objects and fluids from new views. For more information, please visit our project page at https://amysteriouscat.github.io/GaussianSplashing/. Gaussian Splashing (GSP) is a novel framework that integrates physics-based animation of fluids and solids with 3D Gaussian Splatting (3DGS) for creating dynamic effects in reconstructed 3D scenes. Existing NeRF/3DGS-based dynamic scene reconstruction methods lack the ability to realistically simulate and render fluid-solid interactions, limiting their applications. GSP combines position-based dynamics (PBD) with 3DGS. It uses Gaussian kernels for both scene representation and PBD discretization. The framework employs anisotropy loss to maintain rendering quality under large deformations and integrates a Gaussian shader for dynamic specular reflection. It also utilizes AI inpainting to fill missing textures caused by object displacement. GSP enables realistic two-way coupled fluid-solid interaction within 3DGS scenes. It achieves high-quality rendering of dynamic fluids with specular highlights. The framework allows for interactive scene editing, such as transforming objects into fluids. The current PBD-based simulation, while versatile, has limitations in physical accuracy and could be enhanced with more sophisticated meshless methods. Fluid rendering, particularly the handling of refraction and the computational cost associated with a large number of fluid particles, requires further improvement. 3d gaussian splatting, fluid simulation, position-based dynamics, dynamic scene reconstruction, novel view synthesis
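Gaussian Splashing drives the Gaussian kernels with position-based dynamics (PBD). For orientation, here is a minimal generic PBD step (predict positions, project distance constraints, update velocities); it is textbook PBD with equal particle masses, not the paper's specific constraint set or fluid solver.

```python
import numpy as np

def pbd_step(x, v, constraints, rest_len, dt=1e-2, iters=8, gravity=(0.0, -9.8, 0.0)):
    """One position-based-dynamics step.

    x, v: (N, 3) particle positions and velocities.
    constraints: (M, 2) integer pairs of particle indices.
    rest_len: (M,) rest distances for those pairs.
    """
    x_pred = x + dt * (v + dt * np.asarray(gravity))      # explicit prediction
    for _ in range(iters):                                # Gauss-Seidel projection
        for (i, j), d0 in zip(constraints, rest_len):
            delta = x_pred[i] - x_pred[j]
            dist = np.linalg.norm(delta) + 1e-9
            corr = 0.5 * (dist - d0) * delta / dist       # equal-mass split
            x_pred[i] -= corr
            x_pred[j] += corr
    v_new = (x_pred - x) / dt                             # velocity from positions
    return x_pred, v_new
```

In the paper's setting, each simulated particle also carries a Gaussian kernel (with a normal), so updating positions and orientations directly moves the renderable scene representation.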
2401.14828 Report TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts Jingyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still lack accurate control of the specified appearance and location of the editing result due to the inherent limitations of the text description. To this end, we propose a 3D scene editing framework, TIP-Editor, that accepts both text and image prompts and a 3D bounding box to specify the editing region. With the image prompt, users can conveniently specify the detailed appearance/style of the target content in complement to the text description, enabling accurate control of the appearance. Specifically, TIP-Editor employs a stepwise 2D personalization strategy to better learn the representation of the existing scene and the reference image, in which a localization loss is proposed to encourage correct object placement as specified by the bounding box. Additionally, TIP-Editor utilizes explicit and flexible 3D Gaussian splatting as the 3D representation to facilitate local editing while keeping the background unchanged. Extensive experiments have demonstrated that TIP-Editor conducts accurate editing following the text and image prompts in the specified bounding box region, consistently outperforming the baselines in editing quality and alignment to the prompts, both qualitatively and quantitatively. Presents TIP-Editor, a 3D scene editing framework that allows users to edit existing scenes using both text and image prompts within a user-specified 3D bounding box, offering accurate control over the appearance and location of the edit. Existing text-driven 3D scene editing methods lack accurate control over the appearance and location of the editing result due to the inherent limitations of the text description. TIP-Editor employs a stepwise 2D personalization strategy to learn representations of the existing scene and the reference image. It utilizes explicit and flexible 3D Gaussian splatting (GS) for the 3D scene representation, facilitating local editing while preserving the background. A localization loss is introduced during personalization to ensure accurate object placement. TIP-Editor accurately captures unique characteristics specified in the reference images, offering superior controllability. It supports sequential editing, allowing multiple modifications without noticeable quality degradation. Both qualitative and quantitative evaluations demonstrate TIP-Editor's superiority in editing quality, visual fidelity, and user satisfaction compared to existing methods. The reliance on coarse bounding box input can be problematic in complex scenes where bounding boxes might include unwanted elements. Extracting a smooth and accurate mesh from GS-represented scenes for further geometric manipulation remains a challenge. 3d scene editing, text-guided image editing, image-guided image editing, 3d gaussian splatting, score distillation sampling
2401.14754 Report VJT: A Video Transformer on Joint Tasks of Deblurring, Low-light Enhancement and Denoising Yuxiang Hui, Yang Liu, Yaofang Liu, Fan Jia, Jinshan Pan, Raymond Chan, Tieyong Zeng Video restoration task aims to recover high-quality videos from low-quality observations. This contains various important sub-tasks, such as video denoising, deblurring and low-light enhancement, since video often faces different types of degradation, such as blur, low light, and noise. Even worse, these kinds of degradation could happen simultaneously when taking videos in extreme environments. This poses significant challenges if one wants to remove these artifacts at the same time. In this paper, to the best of our knowledge, we are the first to propose an efficient end-to-end video transformer approach for the joint task of video deblurring, low-light enhancement, and denoising. This work builds a novel multi-tier transformer where each tier uses a different level of degraded video as a target to learn the features of video effectively. Moreover, we carefully design a new tier-to-tier feature fusion scheme to learn video features incrementally and accelerate the training process with a suitable adaptive weighting scheme. We also provide a new Multiscene-Lowlight-Blur-Noise (MLBN) dataset, which is generated according to the characteristics of the joint task based on the RealBlur dataset and YouTube videos to simulate realistic scenes as far as possible. We have conducted extensive experiments, compared with many previous state-of-the-art methods, to show the effectiveness of our approach clearly. This paper proposes Video Joint Task (VJT), a novel multi-tier video transformer framework for the joint task of video deblurring, low-light enhancement, and denoising. Real-world videos often suffer from multiple degradations simultaneously (blur, low light, noise), necessitating a joint approach for optimal restoration. The VJT employs a multi-tier decoder structure with feature fusion between tiers to progressively learn features for the three subtasks. An adaptive weighting scheme balances the multiple loss functions, accelerating training and enhancing results. VJT outperforms state-of-the-art methods (e.g., RVRT, LEDNet) on the proposed Multi-scene Lowlight-Blur-Noise (MLBN) dataset, achieving a PSNR of 25.45dB and SSIM of 0.8083. The multi-tier architecture with feature fusion significantly improves restoration quality compared to single-tier methods. Adaptive weighting scheme effectively balances loss functions, leading to faster training convergence and improved performance compared to fixed-weight methods. The computational cost of the multi-tier transformer architecture is relatively high, limiting real-time applicability. The MLBN dataset, while designed to approximate real-world scenes, is still synthetic and may not fully capture the complexities of real-world degradations. video restoration, video deblurring, low-light enhancement, video denoising, video transformer
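VJT balances the deblurring, low-light-enhancement, and denoising objectives with an adaptive weighting scheme. The summary does not specify the rule, so below is a hedged sketch of one standard alternative, homoscedastic-uncertainty weighting (Kendall et al.), used here purely to illustrate how multiple restoration losses can be balanced with learned weights.

```python
import torch
import torch.nn as nn

class AdaptiveLossWeighting(nn.Module):
    """Learned log-variance weighting of K task losses (a common scheme;
    the paper's exact adaptive weighting may differ)."""

    def __init__(self, num_losses: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_losses))

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])   # large loss -> lower weight
            total = total + precision * loss + self.log_vars[i]
        return total

# weighting = AdaptiveLossWeighting(3)
# total_loss = weighting([loss_deblur, loss_lowlight, loss_denoise])
```

The log-variance parameters are optimized jointly with the network, so the relative weights shift automatically as the tiers converge at different rates.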
2401.14425 Report No Longer Trending on Artstation: Prompt Analysis of Generative AI Art Jon McCormack, Maria Teresa Llano, Stephen James Krol, Nina Rajcic Image generation using generative AI is rapidly becoming a major new source of visual media, with billions of AI generated images created using diffusion models such as Stable Diffusion and Midjourney over the last few years. In this paper we collect and analyse over 3 million prompts and the images they generate. Using natural language processing, topic analysis and visualisation methods we aim to understand collectively how people are using text prompts, the impact of these systems on artists, and more broadly on the visual cultures they promote. Our study shows that prompting focuses largely on surface aesthetics, reinforcing cultural norms, popular conventional representations and imagery. We also find that many users focus on popular topics (such as making colouring books, fantasy art, or Christmas cards), suggesting that the dominant use for the systems analysed is recreational rather than artistic. This paper investigates the use of text prompts in text-to-image (TTI) AI art generation, analyzing over 3 million prompts from Stable Diffusion and Midjourney (2022-2023) to understand user trends and the impact of these systems on visual culture. The rapid adoption of TTI systems raises concerns about bias, artistic homogenization, and the impact on human artists. Understanding how people utilize these systems is crucial to assess their influence on visual art and culture. The study employs natural language processing, topic analysis, and data visualization techniques to analyze prompt datasets from Stable Diffusion and Midjourney. It examines trends in prompt usage, stylistic references, artist mentions, and the content of generated images. Prompting in TTI systems often prioritizes achieving desired visual aesthetics over conveying unique artistic ideas, as evidenced by the prevalence of terms like 'cinematic lighting' and 'photorealistic'. Analysis reveals a significant bias toward popular and conventional artistic styles, potentially leading to aesthetic homogenization and the reinforcement of existing norms. The study finds a dominant focus on generating images of women, particularly in genres like fantasy art and anime, highlighting potential biases and the reinforcement of stereotypes. The study is limited to data from Stable Diffusion and Midjourney, and future research should include data from other popular TTI systems like DALL-E and Leonardo. Future work could investigate the agency exerted by TTI systems on human users and how their inherent properties might shape future image production. generative ai, prompting, visual arts & culture, text-to-image, ai art
2401.14405 Report Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway - given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On the image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT. This paper proposes Multimodal Pathway, a framework to improve the performance of a transformer on a specific modality using irrelevant data from other modalities. Existing multimodal learning methods rely on paired or interleaved data, requiring strong relevance between samples. This work explores improving models with irrelevant data, addressing an open problem in the field. The method uses two transformers, one trained on the target modality and another on an auxiliary modality. Cross-Modal Re-parameterization connects the models, allowing the target model to leverage the auxiliary model's weights during training without inference cost. M2PT consistently improves performance across image, video, point cloud, and audio modalities. The method is effective even when auxiliary model weights are fixed during fine-tuning, demonstrating the transferability of learned knowledge. Empirical studies suggest the improvements stem from the auxiliary model's ability to enhance hierarchical representations, not just better initialization. The theoretical explanation behind the performance improvements needs further investigation. Future work will explore extending Multimodal Pathways to CNNs and cross-architecture scenarios. multimodal learning, transformer, re-parameterization, modality-complementary knowledge, hierarchical representation
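The core mechanism of M2PT is Cross-Modal Re-parameterization: each target-modality linear weight is augmented with the frozen auxiliary-modality weight scaled by a learnable factor, and the two can be merged after training so inference cost is unchanged. A sketch of one way to realize this; the class and attribute names are mine.

```python
import torch
import torch.nn as nn

class CrossModalLinear(nn.Module):
    """y = x @ (W_target + lambda * W_aux)^T + b.

    W_aux comes from a transformer trained on another (irrelevant) modality
    and stays frozen; only W_target, b, and the scalar lambda are trained.
    merge() folds W_aux into W_target for zero-overhead inference.
    """

    def __init__(self, target: nn.Linear, aux: nn.Linear):
        super().__init__()
        self.target = target
        self.register_buffer("aux_weight", aux.weight.detach().clone())
        self.scale = nn.Parameter(torch.zeros(1))      # lambda, initialized to 0

    def forward(self, x):
        w = self.target.weight + self.scale * self.aux_weight
        return nn.functional.linear(x, w, self.target.bias)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        self.target.weight += self.scale * self.aux_weight
        self.scale.zero_()
        return self.target
```

Initializing lambda at zero means training starts exactly from the target model and only gradually mixes in the auxiliary weights if that helps.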
2401.14404 Report Deconstructing Denoising Diffusion Models for Self-Supervised Learning Xinlei Chen, Zhuang Liu, Saining Xie, Kaiming He In this study, we examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive procedure allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning. This paper investigates the representation learning capabilities of Denoising Diffusion Models (DDMs) by deconstructing them into classical Denoising Autoencoders (DAEs). It identifies key components contributing to DDM's representation learning and proposes a simplified DAE architecture. The study aims to understand how various components of modern DDMs affect self-supervised representation learning and to bridge the gap between classical DAEs and modern DDMs. The authors deconstruct a DDM step-by-step, simplifying the tokenizer and removing DDM-specific components to approach a classical DAE while evaluating the representation learning performance at each step. A low-dimensional latent space, rather than tokenizer specifics, is crucial for DDM's representation learning. A simple DAE with patch-wise PCA tokenizer and multi-level noise achieves competitive self-supervised learning performance. DDM's representation learning capability stems primarily from the denoising process, not diffusion. Autoencoder-based methods, including the proposed one, still lag behind contrastive learning. The study primarily focuses on ImageNet and linear probing protocol. denoising diffusion models, denoising autoencoders, self-supervised learning, representation learning, computer vision
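The deconstructed DAE ends up with a patch-wise PCA tokenizer, i.e., a fixed linear projection of image patches into a low-dimensional latent space. A minimal sketch of fitting and applying such a tokenizer with plain SVD; the patch size and latent dimension here are arbitrary choices, and in practice one would fit the PCA on a subsample of patches.

```python
import torch
import torch.nn.functional as F

def fit_patch_pca(images, patch=16, latent_dim=16):
    """Fit a patch-wise PCA tokenizer on images of shape (N, 3, H, W).

    Returns (mean, components) with components of shape (latent_dim, 3*p*p).
    """
    unf = F.unfold(images, patch, stride=patch)          # (N, 3*p*p, L)
    flat = unf.permute(0, 2, 1).reshape(-1, unf.shape[1])  # (N*L, 3*p*p)
    mean = flat.mean(dim=0)
    _, _, vt = torch.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:latent_dim]

def tokenize(images, mean, components, patch=16):
    """Project non-overlapping patches onto the top principal components."""
    unf = F.unfold(images, patch, stride=patch)          # (N, 3*p*p, L)
    patches = unf.permute(0, 2, 1)                       # (N, L, 3*p*p)
    return (patches - mean) @ components.T               # (N, L, latent_dim)
```

The DAE then adds (multi-level) noise to these low-dimensional tokens and learns to denoise them, which is the part of the pipeline the paper identifies as responsible for the learned representation.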
2401.14398 Report pix2gestalt: Amodal Segmentation by Synthesizing Wholes Ege Ozguroglu, Ruoshi Liu, Dídac Surís, Dian Chen, Achal Dave, Pavel Tokmakov, Carl Vondrick We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, including examples that break natural and physical priors, such as art. As training data, we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions. Introduces pix2gestalt, a framework for zero-shot amodal segmentation that leverages pre-trained diffusion models to estimate the shape and appearance of partially occluded objects. Amodal completion is crucial for various applications in vision, graphics, and robotics. Existing methods struggle to generalize beyond closed-world settings. Fine-tunes a pre-trained diffusion model on a synthetic dataset of occluded objects paired with their whole counterparts. The model takes an RGB image and a point prompt as input and generates the whole object behind occlusions. Achieves state-of-the-art amodal segmentation results in a zero-shot setting, outperforming supervised baselines on established benchmarks. Significantly improves the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions. Generates diverse and plausible completions, handling uncertainty in occlusion scenarios. Limitations in situations requiring commonsense or physical reasoning. Future work could explore incorporating such reasoning abilities into the model. amodal segmentation, zero-shot learning, diffusion models, object recognition, 3d reconstruction
2401.14391 Report Rethinking Patch Dependence for Masked Autoencoders Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). CrossMAE's decoder leverages only cross-attention between masked and visible tokens, with no degradation in downstream performance. This design also enables decoding only a small subset of mask tokens, boosting efficiency. Furthermore, each decoder block can now leverage different encoder features, resulting in improved representation learning. CrossMAE matches MAE in performance with 2.5 to 3.7$\times$ less decoding compute. It also surpasses MAE on ImageNet classification and COCO instance segmentation under the same compute. Code and models: https://crossmae.github.io CrossMAE, a masked autoencoder that uses cross-attention between visible and masked image patches for reconstruction, eliminating self-attention among masked patches. Self-attention in the decoder of masked autoencoders is computationally expensive and may not be necessary for good representation learning. CrossMAE replaces self-attention with cross-attention in the decoder, enabling partial reconstruction and incorporating inter-block attention to leverage features from multiple encoder blocks. CrossMAE achieves comparable or superior performance to MAE on ImageNet classification and COCO instance segmentation with 2.5-3.7x less decoding compute. Partial reconstruction, decoding only a subset of masked patches, maintains performance while boosting efficiency. Inter-block attention, allowing decoder blocks to leverage features from different encoder blocks, further improves representation learning. Exploration of more efficient inter-block attention mechanisms. Investigation into the role of self-attention in masked visual pretraining and potential alternatives. masked autoencoders, self-supervised learning, cross-attention, vision transformers, representation learning
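CrossMAE's decoder replaces self-attention among mask tokens with cross-attention from mask queries to the visible-token features. A minimal decoder block in that spirit (layer sizes and pre-norm layout are my choices, not the exact released architecture).

```python
import torch
import torch.nn as nn

class CrossAttnDecoderBlock(nn.Module):
    """Mask-token queries cross-attend to encoder features of visible
    patches; there is no self-attention among the mask tokens."""

    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, mask_tokens, visible_feats):
        q = self.norm_q(mask_tokens)
        kv = self.norm_kv(visible_feats)
        attn_out, _ = self.cross_attn(q, kv, kv, need_weights=False)
        x = mask_tokens + attn_out
        return x + self.mlp(self.norm2(x))

# block = CrossAttnDecoderBlock()
# out = block(torch.randn(2, 49, 512), torch.randn(2, 147, 512))
```

Because each mask query is decoded independently of the others, only a subset of mask tokens needs to be decoded (the paper's partial reconstruction), which is where the compute savings come from.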
2401.14257 Report Sketch2NeRF: Multi-view Sketch-guided Text-to-3D Generation Minglin Chen, Weihao Yuan, Yukun Wang, Zhe Sheng, Yisheng He, Zilong Dong, Liefeng Bo, Yulan Guo Recently, text-to-3D approaches have achieved high-fidelity 3D content generation using text description. However, the generated objects are stochastic and lack fine-grained control. Sketches provide a cheap approach to introduce such fine-grained control. Nevertheless, it is challenging to achieve flexible control from these sketches due to their abstraction and ambiguity. In this paper, we present a multi-view sketch-guided text-to-3D generation framework (namely, Sketch2NeRF) to add sketch control to 3D generation. Specifically, our method leverages pretrained 2D diffusion models (e.g., Stable Diffusion and ControlNet) to supervise the optimization of a 3D scene represented by a neural radiance field (NeRF). We propose a novel synchronized generation and reconstruction method to effectively optimize the NeRF. In the experiments, we collected two kinds of multi-view sketch datasets to evaluate the proposed method. We demonstrate that our method can synthesize 3D consistent contents with fine-grained sketch control while being high-fidelity to text prompts. Extensive results show that our method achieves state-of-the-art performance in terms of sketch similarity and text alignment. Presents Sketch2NeRF, a novel framework for multi-view sketch-guided 3D object generation using neural radiance fields (NeRF) optimized with pretrained 2D diffusion models (Stable Diffusion and ControlNet) for fine-grained control. Addresses limitations of existing text-to-3D methods that lack fine-grained controllability and introduces a method for generating 3D objects from multi-view sketches, a more intuitive and expressive way to specify object structure than text. Leverages pretrained 2D diffusion models to supervise the optimization of a NeRF, employing a novel synchronized generation and reconstruction mechanism. An annealed time schedule enhances generation quality by gradually reducing noise during optimization. Generates high-fidelity 3D objects that accurately reflect the structure and details of the input multi-view sketches. Exhibits better 3D consistency than text-to-3D methods, alleviating issues like the 'Janus' problem. Achieves state-of-the-art performance in terms of sketch similarity and text alignment on collected multi-view sketch datasets. The quality of generated objects degrades with increased noise in sketch poses, highlighting a dependence on accurate sketch alignment. The generation process is computationally expensive, taking around 2 hours on a single NVIDIA RTX 3090 GPU. text-to-3d, sketch-based 3d generation, neural radiance fields (nerf), diffusion models, controllable generation
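Sketch2NeRF uses an annealed time schedule that gradually reduces the diffusion noise level as NeRF optimization proceeds. The exact schedule is not given in this summary, so here is a hedged linear-annealing sketch for sampling the timestep used in score-distillation-style updates.

```python
import random

def sample_annealed_t(step, max_steps, t_min=20, t_start=980, t_end=200):
    """Sample a diffusion timestep whose upper bound decays linearly from
    t_start to t_end over training, so later iterations use weaker noise.
    Illustrative only; the paper's schedule may differ."""
    frac = min(step / max_steps, 1.0)
    t_max = int(t_start + frac * (t_end - t_start))
    return random.randint(t_min, max(t_min + 1, t_max))
```

Early high-noise steps shape the coarse geometry dictated by the sketches, while late low-noise steps refine texture without destroying the established structure.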
2401.14159 Report Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, Lei Zhang We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to combine with the segment anything model (SAM). This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig.1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline. For example, an automatic annotation pipeline based solely on input images can be realized by incorporating models such as BLIP and Recognize Anything. Additionally, incorporating Stable-Diffusion allows for controllable image editing, while the integration of OSX facilitates promptable 3D human motion analysis. Grounded SAM also shows superior performance on open-vocabulary benchmarks, achieving 48.7 mean AP on SegInW (Segmentation in the wild) zero-shot benchmark with the combination of Grounding DINO-Base and SAM-Huge models. Introduces "Grounded SAM," a framework that combines Grounding DINO (an open-set object detector) with the Segment Anything Model (SAM) for open-vocabulary object detection and segmentation using text inputs. Addresses the limitations of existing visual perception models in handling complex open-world scenarios, specifically targeting the challenge of open-set segmentation. Leverages the strengths of Grounding DINO for text-to-box mapping and SAM for box-to-mask mapping, effectively achieving text-to-mask segmentation. Further extends Grounded SAM by integrating other models for tasks like automatic image annotation, image editing, and human motion analysis. Enables detection and segmentation of objects in images based on arbitrary text inputs, including long-tail categories. Achieves state-of-the-art performance on the SegInW (Segmentation in the wild) zero-shot benchmark, demonstrating superior open-vocabulary segmentation capabilities. Provides a versatile framework for building diverse AI systems by integrating other expert models for applications like automatic image annotation, controllable image editing, and human motion analysis. Reliance on the accuracy of the underlying expert models (e.g., Grounding DINO, SAM). Potential limitations in handling complex scenes with overlapping or partially visible objects. open-vocabulary segmentation, grounded segmentation, open-world vision, foundation model assembling, multimodal ai
2401.14069 Report Neural Sinkhorn Gradient Flow Huminhao Zhu, Fangyikang Wang, Chao Zhang, Hanbin Zhao, Hui Qian Wasserstein Gradient Flows (WGF) with respect to specific functionals have been widely used in the machine learning literature. Recently, neural networks have been adopted to approximate certain intractable parts of the underlying Wasserstein gradient flow, resulting in efficient inference procedures. In this paper, we introduce the Neural Sinkhorn Gradient Flow (NSGF) model, which parametrizes the time-varying velocity field of the Wasserstein gradient flow w.r.t. the Sinkhorn divergence to the target distribution, starting from a given source distribution. We utilize the velocity field matching training scheme in NSGF, which only requires samples from the source and target distribution to compute an empirical velocity field approximation. Our theoretical analyses show that as the sample size increases to infinity, the mean-field limit of the empirical approximation converges to the true underlying velocity field. To further enhance model efficiency on high-dimensional tasks, a two-phase NSGF++ model is devised, which first follows the Sinkhorn flow to approach the image manifold quickly (within 5 NFEs) and then refines the samples along a simple straight flow. Numerical experiments with synthetic and real-world benchmark datasets support our theoretical results and demonstrate the effectiveness of the proposed methods. Introduces Neural Sinkhorn Gradient Flow (NSGF), a model that uses neural networks to approximate the velocity field of the Wasserstein Gradient Flow with respect to the Sinkhorn divergence for efficient inference between probability distributions. WGFs are important for machine learning, but existing methods can be computationally expensive. NSGF offers an efficient alternative by using neural networks to approximate the flow. The authors utilize a velocity field matching training scheme, which learns the velocity field by minimizing the difference between a neural network approximation and an empirical velocity field estimated from samples of the source and target distributions. Theoretical analysis shows that the mean-field limit of the empirical velocity field approximation converges to the true underlying velocity field as sample size increases. A two-phase NSGF++ model improves efficiency on high-dimensional tasks by combining Sinkhorn flow and straight flow. Experiments on synthetic and real-world datasets demonstrate the effectiveness of NSGF and NSGF++. The current analysis focuses on the mean-field limit and assumes a specific form for the empirical approximation. Future work could explore different neural network architectures and training objectives to further improve the model's performance. wasserstein gradient flow, sinkhorn divergence, neural networks, probability distributions, inference
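NSGF regresses a neural velocity field onto an empirical velocity computed from source and target samples. As a rough illustration only (not the paper's exact estimator), one can run Sinkhorn iterations for the entropic OT plan between the two sample sets and take the barycentric displacement as a target velocity; the regularization strength `eps` would need to be scaled to the data.

```python
import torch

def sinkhorn_plan(x, y, eps=0.05, iters=200):
    """Entropic OT plan between uniform empirical measures on x (n, d), y (m, d)."""
    c = torch.cdist(x, y) ** 2
    k = torch.exp(-c / eps)
    a = torch.full((x.shape[0],), 1.0 / x.shape[0])
    b = torch.full((y.shape[0],), 1.0 / y.shape[0])
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):                      # standard Sinkhorn updates
        u = a / (k @ v + 1e-12)
        v = b / (k.T @ u + 1e-12)
    return u[:, None] * k * v[None, :]

def empirical_velocity(x, y, eps=0.05):
    """Barycentric displacement x_i -> sum_j P_ij y_j / sum_j P_ij, used here
    as a stand-in for the empirical velocity-field target."""
    p = sinkhorn_plan(x, y, eps)
    bary = (p @ y) / p.sum(dim=1, keepdim=True)
    return bary - x
```

A network v_theta(x, t) would then be trained with a squared-error matching loss against such empirical velocities along the flow.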
2401.13992 Report Diffusion-based Data Augmentation for Object Counting Problems Zhen Wang, Yuelei Li, Jia Wan, Nuno Vasconcelos Crowd counting is an important problem in computer vision due to its wide range of applications in image understanding. Currently, this problem is typically addressed using deep learning approaches, such as Convolutional Neural Networks (CNNs) and Transformers. However, deep networks are data-driven and are prone to overfitting, especially when the available labeled crowd dataset is limited. To overcome this limitation, we have designed a pipeline that utilizes a diffusion model to generate extensive training data. We are the first to generate images conditioned on a location dot map (a binary dot map that specifies the location of human heads) with a diffusion model. We are also the first to use these diverse synthetic data to augment crowd counting models. Our proposed smoothed density map input for ControlNet significantly improves ControlNet's performance in generating crowds in the correct locations. Also, our proposed counting loss for the diffusion model effectively minimizes the discrepancies between the location dot map and the generated crowd images. Additionally, our innovative guidance sampling further directs the diffusion process toward regions where the generated crowd images align most accurately with the location dot map. Collectively, we have enhanced ControlNet's ability to generate specified objects from a location dot map, which can be used for data augmentation in various counting problems. Moreover, our framework is versatile and can be easily adapted to all kinds of counting problems. Extensive experiments demonstrate that our framework improves the counting performance on the ShanghaiTech, NWPU-Crowd, UCF-QNRF, and TRANCOS datasets, showcasing its effectiveness. This paper presents a novel framework leveraging diffusion models for data augmentation in object counting tasks, enhancing the training of counting models by generating synthetic images with precise control over object location and density. Existing crowd counting datasets are limited in size, leading to overfitting in deep learning models. This framework addresses this challenge by synthesizing diverse and realistic training images, improving model generalization and performance. The framework utilizes a pre-trained diffusion model (ControlNet) with several key modifications: 1) Density maps derived from location dot maps are used as input to guide object generation. 2) A counting loss function enforces accurate object placement during training. 3) A counting-guided sampling strategy refines object locations in generated images. The method generates synthetic crowd images that accurately reflect the specified density and spatial distribution from location dot maps. Training counting models with the augmented dataset leads to significant performance improvements across various crowd counting benchmarks (ShanghaiTech, NWPU-Crowd, UCF-QNRF). The framework demonstrates versatility by effectively augmenting data for vehicle counting on the TRANCOS dataset, highlighting its adaptability to different object counting tasks. There might be a trade-off between image quality and strict adherence to location maps due to modifications in the loss function and sampling process. Future work could explore techniques to further improve generated image quality while maintaining accurate object correspondence. data augmentation, object counting, diffusion models, crowd counting, controlnet
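The key conditioning signal is the smoothed density map derived from the head-location dot map. A minimal sketch with a fixed Gaussian kernel (the paper may use a different or adaptive kernel width):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dots_to_density(points, height, width, sigma=8.0):
    """Rasterize head coordinates into a binary dot map, then blur it into a
    smoothed density map whose integral approximately equals the count.

    points: iterable of (x, y) pixel coordinates of annotated heads.
    """
    dot_map = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            dot_map[yi, xi] += 1.0
    return gaussian_filter(dot_map, sigma=sigma)

# density = dots_to_density([(30, 40), (100, 80)], 256, 256)
# density.sum() is ~2.0, matching the annotated count.
```

Feeding this smoothed map (rather than the raw, extremely sparse dot map) to the ControlNet branch gives the generator a spatially extended signal about where people should appear.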
2401.13974 Report BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models Senthil Purushwalkam, Akash Gokul, Shafiq Joty, Nikhil Naik Recent text-to-image generation models have demonstrated incredible success in generating images that faithfully follow input prompts. However, the requirement of using words to describe a desired concept provides limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach to enable personalization capabilities in existing text-to-image diffusion models. We propose a novel architecture (BootPIG) that allows a user to provide reference images of an object in order to guide the appearance of a concept in the generated images. The proposed BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model and utilizes a separate UNet model to steer the generations toward the desired appearance. We introduce a training procedure that allows us to bootstrap personalization capabilities in the BootPIG architecture using data generated from pretrained text-to-image models, LLM chat agents, and image segmentation models. In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while being comparable with test-time finetuning approaches. Through a user study, we validate the preference for BootPIG generations over existing methods both in maintaining fidelity to the reference object's appearance and aligning with textual prompts. This paper proposes BootPIG, a novel architecture that enables zero-shot subject-driven generation in text-to-image models by injecting learned reference image features into a pretrained diffusion model. Personalized image generation, the ability to generate images of specific objects in user-defined contexts, has numerous applications but current methods require time-consuming finetuning or lack fidelity to the reference object. BootPIG uses two UNets: one extracts features from reference images and the other (modified with Reference Self-Attention layers) generates images conditioned on these features. The model is trained using a novel bootstrapping procedure that generates synthetic training data from pretrained text-to-image models, chat agents, and segmentation models. BootPIG outperforms existing zero-shot methods and achieves comparable performance to test-time finetuned methods on standard metrics. User studies demonstrate a preference for BootPIG generations over existing methods in terms of both subject fidelity and prompt fidelity. BootPIG can be trained efficiently, requiring only approximately 1 hour on 16 A100 GPUs. BootPIG may struggle with prompts that significantly modify the subject's appearance or require fine-grained details. The method inherits limitations and biases from the underlying generative model. text-to-image generation, personalized image generation, subject-driven generation, diffusion models, zero-shot learning
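BootPIG injects appearance information through Reference Self-Attention layers, where the generation UNet attends to features extracted from the reference image by a second UNet. A sketch of the core idea, concatenating reference keys/values into the base model's attention; dimensions and naming are mine, not the released architecture.

```python
import torch
import torch.nn as nn

class ReferenceSelfAttention(nn.Module):
    """Self-attention whose keys/values are the generator's own tokens
    concatenated with reference-image tokens, letting the generation copy
    appearance details from the reference features."""

    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, gen_tokens, ref_tokens):
        q = self.norm(gen_tokens)
        kv = torch.cat([self.norm(gen_tokens), self.norm(ref_tokens)], dim=1)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return gen_tokens + out

# layer = ReferenceSelfAttention()
# x = layer(torch.randn(1, 64, 320), torch.randn(1, 64, 320))
```

Only these injected layers (and the reference UNet) need training, which is consistent with the short, bootstrapped training schedule described in the entry.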
2401.13942 Report StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models Mohan Zhou, Yalong Bai, Qing Yang, Tiejun Zhao The ability to fine-tune generative models for text-to-image generation tasks is crucial, particularly facing the complexity involved in accurately interpreting and visualizing textual inputs. While LoRA is efficient for language model adaptation, it often falls short in text-to-image tasks due to the intricate demands of image generation, such as accommodating a broad spectrum of styles and nuances. To bridge this gap, we introduce StyleInject, a specialized fine-tuning approach tailored for text-to-image models. StyleInject comprises multiple parallel low-rank parameter matrices, maintaining the diversity of visual features. It dynamically adapts to varying styles by adjusting the variance of visual features based on the characteristics of the input signal. This approach significantly minimizes the impact on the original model's text-image alignment capabilities while adeptly adapting to various styles in transfer learning. StyleInject proves particularly effective in learning from and enhancing a range of advanced, community-fine-tuned generative models. Our comprehensive experiments, including both small-sample and large-scale data fine-tuning as well as base model distillation, show that StyleInject surpasses traditional LoRA in both text-image semantic consistency and human preference evaluation, all while ensuring greater parameter efficiency. Introduces StyleInject, a parameter-efficient fine-tuning approach for text-to-image diffusion models that improves upon LoRA by dynamically adapting to various styles while maintaining semantic consistency. Addresses the limitations of LoRA in text-to-image generation, which often struggles with stylistic diversity and preserving text-image alignment. Employs dynamic multi-style adaptation with a style router for instance-wise feature adaptation and uses AdaIN for style transfer, enabling fine-grained control over visual features. Outperforms LoRA in data-driven fine-tuning, achieving better text-image semantic consistency and human preference scores. Effectively distills knowledge from community-fine-tuned SDMs, transferring stylistic elements while maintaining the original model's capabilities. Demonstrates improved performance in DreamBooth, enabling the generation of customized subjects with higher quality and consistency. The optimal number of training epochs can vary significantly across different experimental settings, requiring careful monitoring and potential early stopping. Further research could explore extending StyleInject to other generative models beyond diffusion models. text-to-image generation, diffusion models, parameter efficient tuning, style transfer, model distillation
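StyleInject keeps multiple parallel low-rank matrices and routes between them per input, on top of adjusting visual-feature statistics. A hedged sketch of a routed multi-branch low-rank adapter; the router design is simplified and the AdaIN-style variance modulation the paper describes is omitted here.

```python
import torch
import torch.nn as nn

class RoutedLoRALinear(nn.Module):
    """Frozen base linear plus E parallel rank-r updates mixed by an
    input-dependent router (simplified; StyleInject additionally modulates
    feature variance in an AdaIN-like way)."""

    def __init__(self, base: nn.Linear, experts=4, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.down = nn.Parameter(torch.randn(experts, d_in, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(experts, rank, d_out))
        self.router = nn.Linear(d_in, experts)

    def forward(self, x):                                  # x: (B, T, d_in)
        gate = self.router(x.mean(dim=1)).softmax(dim=-1)  # (B, E) per-instance mix
        delta = torch.einsum("btd,edr,ero->beto", x, self.down, self.up)
        delta = torch.einsum("beto,be->bto", delta, gate)
        return self.base(x) + delta
```

Compared with a single LoRA branch, the per-instance gate lets different styles use different low-rank subspaces instead of forcing all styles through one update.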
2401.13795 Report Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, Ismail B. Tutar As online shopping is growing, the ability for buyers to virtually visualize products in their settings-a phenomenon we define as "Virtual Try-All"-has become crucial. Recent diffusion models inherently contain a world model, rendering them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of products. In contrast, personalization-driven models such as DreamPaint are good at preserving the item's details but they are not optimized for real-time applications. We present "Diffuse to Choose," a novel diffusion-based image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details in a given reference item while ensuring accurate semantic manipulations in the given scene content. Our approach is based on incorporating fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, alongside with a perceptual loss to further preserve the reference item's details. We conduct extensive testing on both in-house and publicly available datasets, and show that Diffuse to Choose is superior to existing zero-shot diffusion inpainting methods as well as few-shot diffusion personalization algorithms like DreamPaint. Introduce "Diffuse to Choose" (DTC), a novel diffusion-based image-conditioned inpainting model for Virtual Try-All that balances fast inference with high-fidelity detail retention. To address the need for an efficient and effective solution for virtual product visualization in online shopping, enabling customers to digitally "try" any product in any setting. Incorporates fine-grained features from the reference image into the latent feature maps of the main diffusion model using a secondary U-Net encoder and affine transformations. Also utilizes perceptual loss for improved feature alignment. DTC surpasses existing zero-shot diffusion inpainting methods like Paint By Example. DTC matches the performance of few-shot diffusion personalization algorithms like DreamPaint while enabling real-time inference. DTC effectively handles in-the-wild images and references, preserves fine-grained product details, and ensures seamless integration into target scenes. DTC might struggle with very fine-grained details, particularly text engravings due to limitations of VAE decoder. Model might alter human poses due to its pose-agnostic nature, potentially causing discrepancies in full-body coverage. diffusion models, image inpainting, virtual try-on, e-commerce, computer vision
2401.13641 Report How Good is ChatGPT at Face Biometrics? A First Look into Recognition, Soft Biometrics, and Explainability Ivan DeAndres-Tame, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia Large Language Models (LLMs) such as GPT developed by OpenAI, have already shown astonishing results, introducing quick changes in our society. This has been intensified by the release of ChatGPT which allows anyone to interact in a simple conversational way with LLMs, without any experience in the field needed. As a result, ChatGPT has been rapidly applied to many different tasks such as code- and song-writer, education, virtual assistants, etc., showing impressive results for tasks for which it was not trained (zero-shot learning). The present study aims to explore the ability of ChatGPT, based on the recent GPT-4 multimodal LLM, for the task of face biometrics. In particular, we analyze the ability of ChatGPT to perform tasks such as face verification, soft-biometrics estimation, and explainability of the results. ChatGPT could be very valuable to further increase the explainability and transparency of automatic decisions in human scenarios. Experiments are carried out in order to evaluate the performance and robustness of ChatGPT, using popular public benchmarks and comparing the results with state-of-the-art methods in the field. The results achieved in this study show the potential of LLMs such as ChatGPT for face biometrics, especially to enhance explainability. For reproducibility reasons, we release all the code in GitHub. This paper presents the first study exploring the capabilities of ChatGPT, specifically the GPT-4 multimodal LLM, for face biometrics tasks including face verification, soft biometrics estimation, and result explainability. ChatGPT's rapid adoption and impressive zero-shot learning capabilities make it important to assess its potential in face biometrics, a field crucial for security and human-computer interaction. The study uses ChatGPT's API with specifically designed prompts to perform face verification and soft biometrics estimation on various benchmark databases, comparing its performance with state-of-the-art models. The explainability of ChatGPT’s outputs is analyzed qualitatively. ChatGPT demonstrates promising results for face verification in controlled environments, but its performance declines in challenging scenarios such as surveillance or extreme conditions. ChatGPT shows potential for soft biometrics estimation, outperforming some specialized models on certain attributes like age and ethnicity in LFW, and gender in MAAD-Face. ChatGPT exhibits the ability to provide textual explanations for its decisions, enhancing the transparency of its outputs, despite occasional inaccuracies. The study is limited by the computational cost and API request limitations of ChatGPT, restricting the number of experiments. Further research is needed to explore bias mitigation techniques in ChatGPT for fairer face biometrics applications. large language models, chatgpt, face recognition, soft biometrics, explainability
2401.13627 Report Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, Chao Dong We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses generative prior and the power of model scaling up. Leveraging multi-modal techniques and advanced generative prior, SUPIR marks a significant advance in intelligent and realistic image restoration. As a pivotal catalyst within SUPIR, model scaling dramatically enhances its capabilities and demonstrates new potential for image restoration. We collect a dataset comprising 20 million high-resolution, high-quality images for model training, each enriched with descriptive text annotations. SUPIR provides the capability to restore images guided by textual prompts, broadening its application scope and potential. Moreover, we introduce negative-quality prompts to further improve perceptual quality. We also develop a restoration-guided sampling method to suppress the fidelity issue encountered in generative-based restoration. Experiments demonstrate SUPIR's exceptional restoration effects and its novel capacity to manipulate restoration through textual prompts. This paper proposes SUPIR, the largest-ever image restoration method, achieving high-fidelity and intelligent restoration through model scaling, a novel adaptor, a large image-text dataset, and restoration-guided sampling. Existing IR methods are limited by the scale of generative models and often lack the intelligence for targeted restoration. Model scaling significantly enhances model capability, pushing the boundaries of image restoration quality and intelligence. The authors utilize the StableDiffusion-XL as the generative prior and design a large-scale adaptor with a ZeroSFT connector. They collect 20 million high-resolution images with text annotations for training and introduce negative-quality samples/prompts for quality enhancement. A restoration-guided sampling method is developed to ensure fidelity. SUPIR achieves state-of-the-art performance on non-reference assessment metrics, indicating superior perceptual quality. It offers flexible control over restoration through textual prompts, enabling targeted restoration and manipulation. Extensive experiments on both synthetic and real-world data validate the effectiveness and superiority of the method. Negative prompts might introduce artifacts when low-quality inputs lack semantic clarity. Full-reference metrics show limitations in evaluating high-fidelity restoration, necessitating new evaluation methods. image restoration, generative prior, model scaling, textual prompt, diffusion models
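SUPIR's restoration-guided sampling suppresses fidelity drift by steering the generative sampler back toward the low-quality input. The exact rule is not given in this summary, so the sketch below is only one plausible form: blend the predicted clean latent with the encoded degraded input using a step-dependent weight.

```python
import torch

def restoration_guided_x0(x0_pred, lq_latent, t, t_max, strength=0.3):
    """Blend the model's clean-image prediction toward the (encoded)
    low-quality input. Guidance is strongest early (large t) and fades out
    so late steps keep generative detail. Illustrative only."""
    w = strength * (t / t_max)
    return (1.0 - w) * x0_pred + w * lq_latent

# Inside a DDIM/EDM-style loop one would replace the raw x0 prediction with
# restoration_guided_x0(x0_pred, lq_latent, t, T) before computing x_{t-1}.
```

The trade-off is explicit: larger `strength` keeps the output closer to the degraded observation, smaller values let the generative prior hallucinate more detail.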
2401.13601 Report MM-LLMs: Recent Advances in MultiModal Large Language Models Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing $122$ MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain. This paper presents a comprehensive survey of MultiModal Large Language Models (MM-LLMs), focusing on their recent advancements in bridging language models with other modalities. MM-LLMs represent a significant advancement in AI, striving to combine the reasoning and decision-making capabilities of LLMs with the rich information content of various modalities (e.g., image, video, audio). The authors provide a detailed analysis of MM-LLM design, encompassing model architecture (with five key components) and training pipelines (including Multimodal Pre-Training and Instruction Tuning). The paper introduces a taxonomy of 122 SOTA MM-LLMs, categorized by functionality and design. It reviews the performance of major MM-LLMs on 18 VL benchmarks, providing a comparative analysis of their capabilities. The authors distill key training recipes for enhancing MM-LLMs based on insights from state-of-the-art models. The paper acknowledges the rapidly evolving nature of MM-LLMs and potential omissions, addressed by maintaining a dedicated website for real-time updates. The paper provides concise overviews of individual MM-LLMs due to space limitations, committing to more detailed information on their website. multimodal learning, large language models, vision-language, multimodal instruction tuning, survey
2401.13560 Report SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, Lei Zhu The Transformer architecture has shown a remarkable ability in modeling global relationships. However, it poses a significant computational challenge when processing high-dimensional medical images. This hinders its development and widespread adoption in this task. Mamba, as a State Space Model (SSM), recently emerged as a notable approach for modeling long-range dependencies in sequences, excelling in the natural language processing field with its remarkable memory efficiency and computational speed. Inspired by its success, we introduce SegMamba, a novel 3D medical image \textbf{Seg}mentation \textbf{Mamba} model, designed to effectively capture long-range dependencies within whole volume features at every scale. Our SegMamba, in contrast to Transformer-based methods, excels in whole volume feature modeling from a state space model standpoint, maintaining superior processing speed, even with volume features at a resolution of $64\times 64\times 64$. Comprehensive experiments on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our SegMamba. The code for SegMamba is available at: https://github.com/ge-xing/SegMamba This paper introduces SegMamba, a novel 3D medical image segmentation model based on the Mamba architecture for capturing long-range dependencies within whole-volume features efficiently. Modeling global relationships in 3D medical image segmentation is crucial but computationally challenging. Transformer-based methods, while effective, struggle with high-resolution images. SegMamba addresses this challenge by leveraging the Mamba architecture for memory-efficient and fast long-range dependency modeling. SegMamba combines a U-shaped structure with the Mamba architecture. It incorporates a tri-orientated Mamba (ToM) module for multi-directional feature modeling and a gated spatial convolution (GSC) module to enhance spatial feature representation. SegMamba achieves state-of-the-art performance on BraTS2023, AIIB2023, and the newly proposed CRC-500 datasets. Ablation studies demonstrate the effectiveness of the GSC and ToM modules in improving segmentation accuracy. SegMamba exhibits superior computational efficiency compared to transformer-based methods, even with high-resolution input. The paper acknowledges potential limitations in evaluating the generalizability of SegMamba due to the limited number of datasets used. Future work may explore extending SegMamba to multi-modal medical image segmentation. Investigating the integration of alternative spatial feature extraction modules within the SegMamba framework. 3d medical image segmentation, state space models, mamba, long-range dependencies, computational efficiency
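As a rough illustration of the tri-orientated scanning idea, the following PyTorch sketch flattens a 3D feature volume along three axis orders, runs a shared 1D mixer over each sequence, and averages the results. The Conv1d mixer is only a placeholder standing in for the actual Mamba SSM, and all class and variable names here are hypothetical rather than taken from the SegMamba repository.

import torch
import torch.nn as nn


class ToyTriOrientedMixer(nn.Module):
    """Illustrative stand-in for the tri-orientated Mamba (ToM) idea:
    flatten a 3D feature volume along three axis orders, run a shared
    1D sequence mixer over each, and average the results."""
    def __init__(self, channels: int):
        super().__init__()
        # cheap 1D mixer used purely as a placeholder for a Mamba block
        self.mixer = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W)
        b, c = x.shape[:2]
        outs = []
        for dims in [(2, 3, 4), (3, 4, 2), (4, 2, 3)]:  # three scan orders
            seq = x.permute(0, 1, *dims).reshape(b, c, -1)          # (B, C, L)
            mixed = self.mixer(seq).reshape(b, c, *[x.shape[i] for i in dims])
            # undo the permutation so all branches align spatially
            inv = [0, 1] + [dims.index(i) + 2 for i in (2, 3, 4)]
            outs.append(mixed.permute(*inv))
        return torch.stack(outs).mean(dim=0)


feat = torch.randn(1, 8, 16, 16, 16)
print(ToyTriOrientedMixer(8)(feat).shape)  # torch.Size([1, 8, 16, 16, 16])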
2401.13555 Report Benchmarking the Fairness of Image Upsampling Methods Mike Laszkiewicz, Imant Daunhawer, Julia E. Vogt, Asja Fischer, Johannes Lederer Recent years have witnessed a rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics, inspired by their supervised fairness counterparts, to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results. All experiments can be reproduced using our provided repository. The paper introduces a comprehensive framework for benchmarking the performance and fairness of conditional generative models, focusing on image upsampling. Assessing the fairness of generative models is crucial to mitigate potential biases in applications like image enhancement, which can have societal impacts. The authors propose novel fairness metrics (RDP, PR, UCPR) inspired by supervised fairness counterparts, alongside traditional performance measures. They create a benchmark using a subset of the FairFace dataset, called UnfairFace, mimicking racial distribution biases in common datasets. Training data bias significantly affects the fairness of image upsampling models across all races. Denoising Diffusion Restoration Models (DDRM) show the most significant fairness discrepancies between biased and unbiased datasets. While some models demonstrate better fairness, statistical tests reveal that none achieve statistically significant fairness, emphasizing the need for further research. The evaluation is limited to 128x128 resolution images due to the lack of fairness labels in higher-resolution datasets. The definition of fairness relies on race labels, which are inherently complex and subject to limitations in representation and granularity. conditional generative models, computer vision, image upsampling, fairness, dataset bias
2401.13388 Report UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion Wei Li, Xue Xu, Jiachen Liu, Xinyan Xiao Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multi-modal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities. This paper introduces UNIMO-G, a novel multimodal conditional diffusion framework for image generation using interleaved textual and visual prompts. Existing text-to-image models struggle to generate images with intricate details due to the limitations of concise textual descriptions. UNIMO-G addresses this by enabling more control and detail through multimodal prompts. UNIMO-G leverages a Multimodal Large Language Model (MLLM) to encode multimodal prompts and a conditional denoising diffusion network for image generation. It is trained in two stages: pre-training on text-image pairs for basic generation and fine-tuning with multimodal prompts for enhanced controllability. UNIMO-G outperforms existing VL-to-image models in text-to-image generation on MS-COCO. It excels in zero-shot single-entity subject-driven generation, achieving state-of-the-art results on DreamBench. UNIMO-G exhibits superior performance in zero-shot multi-entity subject-driven generation, as demonstrated on the newly introduced MultiBench. UNIMO-G shares common limitations with other image generation models, such as occasional inaccuracies in complex compositions and limitations in visual faithfulness. The potential for misuse, particularly in creating deepfakes, raises ethical concerns. multimodal image generation, diffusion models, multimodal large language models, subject-driven generation, zero-shot learning
2401.13363 Report Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons Zhe Xu, Kun Wei, Xu Yang, Cheng Deng Human dance generation (HDG) aims to synthesize realistic videos from images and sequences of driving poses. Despite great success, existing methods are limited to generating videos of a single person with specific backgrounds, while the generalizability for real-world scenarios with multiple persons and complex backgrounds remains unclear. To systematically measure the generalizability of HDG models, we introduce a new task, dataset, and evaluation protocol of compositional human dance generation (cHDG). Evaluating the state-of-the-art methods on cHDG, we empirically find that they fail to generalize to real-world scenarios. To tackle the issue, we propose a novel zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos consistent with arbitrary multiple persons and background while precisely following the driving poses. Specifically, in contrast to straightforward DDIM or null-text inversion, we first present a pose-aware inversion method to obtain the noisy latent code and initialization text embeddings, which can accurately reconstruct the composed reference image. Since directly generating videos from them will lead to severe appearance inconsistency, we propose a compositional augmentation strategy to generate augmented images and utilize them to optimize a set of generalizable text embeddings. In addition, consistency-guided sampling is elaborated to encourage the background and keypoints of the estimated clean image at each reverse step to be close to those of the reference image, further improving the temporal consistency of generated videos. Extensive qualitative and quantitative results demonstrate the effectiveness and superiority of our approach. This paper introduces a novel dataset for compositional human dance generation (cHDG) and proposes a new zero-shot method for this task that leverages text embeddings optimized on augmented data. cHDG is a challenging task with no previous work, making this research significant for advancing the field. The proposed method utilizes a pretrained Stable Diffusion model and optimizes text embeddings on augmented data with varying numbers of people and backgrounds. This allows the model to learn generalizable representations for cHDG. The proposed method achieves state-of-the-art performance on cHDG benchmarks, outperforming both supervised and zero-shot baselines in terms of temporal consistency and pose accuracy. The approach demonstrates superior performance in a user study, indicating higher overall generation quality. The method is efficient in terms of storage, requiring only optimized text embeddings instead of storing entire models. The study primarily focuses on a limited set of 10 persons, 10 backgrounds, and 10 pose sequences, which might not fully represent the diversity in real-world scenarios. Further investigation is required to explore the impact of a larger and more diverse dataset on the generalizability of the proposed method. compositional human dance generation, zero-shot learning, text embeddings, stable diffusion, data augmentation
2401.13329 Report Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval Dezhao Luo, Shaogang Gong, Jiabo Huang, Hailin Jin, Yang Liu Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships. Due to the lack of a diverse and generalisable VMR dataset to facilitate learning scalable moment-text associations, existing methods resort to joint training on both source and target domain videos for cross-domain applications. Meanwhile, recent developments in vision-language multimodal models pre-trained on large-scale image-text and/or video-text pairs are only based on coarse associations (weakly labelled). They are inadequate to provide fine-grained moment-text correlations required for cross-domain VMR. In this work, we solve the problem of unseen cross-domain VMR, where certain visual and textual concepts do not overlap across domains, by only utilising target domain sentences (text prompts) without accessing their videos. To that end, we explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences, enabling us to simulate target domain videos. We address two problems in video editing for optimising unseen domain VMR: (1) generation of high-quality simulation videos of different moments with subtle distinctions, (2) selection of simulation videos that complement existing source training videos without introducing harmful noise or unnecessary repetitions. On the first problem, we formulate a two-stage video diffusion generation controlled simultaneously by (1) the original video structure of a source video, (2) subject specifics, and (3) a target sentence prompt. This ensures fine-grained variations between video moments. On the second problem, we introduce a hybrid selection mechanism that combines two quantitative metrics for noise filtering and one qualitative metric for leveraging VMR prediction on simulation video selection. This paper tackles unseen cross-domain video moment retrieval (VMR) by using generative video diffusion to simulate target-domain videos from source videos and target-domain sentences, without accessing any target-domain videos. Existing cross-domain VMR methods rely on joint training with target-domain videos, and vision-language models pre-trained on large-scale image-text or video-text pairs provide only coarse, weakly labelled associations that are inadequate for the fine-grained moment-text correlations VMR requires. The method formulates a two-stage video diffusion generation conditioned simultaneously on the original source video structure, subject specifics, and a target sentence prompt, and introduces a hybrid selection mechanism that combines two quantitative noise-filtering metrics with one qualitative VMR-prediction metric to pick simulation videos that complement the source training set. Simulated target-domain videos allow cross-domain VMR to be learned using only target-domain sentences as text prompts. Conditioning generation on source video structure, subject specifics, and the target sentence yields fine-grained variations between simulated moments. The hybrid selection mechanism filters out noisy or repetitive simulation videos before they are added to the source training data. The abstract does not report quantitative benchmark figures or discuss specific limitations and future directions. video moment retrieval, cross-domain generalization, video diffusion, video editing, data simulation
2401.13307 Report ChatterBox: Multi-round Multimodal Referring and Grounding Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions. Code, data, and model are available at: https://github.com/sunsmarterjie/ChatterBox. This paper introduces a new task called multi-round multimodal referring and grounding (MRG) for instance-level multimodal dialogues and presents a new benchmark and an efficient vision-language model, ChatterBox, to facilitate research in this direction. A powerful multimodal agent should understand logically related questions and perform basic vision-aware tasks like referring and grounding, which few existing models can do effectively. The ChatterBox model employs a two-branch architecture, with one branch handling language logic and the other focusing on visual feature extraction and recognition for grounding. A two-stage optimization strategy leverages both the new benchmark data and auxiliary data to enhance the model's stability and instance-level understanding. ChatterBox outperforms previous models in MRG tasks both quantitatively and qualitatively, showing a better understanding of multi-round dialogues and reasoning. The model effectively performs single-round referring expression and visual grounding tasks, surpassing prior models in benchmark evaluations. Diagnostic studies confirm that the newly collected benchmark data and pronoun replacement during training contribute significantly to the model's improved performance in MRG tasks. The model's design, while effective for referring and grounding, requires further engineering to support tasks beyond these. Future work could explore training a universal tokenizer for vision-language understanding to enhance the model's capabilities. multimodal dialogue, referring expression, visual grounding, vision-language model, instance-level understanding
2401.13221 Report Unified-Width Adaptive Dynamic Network for All-In-One Image Restoration Yimin Xu, Nanxi Gao, Zhongyun Shan, Fei Chao, Rongrong Ji In contrast to traditional image restoration methods, all-in-one image restoration techniques are gaining increased attention for their ability to restore images affected by diverse and unknown corruption types and levels. However, contemporary all-in-one image restoration methods omit task-wise difficulties and employ the same networks to reconstruct images afflicted by diverse degradations. This practice leads to an underestimation of the task correlations and suboptimal allocation of computational resources. To elucidate task-wise complexities, we introduce a novel concept positing that intricate image degradation can be represented in terms of elementary degradation. Building upon this foundation, we propose an innovative approach, termed the Unified-Width Adaptive Dynamic Network (U-WADN), consisting of two pivotal components: a Width Adaptive Backbone (WAB) and a Width Selector (WS). The WAB incorporates several nested sub-networks with varying widths, which facilitates the selection of the most apt computations tailored to each task, thereby striking a balance between accuracy and computational efficiency during runtime. For different inputs, the WS automatically selects the most appropriate sub-network width, taking into account both task-specific and sample-specific complexities. Extensive experiments across a variety of image restoration tasks demonstrate that the proposed U-WADN achieves better performance while simultaneously reducing up to 32.3% of FLOPs and providing approximately 15.7% real-time acceleration. The code has been made available at https://github.com/xuyimin0926/U-WADN. This paper presents a novel Unified-Width Adaptive Dynamic Network (U-WADN) designed for all-in-one image restoration, dynamically allocating computational resources based on both task-specific and sample-specific difficulties. Current all-in-one image restoration methods treat all degradations equally, leading to suboptimal resource allocation. This paper introduces a method to assess and leverage task-wise complexity for improved efficiency. The U-WADN uses a Width Adaptive Backbone (WAB) with nested sub-networks of varying widths and a Width Selector (WS) to choose the appropriate sub-network for each sample based on its task and complexity. U-WADN outperforms state-of-the-art methods in PSNR/SSIM across five image restoration tasks, particularly excelling in complex tasks like dehazing and deraining. It achieves a 32.3% reduction in FLOPs and a 15.7% acceleration in speed compared to the baseline. The proposed method allows for a flexible trade-off between performance and efficiency by adjusting the sparsity target. The current work focuses on 'noisy-rain-hazy' scenarios; exploring other restoration tasks is left for future work. The selection of the optimal sparsity target is based on empirical analysis; developing a more systematic approach is desirable. image restoration, all-in-one network, dynamic neural network, resource allocation, task-specific complexity
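The width-adaptive idea can be sketched as a "slimmable" layer whose leading channel slices form nested narrower sub-networks, with a width ratio chosen per input. The toy PyTorch layer below is only an illustration of that concept under assumed names and ratios; it is not the released U-WADN code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmableConv2d(nn.Module):
    """Toy width-adaptive conv: a single weight tensor whose leading
    input/output channel slices act as nested narrower sub-networks."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x: torch.Tensor, width: float = 1.0) -> torch.Tensor:
        out_ch = max(1, int(self.weight.shape[0] * width))
        in_ch = x.shape[1]  # input may already be sliced by the previous layer
        w = self.weight[:out_ch, :in_ch]
        return F.conv2d(x, w, self.bias[:out_ch], padding=w.shape[-1] // 2)


layer = SlimmableConv2d(16, 32)
x = torch.randn(2, 16, 8, 8)
print(layer(x, width=1.0).shape)   # torch.Size([2, 32, 8, 8])
print(layer(x, width=0.5).shape)   # torch.Size([2, 16, 8, 8])

In the full method, a separate width selector would predict the ratio per sample and task; here the ratio is simply passed in by hand.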
2401.13203 Report Style-Consistent 3D Indoor Scene Synthesis with Decoupled Objects Yunfan Zhang, Hong Huang, Zhiwei Xiong, Zhiqi Shen, Guosheng Lin, Hao Wang, Nicholas Vun Controllable 3D indoor scene synthesis stands at the forefront of technological progress, offering various applications like gaming, film, and augmented/virtual reality. The capability to stylize and de-couple objects within these scenarios is a crucial factor, providing an advanced level of control throughout the editing process. This control extends not just to manipulating geometric attributes like translation and scaling but also includes managing appearances, such as stylization. Current methods for scene stylization are limited to applying styles to the entire scene, without the ability to separate and customize individual objects. Addressing the intricacies of this challenge, we introduce a unique pipeline designed for synthesizing 3D indoor scenes. Our approach involves strategically placing objects within the scene, utilizing information from professionally designed bounding boxes. Significantly, our pipeline prioritizes maintaining style consistency across multiple objects within the scene, ensuring a cohesive and visually appealing result aligned with the desired aesthetic. The core strength of our pipeline lies in its ability to generate 3D scenes that are not only visually impressive but also exhibit features like photorealism, multi-view consistency, and diversity. These scenes are crafted in response to various natural language prompts, demonstrating the versatility and adaptability of our model. This paper proposes a novel 3D indoor scene synthesis pipeline that generates decoupled mesh objects with consistent styles using text prompts or single-view images, allowing for individual object stylization and manipulation. Controllable 3D indoor scene synthesis is crucial for applications like gaming, film, and VR/AR, and this pipeline offers enhanced control over object stylization and placement within a scene. The pipeline utilizes mesh representations for objects, employs a cascaded stylization approach for multi-object style consistency, leverages ChatGPT for object placement reasoning based on bounding boxes, and allows for user control over object manipulation within the scene. The pipeline generates high-fidelity 3D indoor scenes with consistent styles across multiple objects. It outperforms existing methods in terms of visual quality, style consistency, and user control, as demonstrated through qualitative and quantitative comparisons and user studies. The decoupled mesh representation enables flexible object manipulation and scene editing capabilities. Further exploration of style supervision from the whole scene is needed. Incorporating optimization algorithms for object arrangement, such as LEGO-Net, could enhance scene composition. 3d scene synthesis, style transfer, mesh generation, text-to-3d, indoor scene understanding
2401.13011 Report CCA: Collaborative Competitive Agents for Image Editing Tiankai Hang, Shuyang Gu, Dong Chen, Xin Geng, Baining Guo This paper presents a novel generative model, Collaborative Competitive Agents (CCA), which leverages the capabilities of multiple Large Language Models (LLMs) based agents to execute complex tasks. Drawing inspiration from Generative Adversarial Networks (GANs), the CCA system employs two equal-status generator agents and a discriminator agent. The generators independently process user instructions and generate results, while the discriminator evaluates the outputs, and provides feedback for the generator agents to further reflect and improve the generation results. Unlike the previous generative model, our system can obtain the intermediate steps of generation. This allows each generator agent to learn from other successful executions due to its transparency, enabling a collaborative competition that enhances the quality and robustness of the system's results. The primary focus of this study is image editing, demonstrating the CCA's ability to handle intricate instructions robustly. The paper's main contributions include the introduction of a multi-agent-based generative model with controllable intermediate steps and iterative optimization, a detailed examination of agent relationships, and comprehensive experiments on image editing. Code is available at https://github.com/TiankaiHang/CCA. This paper introduces Collaborative Competitive Agents (CCA), a novel generative model leveraging multiple Large Language Models (LLMs) as agents to perform complex tasks, particularly image editing. Existing generative models struggle with complex, compound tasks and lack transparency in the generation process, hindering learning from other models. CCA addresses these challenges. Inspired by GANs, CCA uses two generator agents and one discriminator agent. Generators process instructions and produce results, while the discriminator evaluates and provides feedback. This process iterates until satisfactory results are achieved. CCA demonstrates robust handling of intricate image editing instructions, outperforming previous methods. The study highlights the importance of collaboration and competition among agents for improved results. A hierarchical tool configuration enables effective tool utilization by the agents. The current implementation primarily focuses on image editing, with potential for broader applications. Future work can explore optimizing agent communication and feedback mechanisms for enhanced efficiency. generative models, multi-agent systems, large language models, image editing, collaboration and competition
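The collaborative-competitive loop described above can be summarized in a short Python sketch in which two generator agents propose edits, a discriminator agent ranks them, and its critique plus the winning plan are fed back to both generators. All callables, dictionary keys, and stub behaviours here are hypothetical placeholders meant only to show the control flow, not the actual CCA implementation.

def cca_round(instruction, image, generators, discriminator, max_iters=3):
    """Sketch of the collaborative-competitive loop: generator agents edit
    independently, a discriminator agent ranks the results, and its feedback
    (plus the current best plan) is shared with all generators."""
    feedback = None
    for _ in range(max_iters):
        candidates = [g(instruction, image, feedback) for g in generators]
        verdict = discriminator(instruction, image, candidates)
        if verdict["satisfied"]:
            return candidates[verdict["best"]]
        # both agents see the critique and the currently best plan
        feedback = {
            "critique": verdict["critique"],
            "best_plan": candidates[verdict["best"]].get("plan"),
        }
    return candidates[verdict["best"]]


def stub_generator(tag):
    # stand-in for an LLM-driven editing agent
    def run(instruction, image, feedback):
        return {"image": image, "plan": f"{tag}: apply '{instruction}'"}
    return run


def stub_discriminator(instruction, image, candidates):
    # stand-in for an LLM-driven evaluator
    return {"satisfied": True, "best": 0, "critique": "ok"}


result = cca_round("make the sky pink", "photo.png",
                   [stub_generator("A"), stub_generator("B")], stub_discriminator)
print(result["plan"])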
2401.12979 Report GALA: Generating Animatable Layered Assets from a Single Scan Taeksoo Kim, Byungjun Kim, Shunsuke Saito, Hanbyul Joo We present GALA, a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars with any pose. Existing reconstruction approaches often treat clothed humans as a single-layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, thereby limiting the utility of the meshes for downstream applications. Decomposing a single-layer mesh into separate layers is a challenging task because it requires the synthesis of plausible geometry and texture for the severely occluded regions. Moreover, even with successful decomposition, meshes are not normalized in terms of poses and body shapes, failing coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as geometry and appearance prior for humans and other assets. We first separate the input mesh using the 3D surface segmentation extracted from multi-view 2D segmentations. Then we synthesize the missing geometry of different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once we complete inpainting high-fidelity 3D geometry, we also apply the same SDS loss to its texture to obtain the complete appearance including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space normalized in terms of poses and human shapes, hence supporting effortless composition to novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach for decomposition, canonicalization, and composition tasks compared to existing solutions. GALA decomposes a single-layer clothed 3D human scan into complete multi-layered 3D assets, enabling 3D garment transfer and avatar customization in any pose. Existing 3D human reconstruction methods often produce single-layer meshes, limiting their use in applications like virtual try-on or avatar customization that require layered and animatable assets. The method leverages a pre-trained 2D diffusion model as a geometry and appearance prior. It separates the input mesh using multi-view 2D segmentation and synthesizes missing geometry in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Texture inpainting using SDS completes the appearance. Outperforms state-of-the-art text-driven 3D editing methods in decomposition tasks. Enables robust canonicalization of clothed humans from a single scan, surpassing existing methods. Successfully transfers garments and reposes decomposed assets to create novel, animatable avatars. Currently generates a static canonical shape, limiting the accurate reposing of loose clothing. Relies on accurate 2D segmentation, which can be a bottleneck. 3d garment transfer, avatar customization, 3d decomposition, score distillation sampling, diffusion models
2401.12978 Report Zero-Shot Learning for the Primitives of 3D Affordance in General Objects Hyeonwoo Kim, Sookwan Han, Patrick Kwon, Hanbyul Joo One of the major challenges in AI is teaching machines to precisely respond and utilize environmental functionalities, thereby achieving the affordance awareness that humans possess. Despite its importance, the field has been lagging in terms of learning, especially in 3D, as annotating affordance accompanies a laborious process due to the numerous variations of human-object interaction. The low availability of affordance data limits the learning in terms of generalization for object categories, and also simplifies the representation of affordance, capturing only a fraction of the affordance. To overcome these challenges, we propose a novel, self-supervised method to generate the 3D affordance examples given only a 3D object, without any manual annotations. The method starts by capturing the 3D object into images and creating 2D affordance images by inserting humans into the image via inpainting diffusion models, where we present the Adaptive Mask algorithm to enable human insertion without altering the original details of the object. The method consequently lifts inserted humans back to 3D to create 3D human-object pairs, where the depth ambiguity is resolved within a depth optimization framework that utilizes pre-generated human postures from multiple viewpoints. We also provide a novel affordance representation defined on relative orientations and proximity between dense human and object points, that can be easily aggregated from any 3D HOI datasets. The proposed representation serves as a primitive that can be manifested to conventional affordance representations via simple transformations, ranging from physically exerted affordances to nonphysical ones. We demonstrate the efficacy of our method and representation by generating the 3D affordance samples and deriving high-quality affordance examples from the representation, including contact, orientation, and spatial occupancies. This paper introduces a novel self-supervised method for generating 3D affordance examples and a new primitive representation for 3D affordance, enabling zero-shot learning of object functionality from 3D objects. Current affordance learning methods struggle with generalization to diverse interactions and limited data availability. This work aims to overcome these challenges by generating affordance data without manual annotation and utilizing a richer representation. The method generates 2D affordance examples by inserting humans into object renderings using inpainting diffusion models with a novel Adaptive Mask algorithm. These 2D examples are then lifted to 3D using human pose estimation and depth optimization. A new affordance representation based on relative orientations and proximity between human and object points is proposed. Adaptive Mask Inpainting preserves original object details during human insertion, leading to more realistic affordance examples. Depth optimization using multiview cues significantly improves the quality of 3D affordance samples. The proposed primitive representation can effectively derive various affordance cues like contact, orientation tendency, and spatial occupancy. The method might exhibit spatial bias inherited from the inpainting diffusion models. Modeling dexterous interactions, like grasping, remains challenging due to limitations in diffusion and 3D human prediction models. affordance learning, zero-shot learning, 3d vision, human-object interaction, diffusion models
2401.12945 Report Lumiere: A Space-Time Diffusion Model for Video Generation Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation. Introduces Lumiere, a text-to-video diffusion model that synthesizes videos with realistic, diverse, and coherent motion by generating the entire temporal duration at once using a Space-Time U-Net (STUNet) architecture. Addresses the limitations of existing video models that rely on temporal super-resolution, which hinders global temporal consistency and realistic motion generation. Employs a STUNet that downsamples in both space and time, processes information in a compact representation, and leverages a pre-trained text-to-image diffusion model. It utilizes Multidiffusion for temporally consistent spatial super-resolution. Achieves state-of-the-art text-to-video generation with superior motion quality. Facilitates various video content creation tasks like image-to-video, video inpainting, and stylized generation. Demonstrates consistent video editing capabilities using off-the-shelf editing methods like SDEdit. Limited to generating single-shot videos without scene transitions. Relies on a pixel-space T2I model, necessitating a spatial super-resolution module. text-to-video generation, diffusion models, space-time u-net, video inpainting, stylized video generation
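A core ingredient of the STUNet is joint down-sampling in space and time, so the full-length clip is processed in a compact representation rather than through distant keyframes plus temporal super-resolution. The toy PyTorch block below illustrates that idea with a single strided 3D convolution; it is an assumption-level sketch of the concept, not Lumiere's architecture.

import torch
import torch.nn as nn


class SpaceTimeDownBlock(nn.Module):
    """Toy joint space-time downsampling block: stride 2 along the time
    axis and both spatial axes, halving every dimension at once."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        return self.act(self.conv(x))


video = torch.randn(1, 8, 16, 64, 64)          # 16 frames at 64x64
print(SpaceTimeDownBlock(8, 16)(video).shape)  # torch.Size([1, 16, 8, 32, 32])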
2401.12915 Report Red Teaming Visual Language Models Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu VLMs (Vision-Language Models) extend the capabilities of LLMs (Large Language Models) to accept multimodal inputs. Since it has been verified that LLMs can be induced to generate harmful or inaccurate content through specific test cases (termed as Red Teaming), how VLMs perform in similar scenarios, especially with their combination of textual and visual inputs, remains a question. To explore this problem, we present a novel red teaming dataset RTVLM, which encompasses 10 subtasks (e.g., image misleading, multi-modal jail-breaking, face fairness, etc) under 4 primary aspects (faithfulness, privacy, safety, fairness). Our RTVLM is the first red-teaming dataset to benchmark current VLMs in terms of these 4 different aspects. Detailed analysis shows that 10 prominent open-sourced VLMs struggle with the red teaming in different degrees and have up to 31% performance gap with GPT-4V. Additionally, we simply apply red teaming alignment to LLaVA-v1.5 with Supervised Fine-tuning (SFT) using RTVLM, and this bolsters the models' performance with 10% in RTVLM test set, 13% in MM-Hal, and without noticeable decline in MM-Bench, overpassing other LLaVA-based models with regular alignment data. This reveals that current open-sourced VLMs still lack red teaming alignment. Our code and datasets will be open-source. This paper introduces RTVLM, the first red teaming dataset for vision-language models (VLMs) focusing on vulnerabilities in image-text understanding. VLMs, combining text and image processing, raise safety and ethical concerns, requiring a systematic benchmark like RTVLM for evaluation and improvement. RTVLM comprises 5,200 image-question pairs across 10 subtasks under faithfulness, privacy, safety, and fairness categories, annotated by humans and GPT-4. Open-sourced VLMs significantly lag behind GPT-4V in handling red teaming scenarios, showing up to a 31% performance gap. VLMs are particularly susceptible to misleading information presented through images. Current VLMs lack adequate alignment for red teaming, highlighting the need for dedicated training data. The current version of RTVLM primarily focuses on English-based question-image pairs. Future work should explore more complex and subtle red teaming scenarios. vision-language models, red teaming, benchmarking, safety, fairness
2401.12902 Report Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning? Cheng Han, Qifan Wang, Yiming Cui, Wenguan Wang, Lifu Huang, Siyuan Qi, Dongfang Liu As the scale of vision models continues to grow, the emergence of Visual Prompt Tuning (VPT) as a parameter-efficient transfer learning technique has gained attention due to its superior performance compared to traditional full-finetuning. However, the conditions favoring VPT (the "when") and the underlying rationale (the "why") remain unclear. In this paper, we conduct a comprehensive analysis across 19 distinct datasets and tasks. To understand the "when" aspect, we identify the scenarios where VPT proves favorable by two dimensions: task objectives and data distributions. We find that VPT is preferable when there is 1) a substantial disparity between the original and the downstream task objectives (e.g., transitioning from classification to counting), or 2) a similarity in data distributions between the two tasks (e.g., both involve natural images). In exploring the "why" dimension, our results indicate VPT's success cannot be attributed solely to overfitting and optimization considerations. The unique way VPT preserves original features and adds parameters appears to be a pivotal factor. Our study provides insights into VPT's mechanisms, and offers guidance for its optimal utilization. This paper investigates when and why visual prompt tuning (VPT) outperforms full finetuning (FT) in transfer learning for vision tasks. Understanding the conditions favoring VPT over traditional FT is crucial for efficient transfer learning in large-scale vision models. The authors conduct experiments on 19 datasets from VTAB-1k, analyzing the impact of task objectives, data distributions, and dataset size on the performance of VPT and FT. VPT is preferred when there's a large disparity between original and downstream task objectives or high similarity in data distributions, especially with limited data. Overfitting doesn't fully explain VPT's success, and additional parameters alone don't guarantee better optimization. Preserving original features while adding task-specific parameters is crucial for VPT's effectiveness. The study focuses on image classification, limiting generalizability to other vision tasks. Further exploration of visual explanations for VPT's advantage is needed. visual prompt tuning, full finetuning, transfer learning, vision models, parameter efficiency
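For reference, shallow visual prompt tuning can be sketched as freezing the transformer backbone and learning only a few prompt tokens plus a linear head, in contrast to full finetuning, which updates every weight. The module below is a minimal illustrative example built on a generic nn.TransformerEncoder, not the paper's code; dimensions and names are assumptions.

import torch
import torch.nn as nn


class ToyVPT(nn.Module):
    """Illustrative shallow visual prompt tuning: freeze a transformer
    encoder and learn only a few prompt tokens plus a linear head."""
    def __init__(self, encoder: nn.TransformerEncoder, dim: int,
                 num_prompts: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # full finetuning would skip this
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim), e.g. patch embeddings from a frozen stem
        b = patch_tokens.shape[0]
        tokens = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        feats = self.encoder(tokens)
        return self.head(feats.mean(dim=1))


enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(64, 4, batch_first=True), 2)
model = ToyVPT(enc, dim=64, num_prompts=8, num_classes=10)
print(model(torch.randn(2, 49, 64)).shape)  # torch.Size([2, 10])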
2401.12900 Report PSAvatar: A Point-based Morphable Shape Model for Real-Time Head Avatar Animation with 3D Gaussian Splatting Zhongyuan Zhao, Zhenyu Bao, Qing Li, Guoping Qiu, Kanglin Liu Despite much progress, achieving real-time high-fidelity head avatar animation is still difficult and existing methods have to trade-off between speed and quality. 3DMM based methods often fail to model non-facial structures such as eyeglasses and hairstyles, while neural implicit models suffer from deformation inflexibility and rendering inefficiency. Although 3D Gaussian has been demonstrated to possess promising capability for geometry representation and radiance field reconstruction, applying 3D Gaussian in head avatar creation remains a major challenge since it is difficult for 3D Gaussian to model the head shape variations caused by changing poses and expressions. In this paper, we introduce PSAvatar, a novel framework for animatable head avatar creation that utilizes discrete geometric primitive to create a parametric morphable shape model and employs 3D Gaussian for fine detail representation and high fidelity rendering. The parametric morphable shape model is a Point-based Morphable Shape Model (PMSM) which uses points instead of meshes for 3D representation to achieve enhanced representation flexibility. The PMSM first converts the FLAME mesh to points by sampling on the surfaces as well as off the meshes to enable the reconstruction of not only surface-like structures but also complex geometries such as eyeglasses and hairstyles. By aligning these points with the head shape in an analysis-by-synthesis manner, the PMSM makes it possible to utilize 3D Gaussian for fine detail representation and appearance modeling, thus enabling the creation of high-fidelity avatars. We show that PSAvatar can reconstruct high-fidelity head avatars of a variety of subjects and the avatars can be animated in real-time ($\ge$ 25 fps at a resolution of 512 $\times$ 512 ). PSAvatar, a novel framework for creating animatable head avatars that combines a point-based morphable shape model (PMSM) with 3D Gaussian representation. Achieving real-time high-fidelity head avatar animation is challenging due to trade-offs between speed and quality in existing methods. This method aims to overcome limitations in modeling non-facial features and improve rendering efficiency. A PMSM is built upon FLAME to model shape variations from pose and expressions, utilizing points for flexible 3D representation. Then, 3D Gaussians are employed for fine detail representation and appearance modeling during rendering. PSAvatar reconstructs high-fidelity head avatars, accurately capturing complex geometries like hair strands and eyeglasses. The method enables real-time animation of the avatars (≥ 25 fps at 512 × 512 resolution). Quantitative and qualitative evaluations demonstrate superior performance compared to state-of-the-art methods like IMAvatar, INSTA, and PointAvatar. The reliance on FLAME for initialization may limit the reconstruction of highly unstructured hairstyles. Future work could explore personalized PMSM initialization to improve representation capability further. head avatar, 3d gaussian, point-based morphable shape model, real-time animation, high-fidelity rendering
2401.12596 Report UniHDA: A Unified and Versatile Framework for Multi-Modal Hybrid Domain Adaptation Hengjia Li, Yang Liu, Yuqi Lin, Zhanwei Zhang, Yibo Zhao, Weihang Pan, Tu Zheng, Zheng Yang, Yuchun Jiang, Boxi Wu, Deng Cai Recently, generative domain adaptation has achieved remarkable progress, enabling us to adapt a pre-trained generator to a new target domain. However, existing methods simply adapt the generator to a single target domain and are limited to a single modality, either text-driven or image-driven. Moreover, they cannot maintain well consistency with the source domain, which impedes the inheritance of the diversity. In this paper, we propose UniHDA, a \textbf{unified} and \textbf{versatile} framework for generative hybrid domain adaptation with multi-modal references from multiple domains. We use CLIP encoder to project multi-modal references into a unified embedding space and then linearly interpolate the direction vectors from multiple target domains to achieve hybrid domain adaptation. To ensure \textbf{consistency} with the source domain, we propose a novel cross-domain spatial structure (CSS) loss that maintains detailed spatial structure information between source and target generator. Experiments show that the adapted generator can synthesise realistic images with various attribute compositions. Additionally, our framework is generator-agnostic and versatile to multiple generators, e.g., StyleGAN, EG3D, and Diffusion Models. This paper introduces UniHDA, a unified and versatile framework for multi-modal hybrid domain adaptation in generative models. Existing methods are limited to adapting to a single target domain and modality, often overfitting to domain-specific attributes and failing to maintain consistency with the source domain. UniHDA addresses these limitations by enabling adaptation to hybrid domains with multi-modal references (text and image) while preserving source domain diversity. UniHDA leverages CLIP to project multi-modal references into a unified embedding space. It then linearly interpolates direction vectors in this space to achieve hybrid domain adaptation. To maintain consistency, UniHDA introduces a cross-domain spatial structure loss that preserves detailed spatial information between source and target generators. UniHDA successfully adapts pre-trained generators (StyleGAN, Diffusion models, EG3D) to hybrid domains, synthesizing realistic images with integrated characteristics from multiple domains. It outperforms existing methods in terms of both generation quality (e.g., CLIP Score, Structural Consistency Score) and efficiency (model size and training time). The proposed cross-domain spatial structure loss is shown to be crucial for maintaining consistency and inheriting diversity from the source domain. UniHDA's reliance on CLIP during training might introduce potential bias for some domains. Future work could focus on eliminating this bias and further exploring multi-modal hybrid domain adaptation. generative domain adaptation, multi-modal adaptation, hybrid domain adaptation, generative models, clip
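The hybrid-domain direction used by UniHDA can be illustrated by encoding references with CLIP, forming per-domain direction vectors relative to the source domain, and linearly interpolating them. The sketch below does this for two text references using the open-source clip package; the prompt strings, the 0.5 weighting, and the text-only setup are illustrative assumptions (the paper also supports image references).

import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)


def text_feat(prompt: str) -> torch.Tensor:
    with torch.no_grad():
        f = model.encode_text(clip.tokenize([prompt]).to(device)).float()
    return f / f.norm(dim=-1, keepdim=True)


# Direction vectors from the source domain toward each target domain.
src = text_feat("a photo of a face")
d1 = text_feat("a photo of a zombie face") - src
d2 = text_feat("a sketch of a face") - src

# Hybrid-domain direction as a convex combination (weights are illustrative).
alpha = 0.5
hybrid_dir = alpha * d1 + (1 - alpha) * d2
print(hybrid_dir.shape)  # torch.Size([1, 512])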
2401.12592 Report RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos Hongchi Xia, Yang Fu, Sifei Liu, Xiaolong Wang We introduce a new RGB-D object dataset captured in the wild called WildRGB-D. Unlike most existing real-world object-centric datasets which only come with RGB capturing, the direct capture of the depth channel allows better 3D annotations and broader downstream applications. WildRGB-D comprises large-scale category-level RGB-D object videos, which are taken using an iPhone to go around the objects in 360 degrees. It contains around 8500 recorded objects and nearly 20000 RGB-D videos across 46 common object categories. These videos are taken with diverse cluttered backgrounds with three setups to cover as many real-world scenarios as possible: (i) a single object in one video; (ii) multiple objects in one video; and (iii) an object with a static hand in one video. The dataset is annotated with object masks, real-world scale camera poses, and reconstructed aggregated point clouds from RGBD videos. We benchmark four tasks with WildRGB-D including novel view synthesis, camera pose estimation, object 6d pose estimation, and object surface reconstruction. Our experiments show that the large-scale capture of RGB-D objects provides a large potential to advance 3D object learning. Our project page is https://wildrgbd.github.io/. This paper introduces WildRGB-D, a novel large-scale RGB-D object dataset captured in the wild, featuring 8500 tabletop objects across 44 categories in 20K videos with 360-degree views. Existing real-world object datasets often lack depth information, limiting 3D annotation accuracy and downstream applications. WildRGB-D addresses this gap by providing real-world scale camera poses, object masks, and point clouds, enabling advancements in 3D object learning. The dataset was created by capturing RGB-D videos of objects using iPhones. Automatic annotations were generated using SLAM algorithms for camera poses and point clouds, and a combination of Grounding-DINO, Segment-Anything, and XMem for object masks. Depth information in WildRGB-D consistently improves novel view synthesis, especially for generalizable NeRF models. WildRGB-D enables learning generalizable camera pose estimation models that perform well on unseen object categories. The dataset facilitates accurate object surface reconstruction, with depth information significantly boosting performance and SDF-based methods showing superior results. Current WildRGB-D lacks object 6D pose annotations, which are planned for future crowdsourcing efforts. Further exploration is needed to address the limitations of translation prediction in camera pose estimation observed in the experiments. rgb-d dataset, object recognition, 3d object learning, novel view synthesis, camera pose estimation
2401.12511 Report Convolutional Initialization for Data-Efficient Vision Transformers Jianqiao Zheng, Xueqian Li, Simon Lucey Training vision transformer networks on small datasets poses challenges. In contrast, convolutional neural networks (CNNs) can achieve state-of-the-art performance by leveraging their architectural inductive bias. In this paper, we investigate whether this inductive bias can be reinterpreted as an initialization bias within a vision transformer network. Our approach is motivated by the finding that random impulse filters can achieve almost comparable performance to learned filters in CNNs. We introduce a novel initialization strategy for transformer networks that can achieve comparable performance to CNNs on small datasets while preserving its architectural flexibility. This paper introduces a novel initialization strategy for Vision Transformer (ViT) networks, drawing inspiration from the effectiveness of random impulse filters in Convolutional Neural Networks (CNNs). ViTs often struggle with small datasets compared to CNNs due to CNNs' inherent architectural inductive bias. This work aims to bridge this performance gap by reinterpreting CNNs' inductive bias as an initialization bias within ViTs, thereby improving their data efficiency. The authors analyze the performance of various spatial mixing filters in ConvMixer, showing that random impulse filters can achieve competitive results. Based on this, they propose initializing the attention maps of ViTs as random impulse convolution filters. They evaluate different ViT model variations and compare their impulse initialization with random and mimetic initializations. Random impulse filters are as effective as learned filters in ConvMixer when only channel mixing is learned, as long as linear independence and redundancy in channels are met. Initializing ViT attention maps as random impulse convolution filters significantly improves performance on small datasets like CIFAR-10, CIFAR-100, and SVHN, surpassing both random and mimetic initializations. The proposed impulse initialization also leads to faster convergence compared to other initialization methods. Determining the optimal scale of self-attention and weight normalization hyperparameters for the initialization process is challenging. Adapting the impulse initialization strategy to the original ViT structure without the proposed modifications requires further investigation. vision transformer, convolutional neural network, initialization, inductive bias, data efficiency
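The random impulse filters referenced above are depthwise filters with a single randomly placed 1 (so each channel simply shifts its input by a fixed offset). A minimal construction is sketched below; the function name is illustrative, and mapping such filters onto ViT attention maps would require the paper's additional initialization steps.

import torch


def random_impulse_filters(channels: int, k: int = 3) -> torch.Tensor:
    """Depthwise k x k filters that are 1 at one random position and 0
    elsewhere, i.e. each channel just shifts its input by an offset."""
    filters = torch.zeros(channels, 1, k, k)
    idx = torch.randint(0, k * k, (channels,))
    filters.view(channels, -1)[torch.arange(channels), idx] = 1.0
    return filters


w = random_impulse_filters(8, k=3)
print(w.sum(dim=(1, 2, 3)))  # each filter sums to exactly 1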
2401.12503 Report Small Language Model Meets with Reinforced Vision Vocabulary Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang Playing Large Vision Language Models (LVLMs) in 2023 is trendy among the AI community. However, the relatively large number of parameters (more than 7B) of popular LVLMs makes it difficult to train and deploy on consumer GPUs, discouraging many researchers with limited resources. Imagine how cool it would be to experience all the features of current LVLMs on an old GTX1080ti (our only game card). Accordingly, we present Vary-toy in this report, a small-size Vary along with Qwen-1.8B as the base ``large'' language model. In Vary-toy, we introduce an improved vision vocabulary, allowing the model to not only possess all features of Vary but also gather more generality. Specifically, we replace negative samples of natural images with positive sample data driven by object detection in the procedure of generating vision vocabulary, more sufficiently utilizing the capacity of the vocabulary network and enabling it to efficiently encode visual information corresponding to natural objects. For experiments, Vary-toy can achieve 65.6% ANLS on DocVQA, 59.1% accuracy on ChartQA, 88.1% accuracy on RefCOCO, and 29% on MMVet. The code will be publicly available on the homepage. This paper introduces Vary-toy, a small-size Large Vision Language Model (LVLM) based on Qwen-1.8B, designed to be trained and deployed on consumer GPUs while retaining the features of larger LVLMs. Existing LVLMs often have a large number of parameters, making them difficult to train and deploy on consumer-grade hardware. Vary-toy addresses this issue by providing a smaller model that can be utilized by researchers with limited resources. The authors propose Vary-tiny+, an improved vision vocabulary generation pipeline that incorporates both dense textual data and natural object location data, enhancing the model's ability to encode visual information. They combine this vocabulary with a 1.8B language model to create Vary-toy. Vary-toy achieves 65.6% ANLS on DocVQA, comparable to the 7B Qwen-VL-chat. It attains 59.1% accuracy on ChartQA, surpassing the 7B mPLUG-DocOwl. Vary-toy achieves 88.1% accuracy on RefCOCO val, on par with the 7B Qwen-VL-chat. The generation ability of the 1.8B model is relatively poor and needs to be strengthened. The authors suggest exploring the potential of replacing CLIP by adding a large amount of weakly labeled image caption data during the vision vocabulary generation process. large vision language models, vision vocabulary, object detection, document ocr, resource-constrained environments
2401.12425 Report The Neglected Tails in Vision-Language Models Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs' large-scale datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts over nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400x less storage and 10,000x less training time! The paper investigates the long-tailed issue in vision-language models (VLMs) and proposes Retrieval-Augmented Learning (REAL) to improve zero-shot recognition. VLMs, despite their strong capabilities, often exhibit biased performance due to the long-tailed concept distribution in their pretraining data. The paper uses LLMs to estimate concept frequency in VLM pretraining data and proposes two REAL variants: REAL-Prompt (uses the most frequent concept synonym in prompts) and REAL-Linear (trains a linear classifier on a balanced subset of retrieved pretraining data). REAL-Prompt outperforms existing prompting methods by simply replacing concept names with their most frequent synonyms. REAL-Linear achieves state-of-the-art zero-shot recognition accuracy, surpassing previous methods while using significantly less storage and training time. REAL improves both head and tail class accuracy and can be combined with existing prompting and retrieval-augmented methods for even better performance. The concept frequency estimation method's precision and recall cannot be accurately evaluated due to the lack of ground-truth annotations in pretraining data. The frequency estimation relies solely on textual captions and may miss visual concepts present in images but not explicitly mentioned in captions. vision-language models, zero-shot learning, long-tail distribution, retrieval-augmented learning, prompt engineering
2401.12233 Report Memorization in Self-Supervised Learning Improves Downstream Generalization Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data, often scraped from the internet. This data can still be sensitive, and empirical evidence suggests that SSL encoders memorize private information of their training data and can disclose it at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by both encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets, we highlight that even though SSL relies on large datasets and strong augmentations, both known in supervised learning as regularization techniques that reduce overfitting, significant fractions of training data points still experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks. This paper proposes SSLMem, a novel framework for defining and analyzing memorization in self-supervised learning (SSL) encoders. Memorization in SSL is unexplored, and existing definitions from supervised learning rely on labels, making them unsuitable for SSL. SSLMem leverages data augmentations and alignment, common elements in SSL, to quantify memorization by comparing alignment differences between encoders trained with and without specific data points. Extensive experiments were conducted across various architectures, SSL methods, and datasets. Significant memorization exists in SSL encoders, especially for atypical data points, similar to observations in supervised learning. SSL methods and architectures exhibit consistent memorization patterns, differing from those in supervised learning. Memorization in SSL encoders is crucial for downstream generalization across diverse tasks and data distributions, highlighting its importance for SSL's success. The theoretical link between memorization and generalization in SSL needs further investigation. Exploring approaches to mitigate privacy risks associated with memorization in SSL is crucial. self-supervised learning, memorization, representation learning, generalization, data augmentation
2401.12217 Report Exploring Simple Open-Vocabulary Semantic Segmentation Zihang Lai Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts. In order to learn such pixel-level alignment, current approaches typically rely on a combination of (i) an image-level VL model (e.g., CLIP), (ii) ground truth masks, and (iii) custom grouping encoders. In this paper, we introduce S-Seg, a novel model that can achieve surprisingly strong performance without depending on any of the above elements. S-Seg leverages pseudo-masks and language to train a MaskFormer, and can be easily trained from publicly available image-text datasets. Contrary to prior works, our model directly trains for pixel-level features and language alignment. Once trained, S-Seg generalizes well to multiple testing datasets without requiring fine-tuning. In addition, S-Seg has the extra benefits of scalability with data and consistent improvement when augmented with self-training. We believe that our simple yet effective approach will serve as a solid baseline for future research. S-Seg is a novel open-vocabulary semantic segmentation model that achieves strong performance without relying on existing large image-level alignment models, manually annotated segmentation labels, or custom grouping encoders. Open-vocabulary semantic segmentation is challenging because it requires assigning accurate semantic labels to each pixel in an image using arbitrary open-vocabulary texts, rather than a fixed set of classes. S-Seg leverages pseudo-masks generated through self-supervised clustering and language embeddings from noisy web texts to train a MaskFormer model. Achieves competitive results on Pascal VOC, Pascal Context, and COCO datasets. Demonstrates scalability with data, showing consistent performance improvements with larger datasets. Benefits significantly from self-training, leading to an average improvement of 5.5% mIoU over three datasets. Performance on segmenting smaller objects could be further improved. Exploration of more advanced pseudo-mask generation techniques could lead to better supervision. open-vocabulary, semantic segmentation, weakly-supervised learning, pseudo-masks, maskformer
2401.12175 Report Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang Reconstructing 3D humans from a single image has been extensively investigated. However, existing approaches often fall short in capturing fine geometry and appearance details, hallucinating occluded parts with plausible details, and achieving generalization across unseen and in-the-wild datasets. We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image. Leveraging the power of the state-of-the-art reconstruction model (i.e., LRM) and generative model (i.e., Stable Diffusion), our method is able to capture humans without any template prior (e.g., SMPL) and effectively enhance occluded parts with rich and realistic details. Our approach first uses a single-view LRM model with an enhanced geometry decoder to obtain the triplane NeRF representation. The novel view renderings from the triplane NeRF provide a strong geometry and color prior, from which we generate photo-realistic details for the occluded parts using a diffusion model. The generated multiple views then enable reconstruction with high-quality geometry and appearance, leading to superior overall performance compared to all existing human reconstruction methods. Presents Human-LRM, a template-free diffusion-guided model for reconstructing detailed 3D humans from single images. Existing methods struggle to capture fine details, hallucinate occluded parts realistically, and generalize across diverse datasets. Human-LRM overcomes these limitations. Uses a three-stage approach: 1) Enhanced LRM predicts coarse geometry and color. 2) Conditional diffusion model generates high-fidelity novel views guided by coarse renderings. 3) Multi-view reconstruction model generates final 3D human using diffused views. Outperforms previous methods in geometry reconstruction on THuman 2.0, Alloy++, and X-Human datasets. Achieves better appearance reconstruction (PSNR, SSIM, LPIPS) than volumetric methods on THuman 2.0. Exhibits superior generalization to challenging poses compared to SMPL-based methods. Fine details like facial and hand features are not perfectly captured. Future work includes exploring more powerful representations or refinement techniques. 3d human reconstruction, single-view reconstruction, diffusion models, neural radiance fields, novel view synthesis
2401.12168 Report SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability. Project website: https://spatial-vlm.github.io/ This paper introduces SpatialVLM, a vision-language model trained on a large-scale synthetic dataset of spatial reasoning visual question answering (VQA) pairs, significantly enhancing the spatial reasoning capabilities of VLMs. Current VLMs struggle with spatial reasoning tasks crucial for real-world applications like robotics and AR. This research aims to bridge this gap by equipping VLMs with human-like spatial understanding. The authors develop a pipeline to generate spatial VQA data by leveraging off-the-shelf computer vision models to extract object-centric contexts, lift 2D images to 3D point clouds, and synthesize diverse qualitative and quantitative spatial reasoning questions and answers. SpatialVLM achieves significantly higher accuracy than baseline VLMs on both qualitative and quantitative spatial reasoning VQA benchmarks. Co-training on spatial VQA data does not degrade the model's performance on general VQA tasks, indicating the potential for VLMs to benefit from such specialized data. The study demonstrates the potential of SpatialVLM for novel applications, including serving as a dense reward annotator in robotics and enabling chain-of-thought reasoning for complex spatial tasks. The accuracy of SpatialVLM's quantitative spatial reasoning is limited by the accuracy of the underlying depth estimation model used in data generation. The current work primarily focuses on direct spatial reasoning, and future research could explore more complex spatial relations and reasoning tasks. vision-language models, spatial reasoning, visual question answering, data augmentation, robotics
2401.12051 Report CloSe: A 3D Clothing Segmentation Dataset and Model Dimitrije Antić, Garvita Tiwari, Batuhan Ozcomlekci, Riccardo Marin, Gerard Pons-Moll 3D clothing modeling and datasets play a crucial role in the entertainment, animation, and digital fashion industries. Existing work often lacks detailed semantic understanding or uses synthetic datasets, lacking realism and personalization. To address this, we first introduce CloSe-D: a novel large-scale dataset containing 3D clothing segmentation of 3167 scans, covering a range of 18 distinct clothing classes. Additionally, we propose CloSe-Net, the first learning-based 3D clothing segmentation model for fine-grained segmentation from colored point clouds. CloSe-Net uses local point features, body-clothing correlation, and an attention module based on garment class and point features, improving performance over baselines and prior work. The proposed attention module enables our model to learn an appearance- and geometry-dependent clothing prior from data. We further validate the efficacy of our approach by successfully segmenting publicly available datasets of people in clothing. We also introduce CloSe-T, a 3D interactive tool for refining segmentation labels. Combining CloSe-Net with CloSe-T in a continual learning setup demonstrates improved generalization on real-world data. Dataset, model, and tool can be found at https://virtualhumans.mpi-inf.mpg.de/close3dv24/. This paper introduces CloSe-D, a large-scale 3D clothing segmentation dataset, and CloSe-Net, a novel 3D clothing segmentation model that predicts fine-grained clothing labels directly from colored point clouds, leveraging human body priors and clothing class-based attention. Existing 3D clothing datasets often lack detailed semantic understanding or realism, hindering the development of robust methods for comprehending digital clothing. This work addresses this gap by providing a high-quality, fine-grained dataset and a novel model that outperforms prior art. CloSe-D is created by manually refining segmentation labels of 3D scans using an interactive tool, CloSe-T. The CloSe-Net model incorporates a point cloud encoder (DGCNN), a canonical body encoder based on SMPL, a clothing encoder with a learnable codebook and attention mechanism, and a segmentation decoder. It's trained with cross-entropy loss and refined using continual learning with user feedback from CloSe-T. CloSe-D contains segmentation labels for ~3000 scans and 18 garment categories, making it the first real-world dataset with such fine-grained detail. CloSe-Net significantly outperforms state-of-the-art part segmentation methods (DGCNN, DeltaConv) and prior 3D clothing segmentation methods (MGN, GIM3D) on various datasets. The interactive tool, CloSe-T, facilitates efficient data annotation and model refinement, enhancing generalization to out-of-distribution datasets. The current method requires garment class as input, which requires preprocessing. Future work could integrate clothing prediction directly into the network. The continual learning framework could be further explored by integrating more recent strategies, such as EWC, to enhance network generalization. 3d clothing segmentation, dataset, deep learning, computer vision, human-computer interaction
2401.11949 Report Feature Denoising Diffusion Model for Blind Image Quality Assessment Xudong Li, Jingyuan Zheng, Runze Hu, Yan Zhang, Ke Li, Yunhang Shen, Xiawu Zheng, Yutao Liu, ShengChuan Zhang, Pingyang Dai, Rongrong Ji Blind Image Quality Assessment (BIQA) aims to evaluate image quality in line with human perception, without reference benchmarks. Currently, deep learning BIQA methods typically depend on using features from high-level tasks for transfer learning. However, the inherent differences between BIQA and these high-level tasks inevitably introduce noise into the quality-aware features. In this paper, we take an initial step towards exploring the diffusion model for feature denoising in BIQA, namely Perceptual Feature Diffusion for IQA (PFD-IQA), which aims to remove noise from quality-aware features. Specifically, (i) we propose a Perceptual Prior Discovery and Aggregation module to establish two auxiliary tasks that discover potential low-level features in images, which are used to aggregate perceptual text conditions for the diffusion model. (ii) We propose a Perceptual Prior-based Feature Refinement strategy, which matches noisy features to predefined denoising trajectories and then performs exact feature denoising based on text conditions. Extensive experiments on eight standard BIQA datasets demonstrate superior performance over state-of-the-art BIQA methods, i.e., achieving PLCC values of 0.935 (vs. 0.905) on KADID and 0.922 (vs. 0.894) on LIVEC. This paper proposes PFD-IQA, a novel BIQA framework that utilizes a diffusion model for the first time to denoise quality-aware features, enhancing their representation for accurate image quality assessment. Existing deep learning BIQA methods often struggle to accurately assess image quality due to noise and excessive focus on high-level features from pre-trained models. This work addresses the need for effective filtering of quality-irrelevant information from features in BIQA. PFD-IQA consists of two key modules: (1) Perceptual Prior Discovery and Aggregation (PDA): Uses auxiliary tasks to discover distortion and quality level priors, then aggregates perceptual text embeddings to guide the diffusion model. (2) Perceptual Prior-based Diffusion Refinement (PDR): Employs teacher pseudo-features to predefine denoising trajectories, matches student features to these trajectories using adaptive noise alignment, and refines features through text-conditioned denoising. PFD-IQA outperforms 14 state-of-the-art BIQA methods on eight benchmark datasets, demonstrating its effectiveness and superiority. Cross-dataset validation shows strong generalization ability of PFD-IQA, achieving best performance on most tested datasets. Qualitative analysis using GradCAM visualizations confirms PFD-IQA's ability to effectively focus on quality degradation areas, unlike competing methods. The reliance on a pre-trained teacher model might limit the model's performance when presented with out-of-distribution images or distortions. Future work could explore incorporating more diverse and fine-grained perceptual priors to further enhance the model's sensitivity to subtle quality degradations. blind image quality assessment (biqa), diffusion models, feature denoising, perceptual priors, image quality
2401.11739 Report EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models Koichi Namekata, Amirmojtaba Sabour, Sanja Fidler, Seung Wook Kim Diffusion models have recently received increasing research attention for their remarkable transfer abilities in semantic segmentation tasks. However, generating fine-grained segmentation masks with diffusion models often requires additional training on annotated datasets, leaving it unclear to what extent pre-trained diffusion models alone understand the semantic relations of their generated images. To address this question, we leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training. The primary difficulty stems from the fact that semantically meaningful feature maps typically exist only in the spatially lower-dimensional layers, which poses a challenge in directly extracting pixel-level semantic relations from these feature maps. To overcome this issue, our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps by exploiting SD's generation process and utilizes them for constructing image-resolution segmentation maps. In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images, indicating the existence of highly accurate pixel-level semantic knowledge in diffusion models. This paper presents an unsupervised image segmentor that generates fine-grained segmentation maps solely from the semantic knowledge of a pre-trained diffusion model (Stable Diffusion). This is important because it investigates the extent to which pre-trained diffusion models understand semantic relations in images, without relying on additional training data like annotations. The method involves generating low-resolution segmentation maps from semantically meaningful feature maps of the diffusion model and then upscaling them to image resolution by identifying semantic correspondences between pixels and low-resolution masks. This is achieved by analyzing how local changes in low-dimensional feature maps affect pixel values in generated images. The generated segmentation maps are well-delineated and capture detailed object parts, demonstrating the existence of highly accurate pixel-level semantic knowledge in diffusion models. The method outperforms existing unsupervised semantic segmentation methods on various datasets, especially when evaluated with a modified protocol that better assesses pixel embedding quality. Integrating the framework with annotation-free open-vocabulary segmentation models significantly improves their performance, highlighting the accuracy of the generated segmentation masks. The framework struggles to segment extremely small objects due to potential information compression in lower-dimensional layers. Feature representations may encode attributes beyond object meanings, leading to over-segmentation of elements like sky and ground. diffusion models, unsupervised semantic segmentation, open-vocabulary segmentation, stable diffusion, semantic knowledge
2401.11708 Report Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster This paper presents RPG (Recaption, Plan, Generate), a training-free text-to-image generation/editing framework that leverages multimodal LLMs (MLLMs) to enhance the compositionality of diffusion models. Existing diffusion models struggle to accurately handle complex prompts involving multiple objects, attributes, and relationships. RPG addresses this limitation by using MLLMs for better prompt understanding and region-wise image generation. RPG uses MLLMs for: (1) **Recaptioning:** Decomposing complex prompts into subprompts with detailed descriptions and analyzing image-prompt discrepancies for editing. (2) **CoT Planning:** Dividing the image into subregions and assigning subprompts to each region. (3) **Complementary Regional Diffusion:** Independently generating image content for each region based on assigned prompts and merging them to create the final image. RPG significantly outperforms state-of-the-art text-to-image models (e.g., DALL-E 3, SDXL) on compositional prompts, achieving better attribute binding, numeric accuracy, and complex relationship representation. The hierarchical regional diffusion in RPG allows for increasingly complex image generation by further dividing subregions. RPG is generalizable and compatible with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). The performance of RPG is dependent on the capabilities of the chosen MLLM and diffusion model. Future work can explore incorporating more complex modalities as input conditions and extending RPG to more real-world applications. text-to-image generation, diffusion models, multimodal llms, compositional generation, image editing
2401.11633 Report Zoom-shot: Fast and Efficient Unsupervised Zero-Shot Transfer of CLIP to Vision Encoders with Multimodal Loss Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, Clinton Fookes The fusion of vision and language has brought about a transformative shift in computer vision through the emergence of Vision-Language Models (VLMs). However, the resource-intensive nature of existing VLMs poses a significant challenge. We need an accessible method for developing the next generation of VLMs. To address this issue, we propose Zoom-shot, a novel method for transferring the zero-shot capabilities of CLIP to any pre-trained vision encoder. We do this by exploiting the multimodal information (i.e. text and image) present in the CLIP latent space through the use of specifically designed multimodal loss functions. These loss functions are (1) cycle-consistency loss and (2) our novel prompt-guided knowledge distillation loss (PG-KD). PG-KD combines the concept of knowledge distillation with CLIP's zero-shot classification, to capture the interactions between text and image features. With our multimodal losses, we train a linear mapping between the CLIP latent space and the latent space of a pre-trained vision encoder, for only a single epoch. Furthermore, Zoom-shot is entirely unsupervised and is trained using unpaired data. We test the zero-shot capabilities of a range of vision encoders augmented as new VLMs, on coarse and fine-grained classification datasets, outperforming the previous state-of-the-art in this problem domain. In our ablations, we find Zoom-shot allows for a trade-off between data and compute during training; and our state-of-the-art results can be obtained by reducing training from 20% to 1% of the ImageNet training data with 20 epochs. All code and models are available on GitHub. Zoom-shot, a novel method that transfers CLIP's zero-shot capabilities to pre-trained vision encoders by training a linear mapping using multimodal loss functions. Developing new VLMs from scratch is computationally expensive. Zoom-shot offers an accessible method for augmenting existing vision encoders with zero-shot capabilities, democratizing VLM development. Zoom-shot uses cycle-consistency loss and a novel prompt-guided knowledge distillation loss (PG-KD) to train a linear mapping between CLIP's latent space and the latent space of a pre-trained vision encoder. Zoom-shot achieves state-of-the-art zero-shot performance on various datasets, outperforming previous methods like Linear Aligner. Zoom-shot training demonstrates a trade-off between compute and data, enabling effective learning even with limited data. The distribution of training images significantly impacts Zoom-shot performance, highlighting the importance of diverse and representative training data. Zoom-shot performance on fine-grained datasets still lags behind CLIP, indicating limitations in covering specific latent space regions. Further investigation into optimizing the text subspace within the source latent space could enhance zero-shot performance when mapping CLIP text features. vision-language models, zero-shot classification, knowledge distillation, cross-modal alignment, clip
2401.11239 Report Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles Yanlong Zang, Han Yang, Jiaxu Miao, Yi Yang Image-based virtual try-on systems, which fit new garments onto human portraits, are gaining research attention. An ideal pipeline should preserve the static features of clothes (like textures and logos) while also generating dynamic elements (e.g., shadows and folds) that adapt to the model's pose and environment. Previous works fail specifically in generating dynamic features, as they trivially preserve the warped in-shop clothes with a predicted alpha mask by composition. To break the dilemma between over-preservation and texture loss, we propose a novel diffusion-based Product-level virtual try-on pipeline, i.e., PLTON, which can preserve the fine details of logos and embroideries while producing realistic clothes shading and wrinkles. The main insights are threefold: 1) Adaptive Dynamic Rendering: we take a pre-trained diffusion model as a generative prior and tame it with image features, training a dynamic extractor from scratch to generate dynamic tokens that preserve high-fidelity semantic information. Due to the strong generative power of the diffusion prior, we can generate realistic clothes shadows and wrinkles. 2) Static Characteristics Transformation: a High-frequency Map (HF-Map) is our fundamental insight for static representation. PLTON first warps in-shop clothes to the target model pose with a traditional warping network, and uses a high-pass filter to extract an HF-Map for preserving static cloth features. The HF-Map is used to generate modulation maps through our static extractor, which are injected into a fixed U-Net to synthesize the final result. 3) To enhance retention, a Two-stage Blended Denoising method is proposed to guide the diffusion process toward correct spatial layout and color. PLTON is finetuned only on our collected small-size try-on dataset. Extensive quantitative and qualitative experiments on 1024×768 datasets demonstrate the superiority of our framework in mimicking real clothes dynamics. This paper introduces PLTON, a novel diffusion-based virtual try-on system that excels at preserving static garment details (textures, logos) while realistically rendering dynamic features (shadows, folds) adapted to pose and environment. Existing virtual try-on methods struggle to balance the preservation of static clothes details with the realistic generation of dynamic features, often leading to unrealistic outputs. PLTON utilizes a two-stage approach: 1) Adaptive Dynamic Rendering extracts dynamic features from input clothes and uses them to guide a pre-trained diffusion model. 2) Static Characteristics Transformation extracts static features from a high-frequency map of the warped garment and injects them into the diffusion model to ensure their preservation. PLTON generates more realistic and visually appealing virtual try-on results than state-of-the-art methods. The method demonstrates robustness to inaccurate human parsing and suboptimal garment warping. PLTON achieves state-of-the-art quantitative results on high-resolution datasets, as measured by FID and LPIPS metrics. The reliance on CLIP input size limits the resolution of processed clothing images, potentially leading to information loss. Future work could explore alternative solutions to address the resolution limitation and further enhance the preservation of fine details. virtual try-on, diffusion models, deep learning, computer vision, fashion
2401.11115 Report MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation Nhat M. Hoang, Kehong Gong, Chuan Guo, Michael Bi Mi Controllable generation of 3D human motions becomes an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, heavily rely on meticulously captured and annotated (e.g., text) high-quality motion corpus, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. Specifically, we separate the denoising objectives of a diffusion model into two stages: obtaining conditional rough motion approximations in the initial $T-T^*$ steps by learning the noisy annotated motions, followed by the unconditional refinement of these preliminary motions during the last $T^*$ steps using unannotated motions. Notably, though learning from two sources of imperfect data, our model does not compromise motion generation quality compared to fully supervised approaches that access gold data. Extensive experiments on several benchmarks demonstrate that our MotionMix, as a versatile framework, consistently achieves state-of-the-art performances on text-to-motion, action-to-motion, and music-to-dance tasks. Project page: https://nhathoang2002.github.io/MotionMix-page/ This paper presents MotionMix, a weakly-supervised diffusion model for controllable 3D human motion generation that leverages both noisy annotated and clean unannotated motion sequences. Current diffusion models for motion generation rely on high-quality annotated motion data, which is expensive and time-consuming to obtain. MotionMix addresses this by effectively utilizing more accessible noisy and unannotated data. MotionMix employs a two-stage denoising process. It first generates rough motion approximations guided by conditions using noisy data, then refines them using clean unannotated data in a later stage. MotionMix achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks despite being trained on weakly-supervised data. The method demonstrates robustness to different noisy data ratios and noise injection levels. Experiments show MotionMix can even surpass the performance of fully supervised models trained on perfect data. Performance on smaller datasets might be slightly worse than fully supervised methods. The denoising pivot, while robust within a range, requires careful tuning for optimal results. motion generation, diffusion models, weakly-supervised learning, 3d human motion, data efficiency
2401.11078 Report UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures Mingyuan Zhou, Rakib Hyder, Ziwei Xuan, Guojun Qi Recent advances in 3D avatar generation have gained significant attention. These breakthroughs aim to produce more realistic animatable avatars, narrowing the gap between virtual and real-world experiences. Most existing works employ a Score Distillation Sampling (SDS) loss, combined with a differentiable renderer and a text condition, to guide a diffusion model in generating 3D avatars. However, SDS often generates oversmoothed results with few facial details, thereby lacking diversity compared with ancestral sampling. On the other hand, other works generate a 3D avatar from a single image, where unwanted lighting effects, perspective views, and inferior image quality make it difficult to reliably reconstruct 3D face meshes with aligned, complete textures. In this paper, we propose a novel 3D avatar generation approach termed UltrAvatar with enhanced fidelity of geometry and superior quality of physically based rendering (PBR) textures without unwanted lighting. To this end, the proposed approach presents a diffuse color extraction model and an authenticity guided texture diffusion model. The former removes the unwanted lighting effects to reveal true diffuse colors so that the generated avatars can be rendered under various lighting conditions. The latter follows two gradient-based guidances for generating PBR textures, rendering diverse face-identity features and details that better align with the 3D mesh geometry. We demonstrate the effectiveness and robustness of the proposed method, outperforming the state-of-the-art methods by a large margin in the experiments. Presents UltrAvatar, a novel 3D avatar generation approach that enhances fidelity of geometry and quality of physically based rendering (PBR) textures without unwanted lighting. Addresses limitations of existing methods that struggle with unwanted lighting effects, perspective views, and inferior image quality in single-image 3D avatar generation. Introduces a diffuse color extraction (DCE) model to remove lighting effects and an authenticity guided texture diffusion model (AGT-DM) to generate high-quality, aligned PBR textures. UltrAvatar generates high-quality, diverse 3D avatars with true colors and sharp details. Outperforms state-of-the-art methods in text-to-avatar and image-to-avatar generation based on FID, KID, and CLIP Score metrics. Demonstrates superior performance in qualitative evaluation using GPT-4V for photo-realism, artifact minimization, and text-prompt alignment. Relies on accurate face parsing for optimal DCE model performance. Limited control over specific facial features during generation. 3d avatar generation, diffuse color extraction, texture diffusion model, photometric guidance, edge guidance
2401.11067 Report Make-A-Shape: a Ten-Million-scale 3D Shape Model Ka-Hei Hui, Aditya Sanghi, Arianna Rampini, Kamal Rahimi Malekshan, Zhengzhe Liu, Hooman Shayani, Chi-Wing Fu Significant progress has been made in training large generative models for natural language and images. Yet, the advancement of 3D generative models is hindered by their substantial resource demands for training, along with inefficient, non-compact, and less expressive representations. This paper introduces Make-A-Shape, a new 3D generative model designed for efficient training on a vast scale, capable of utilizing 10 million publicly available shapes. On the technical side, we first introduce a wavelet-tree representation to compactly encode shapes, formulating a subband coefficient filtering scheme to efficiently exploit coefficient relations. We then make the representation generatable by a diffusion model by devising a subband coefficient packing scheme that lays out the representation in a low-resolution grid. Further, we derive a subband adaptive training strategy so that our model effectively learns to generate both coarse and detail wavelet coefficients. Last, we extend our framework to be controlled by additional input conditions to enable it to generate shapes from assorted modalities, e.g., single/multi-view images, point clouds, and low-resolution voxels. In our extensive set of experiments, we demonstrate various applications, such as unconditional generation, shape completion, and conditional generation on a wide range of modalities. Our approach not only surpasses the state of the art in delivering high-quality results but also efficiently generates shapes within a few seconds, often achieving this in just 2 seconds for most conditions. This paper introduces Make-A-Shape, a novel 3D generative model trained on a massive dataset of over 10 million publicly available 3D shapes. Make-A-Shape can generate high-quality 3D shapes in just 2 seconds. Existing 3D generative models lag behind their 2D counterparts due to high resource demands, inefficient representations, and limitations in capturing shape complexity. Make-A-Shape addresses these challenges, enabling efficient large-scale training and high-quality 3D shape generation. The paper introduces: (i) a compact and expressive wavelet-tree representation for 3D shapes, (ii) a subband coefficient packing scheme for making the representation compatible with diffusion models, and (iii) a subband adaptive training strategy for effectively learning both coarse and detail wavelet coefficients. Make-A-Shape consistently outperforms existing state-of-the-art methods in image-to-3D generation tasks, demonstrating superior quality in terms of both global structure and local details. Make-A-Shape exhibits robustness to the sparsity of input point clouds, generating high-quality shapes even with limited point information. The proposed wavelet-tree representation and adaptive training strategy are crucial for achieving high-quality generation, surpassing baselines that rely on only coarse shape information or simple loss functions. The model currently lacks a mechanism to ensure balanced representation across different object categories, leading to potential biases in the generated shapes. While the model excels in generating geometry, incorporating texture generation without relying on computationally expensive optimizations remains an open challenge. 3d generative model, diffusion model, wavelet representation, large-scale 3d shape generation, conditional 3d shape generation
2401.10891 Report Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything. This paper introduces Depth Anything, a highly practical model for robust monocular depth estimation that leverages the power of large-scale unlabeled data. A foundation model for depth estimation is crucial for various applications like robotics, autonomous driving, and VR, but is currently underexplored due to the difficulty in obtaining large-scale depth datasets. The authors design a data engine to collect and automatically annotate 62M unlabeled images using a pre-trained depth estimation model. They enhance training by challenging the student model with strongly perturbed unlabeled images and by incorporating semantic priors from a frozen DINOv2 encoder. Depth Anything exhibits superior zero-shot depth estimation capability compared to MiDaS v3.1 across six diverse datasets. When fine-tuned with metric depth information, it significantly outperforms previous state-of-the-art methods on NYUv2 and KITTI. The pre-trained encoder demonstrates strong performance in semantic segmentation tasks, highlighting its potential as a multi-task encoder. The current model size is limited to ViT-Large and could benefit from further scaling up to ViT-Giant. Training resolution of 512x512 might be insufficient for real-world applications, and increasing it to 700+ or 1000+ could be beneficial. monocular depth estimation, foundation model, self-supervised learning, semantic segmentation, zero-shot learning
2401.10889 Report Synthesizing Moving People with 3D Control Boyi Li, Jathushan Rajasegaran, Yossi Gandelsman, Alexei A. Efros, Jitendra Malik In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model on texture map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in the 3D pose and, to the input image in terms of visual similarity. In addition to that, the 3D control allows various synthetic camera trajectories to render a person. Our experiments show that our method is resilient in generating prolonged motions and varied challenging and complex poses compared to prior methods. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/. This paper proposes 3DHM, a two-stage diffusion model-based framework for animating a person from a single image to imitate a target 3D motion sequence. The task of animating a person from a single image to imitate another's actions is challenging and requires a deep understanding of human pose, appearance, and clothing. 3DHM uses a two-stage approach: 1) A diffusion model learns to in-fill unseen regions of a partial texture map extracted from the input image. 2) A second diffusion model renders realistic images from intermediate renderings generated using the complete texture map and 3D poses extracted from the target motion sequence. 3DHM outperforms baselines in terms of frame-wise and video-level generation quality metrics (PSNR, SSIM, FID, LPIPS, L1, FID-VID, FVD). 3DHM demonstrates high pose accuracy, preserving the target motion faithfully. 3DHM generalizes well to unseen human images and motions from various sources, including 3D human videos, YouTube videos, and text input. The model currently generates frames independently, potentially leading to temporal inconsistencies. Training on larger and more diverse datasets could further enhance the model's ability to reconstruct detailed textures. human animation, diffusion models, texture inpainting, 3d human pose, motion imitation
2401.10831 Report Understanding Video Transformers via Universal Concept Discovery Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations (concepts) and for ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatio-temporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanisms are universal in video transformers. Finally, we show that VTCD can be used for fine-grained action recognition and video object segmentation. This paper introduces VTCD, the first concept discovery algorithm specifically designed for interpreting video transformer representations. VTCD identifies high-level, spatiotemporal concepts learned by video transformers and quantifies their importance for model predictions. Understanding how video transformers process information is crucial for addressing concerns about transparency, fairness, and potential biases in AI systems, particularly as these models are increasingly deployed in real-world applications. VTCD employs SLIC clustering in the feature space to efficiently generate spatiotemporal tubelet proposals. These tubelets are then clustered using Convex Non-negative Matrix Factorization (CNMF) to identify concepts. To assess concept importance, the authors introduce CRIS, a robust method that masks concepts and measures the impact on model performance. VTCD successfully discovers human-interpretable spatiotemporal concepts, including object tracking, event detection, and positional cues. The authors discover universal 'Rosetta concepts' shared across diverse video transformer models, revealing common mechanisms such as early-layer spatiotemporal basis representations and late-layer object-centric representations. VTCD enables applications like model pruning for improved efficiency and zero-shot video object segmentation by leveraging the discovered concepts. The SLIC compactness hyperparameter in VTCD requires manual tuning for different models. Calculating the Rosetta score becomes computationally demanding as the number of models analyzed increases. concept-based interpretability, video transformers, concept discovery, spatiotemporal reasoning, rosetta concepts
2401.10822 Report ActAnywhere: Subject-Aware Video Background Generation Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complies with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentation as input and an image that describes the desired scene as condition, to produce a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io. This paper introduces a novel task of automated subject-aware video background generation and proposes a diffusion-based model called ActAnywhere to address it. ActAnywhere generates coherent video backgrounds that adapt to the motion of a foreground subject, guided by a single condition frame depicting the desired background. This work offers a valuable tool for the film and VFX industry, enabling faster iteration of ideas and creative storytelling by automatically synthesizing realistic background interactions for acting subjects in diverse scenes, which was previously a tedious and expensive manual process. The model leverages a latent video diffusion model with cross-frame attention for temporal reasoning. It takes as input a foreground subject segmentation sequence, masks, and a single condition frame to generate a composite video with a hallucinated background. ActAnywhere generates high-quality videos with realistic subject-background interactions, camera motions, lighting, and shadows. The model demonstrates strong generalization capability, extending to out-of-distribution data including non-human subjects. ActAnywhere exhibits emergent capabilities for general video inpainting and robustness to inaccurate foreground segmentation masks. The model might fail to correct inaccurate details present in the provided condition frame. Further exploration is needed to address potential biases present in the training data and prevent malicious use. video generation, diffusion models, video editing, subject-aware synthesis, foreground-background interaction
2401.10404 Report Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution Xin Yuan, Jinoo Baek, Keyang Xu, Omer Tov, Hongliang Fei We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the readily learned capacity of a pixel-level image diffusion model to capture spatial information for video generation. To accomplish this goal, we design an efficient architecture by inflating the weights of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset demonstrates that our approach is able to perform text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also present visualizations in video format at https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing . Proposes "Inflation with Diffusion", an efficient tuning approach for diffusion-based text-to-video super-resolution that inflates a pretrained text-to-image SR model for video generation. Addresses the difficulty of achieving temporally consistent text-to-video super-resolution without the cost of training a video diffusion model from scratch. Inflates the weights of a pixel-level text-to-image SR diffusion model into a video generation framework and adds a temporal adapter to ensure coherence across frames, comparing different tuning strategies for their cost-quality trade-offs. Performs text-to-video SR generation with good visual quality and temporal consistency on the Shutterstock video dataset. Reports trade-offs between computational cost and super-resolution quality across tuning approaches. Offers a computationally efficient alternative to training video SR models from scratch. Limited diversity in generated video content due to reliance on pretrained models. Further exploration needed for handling more complex text prompts and video content. text-to-video generation, super-resolution, diffusion models, temporal consistency, video generation
2401.10229 Report OMG-Seg: Is One Model Good Enough For All Segmentation? Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg. OMG-Seg is proposed as a unified segmentation model capable of handling various tasks, including image and video segmentation, open-vocabulary settings, and interactive segmentation, all within a single framework, significantly reducing computational and parameter overhead. A unified model eliminates task-specific design constraints and allows for knowledge sharing across different segmentation tasks, offering a more versatile and efficient approach. OMG-Seg utilizes a frozen CLIP visual encoder as the backbone and a shared encoder-decoder transformer architecture with task-specific queries. It employs unified query representation for image/tube masks, labels, IDs, and visual prompts, enabling diverse segmentation tasks within one model. OMG-Seg achieves competitive performance on image, video, open-vocabulary, and interactive segmentation settings across eight diverse datasets. Joint co-training on multiple datasets leads to improved performance, particularly in video segmentation tasks, and significantly reduces model parameters. The shared decoder design in OMG-Seg proves to be efficient as it aligns optimization objectives, benefiting video datasets with short clips. The frozen architecture, while enabling open-vocabulary capabilities, may limit performance on specific tasks. Future work involves scaling up the model, incorporating more datasets, and potentially adding a text path for language-driven segmentation tasks. segmentation, unified model, open vocabulary, interactive segmentation, video segmentation
2401.10228 Report RAP-SAM: Towards Real-Time All-Purpose Segment Anything Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang Advanced by transformer architecture, vision foundation models (VFMs) achieve remarkable progress in performance and generalization ability. Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation. However, most VFMs cannot run in realtime, which makes it difficult to transfer them into several products. On the other hand, current real-time segmentation mainly has one purpose, such as semantic segmentation on the driving scene. We argue that diverse outputs are needed for real applications. Thus, this work explores a new real-time segmentation setting, named all-purpose segmentation in real-time, to transfer VFMs in real-time deployment. It contains three different tasks, including interactive segmentation, panoptic segmentation, and video segmentation. We aim to use one model to achieve the above tasks in real-time. We first benchmark several strong baselines. Then, we present Real-Time All Purpose SAM (RAP-SAM). It contains an efficient encoder and an efficient decoupled decoder to perform prompt-driven decoding. Moreover, we further explore different training strategies and tuning methods to boost co-training performance further. Our code and model are available at https://github.com/xushilin1/RAP-SAM/. This paper introduces 'all-purpose segmentation', a new real-time segmentation setting encompassing interactive, panoptic, and video segmentation within a single model. Current vision foundation models often lack real-time capability, and existing real-time segmentation methods focus on single applications, limiting their practicality. All-purpose real-time segmentation addresses these limitations, enabling diverse applications like real-time editing, tracking, and segmentation. The paper proposes 'RAP-SAM' (Real-Time All-Purpose SAM), featuring an efficient encoder, a unified decoder with pooling-based dynamic convolution, and lightweight decoupled adapters to balance performance across tasks. It leverages joint co-training on COCO and YouTube-VIS datasets. RAP-SAM achieves the best speed and accuracy trade-off among benchmarked real-time methods on all three segmentation tasks. Joint co-training with image and video data improves video instance segmentation performance. The proposed asymmetric adapter design effectively balances performance for object queries and prompt queries. Performance balance across image, video, and interactive segmentation requires further improvement. Future work includes model acceleration for edge deployment, exploring diverse knowledge distillation techniques, and incorporating various visual prompts like mask prompts. real-time segmentation, all-purpose segmentation, interactive segmentation, panoptic segmentation, video instance segmentation
2401.10227 Report A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting Wouter Van Gansbeke, Bert De Brabandere Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to handle the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture which omits these complexities. Our training process consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. The use of a generative model unlocks the exploration of mask completion or inpainting, which has applications in interactive segmentation. The experimental validation yields promising results for both panoptic segmentation and mask inpainting. While not setting a new state-of-the-art, our model's simplicity, generality, and mask completion capability are desirable properties. This paper presents LDMSeg, a novel approach for panoptic segmentation and mask inpainting using latent diffusion models, building upon Stable Diffusion. The proposed method simplifies panoptic segmentation by avoiding specialized object detection modules, complex loss functions, and ad-hoc post-processing. LDMSeg employs a two-stage process: (1) training a shallow autoencoder to project segmentation masks to a latent space and (2) training a diffusion model conditioned on image latents for image-guided mask generation. LDMSeg effectively generates non-overlapping instance masks, achieving promising panoptic segmentation results. The model demonstrates inherent mask inpainting capabilities, successfully completing sparse segmentation masks. LDMSeg outperforms some general-purpose frameworks while being simpler and more computationally efficient. The model may miss small objects due to the latent space projection. Inference is slower compared to specialized segmentation models due to the diffusion process. Future work includes exploring higher resolution latents and open-vocabulary detection. panoptic segmentation, mask inpainting, latent diffusion models, generative models, stable diffusion
2401.10226 Report Towards Language-Driven Video Inpainting via Multimodal Large Language Models Jianzong Wu, Xiangtai Li, Chenyang Si, Shangchen Zhou, Jingkang Yang, Jiangning Zhang, Yining Li, Kai Chen, Yunhai Tong, Ziwei Liu, Chen Change Loy We introduce a new task -- language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We will make datasets, code, and models publicly available. This paper introduces a novel task: language-driven video inpainting, aiming to replace manual mask annotations with natural language instructions. Current video inpainting methods heavily rely on tedious and time-consuming manual mask annotations, which limits their applicability. This new task leverages the flexibility and richness of natural language for more effective video inpainting. A new dataset, ROVI, is created containing video, removal expression, and inpainted video triplets. A diffusion-based model (LGVI) is proposed, incorporating temporal attention and a mask decoder. An MLLM-enhanced version, LGVI-I, handles interactive inpainting requests. LGVI outperforms existing language-driven image editing methods and achieves comparable results to multi-stage video inpainting methods on the referring video inpainting task. LGVI-I, enhanced with an MLLM, shows superior performance on the interactive video inpainting task, effectively handling complex chat-style user requests. The proposed method demonstrates robustness in handling challenging scenarios, such as inpainting multiple or non-existent objects. The model may struggle with ambiguous language descriptions or complex scenes where precise object identification is difficult. Real-time processing and model scalability for diverse video types and languages are areas for future improvement. video inpainting, language-driven editing, multimodal learning, diffusion models, large language models
2401.10222 Report Supervised Fine-tuning in turn Improves Visual Foundation Models Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generation of vision foundation models after their pretraining. Thus a two-stage method ViSFT (Vision SFT) is proposed to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on some in-domain tasks and then tested on out-of-domain benchmarks. With updating using ViSFT on 8 V100 GPUs in less than 2 days, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks including vision and vision-linguistic scenarios. This paper proposes ViSFT (Vision Supervised Fine-Tuning), a two-stage method to enhance the representation and generalization of vision foundation models, drawing inspiration from SFT in NLP (e.g., instruction tuning). Existing methods like RegionCLIP face scalability issues due to lack of large-scale region-level datasets. ViSFT addresses this by leveraging fine-grained SFT to improve vision models after pretraining. ViSFT uses a two-stage process: 1) Independently train in-domain task heads (detection, segmentation, captioning) on COCO with frozen backbone. 2) Introduce LoRA to the backbone, freeze task heads, and jointly train on all tasks, transferring knowledge to LoRA. ViSFT improves optical character recognition accuracy by at least 2.5 points. Grounded object identification exhibits an enhancement ranging from 0.3 to 0.6 points, especially for smaller models. ViSFT enhances zero-shot image classification, few-shot learning, image-text retrieval, and visual question answering. The impact of incorporating more diverse datasets with fine-grained annotations, beyond COCO, remains unexplored. The study primarily focuses on the vision transformer within the CLIP model, and the impact on the text encoder is left for future research. vision foundation models, supervised fine-tuning, image-text representation learning, multi-task training, lora
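The second ViSFT stage (frozen backbone and task heads, trainable low-rank updates in the backbone) can be pictured with a minimal pure-PyTorch LoRA wrapper; the layer names, rank, and scaling below are illustrative choices, not the paper's code.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update (W + scale * B A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # stage 2: backbone weights stay frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

# Example: wrap the qkv projection of one (hypothetical) transformer block.
block_qkv = nn.Linear(768, 3 * 768)
block_qkv = LoRALinear(block_qkv, rank=8)
trainable = [p for p in block_qkv.parameters() if p.requires_grad]   # only the LoRA A/B matrices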
2401.10208 Report MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, Jifeng Dai Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}. This paper proposes MM-Interleaved, an end-to-end generative model for interleaved image-text data that addresses the limitation of fixed visual tokens by using a multi-scale and multi-image feature synchronizer module (MMFS). Developing generative models for interleaved image-text data (e.g. news, blogs) is important because this format is ubiquitous online and necessitates models to comprehend interleaved sequences to generate images and text. MM-Interleaved leverages a Visual Foundation Model (VFM) for image tokenization, a Large Language Model (LLM) for multi-modal context feature extraction enhanced by MMFS, and a Diffusion Model (DM) for image generation conditioned on LLM outputs and fine-grained features from MMFS. MM-Interleaved achieves state-of-the-art results on various multi-modal comprehension benchmarks including image captioning, visual question answering, and visual dialogue. The model demonstrates competitive zero-shot text-to-image generation capabilities compared to existing methods. MM-Interleaved effectively handles segmentation-to-image translation and visual storytelling, showcasing its ability to generate realistic images with precise alignment and maintain semantic consistency in generated image sequences. The quality and quantity of publicly available interleaved image-text data are currently limited. The model may encounter challenges related to hallucination and potential bias in generated content due to noise in the training data. interleaved image-text generation, multi-modal feature synchronizer, large language models, diffusion models, visual storytelling
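A rough picture of what a multi-image feature synchronizer does, with plain multi-head cross-attention standing in for the paper's MMFS module; the dimensions and residual fusion are assumptions for the sketch.

import torch
import torch.nn as nn

class FeatureSyncSketch(nn.Module):
    """Cross-attention stand-in: LLM token states attend to multi-scale image features."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, llm_states, multi_scale_feats):
        # multi_scale_feats: list of (B, N_i, d_model) patch features from several scales / images
        kv = torch.cat(multi_scale_feats, dim=1)
        out, _ = self.attn(query=llm_states, key=kv, value=kv)
        return llm_states + out                # residual injection of fine-grained visual detail

sync = FeatureSyncSketch()
llm_states = torch.randn(2, 77, 1024)
feats = [torch.randn(2, 256, 1024), torch.randn(2, 1024, 1024)]
print(sync(llm_states, feats).shape)           # torch.Size([2, 77, 1024])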
2401.10171 Report SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild Andreas Engelhardt, Amit Raj, Mark Boss, Yunzhi Zhang, Abhishek Kar, Yuanzhen Li, Deqing Sun, Ricardo Martin Brualla, Jonathan T. Barron, Hendrik P. A. Lensch, Varun Jampani We present SHINOBI, an end-to-end framework for the reconstruction of shape, material, and illumination from object images captured with varying lighting, pose, and background. Inverse rendering of an object based on unconstrained image collections is a long-standing challenge in computer vision and graphics and requires a joint optimization over shape, radiance, and pose. We show that an implicit shape representation based on a multi-resolution hash encoding enables faster and robust shape reconstruction with joint camera alignment optimization that outperforms prior work. Further, to enable the editing of illumination and object reflectance (i.e. material) we jointly optimize BRDF and illumination together with the object's shape. Our method is class-agnostic and works on in-the-wild image collections of objects to produce relightable 3D assets for several use cases such as AR/VR, movies, games, etc. Project page: https://shinobi.aengelhardt.com Video: https://www.youtube.com/watch?v=iFENQ6AcYd8&feature=youtu.be SHINOBI reconstructs shape, material, and illumination from in-the-wild object images with varying lighting, pose, and background. It enables the creation of relightable 3D assets from casually captured images for applications in AR/VR, movies, and games. Uses a multi-resolution hash encoding for shape representation, jointly optimizes camera parameters, and incorporates BRDF optimization with a per-view importance weighting scheme. Outperforms prior work in view synthesis and relighting quality on the NAVI in-the-wild dataset. Achieves faster optimization, reducing runtime by 3 times compared to SAMURAI. Enables high-frequency detail reconstruction in shape, material, and illumination. May struggle with highly symmetric objects and extremely specular materials. High-frequency detail reconstruction can be limited in some regions due to misaligned views and illumination representation. 3d reconstruction, neural rendering, inverse rendering, camera pose estimation, brdf estimation
2401.10166 Report VMamba: Visual State Space Model Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have long been the predominant backbone networks for visual representation learning. While ViTs have recently gained prominence over CNNs due to their superior fitting capabilities, their scalability is largely constrained by the quadratic complexity of attention computation. Inspired by the capability of Mamba in efficiently modeling long sequences, we propose VMamba, a generic vision backbone model aiming to reduce the computational complexity to linear while retaining ViTs' advantageous features. To enhance VMamba's adaptability in processing vision data, we introduce the Cross-Scan Module (CSM) to enable 1D selective scanning in 2D image space with global receptive fields. Additionally, we make further improvements in implementation details and architectural designs to enhance VMamba's performance and boost its inference speed. Extensive experimental results demonstrate VMamba's promising performance across various visual perception tasks, highlighting its pronounced advantages in input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba. Proposes VMamba, a novel vision backbone network based on State Space Models (SSMs) for efficient visual representation learning, aiming to achieve linear computational complexity while retaining the global receptive fields and dynamic weights of Vision Transformers (ViTs). Addresses the limitations of ViTs, whose quadratic complexity hinders scalability, and CNNs, which lack global receptive fields and dynamic weights, by introducing an alternative foundation model for efficient visual representation learning. Introduces the Cross-Scan Module (CSM) to adapt the 1D selective scanning of S6 models to 2D vision data, enabling global receptive fields without increasing complexity. Improves VMamba's efficiency through optimized implementation details and architectural design. Achieves superior or competitive performance on ImageNet-1K classification compared to benchmark models like ResNet, ViT, and Swin, while maintaining linear computational complexity. Demonstrates strong performance in downstream tasks, achieving competitive results on COCO object detection and ADE20K semantic segmentation. Exhibits remarkable input scaling efficiency, showing linear growth in FLOPs with increasing input size, unlike ViT-based models that show quadratic growth. The bidirectional scanning pattern exhibits instability during training, requiring further investigation and potential solutions. Future work includes exploring larger-scale VMamba models and extending its application to more diverse vision tasks. vision backbone, state space models, linear complexity, global receptive field, cross-scan module
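The Cross-Scan Module's four traversal orders can be sketched directly with tensor reshapes; the merge rule below (summing the four inverted scans) is a simplification of the paper's design, and the selective-scan branches themselves are omitted.

import torch

def cross_scan(x):
    """Turn a (B, C, H, W) feature map into four 1-D scan sequences:
    row-major, column-major, and their reversals (a sketch of CSM's traversal orders)."""
    B, C, H, W = x.shape
    row = x.flatten(2)                               # (B, C, H*W), row-major
    col = x.transpose(2, 3).flatten(2)               # column-major
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)   # (B, 4, C, L)

def cross_merge(seqs, H, W):
    """Invert the four traversals and sum them back onto the 2-D grid."""
    B, _, C, L = seqs.shape
    row, col, row_r, col_r = seqs.unbind(1)
    out = (row + row_r.flip(-1)).view(B, C, H, W)
    col_sum = (col + col_r.flip(-1)).view(B, C, W, H).transpose(2, 3)
    return out + col_sum

x = torch.randn(1, 96, 14, 14)
y = cross_merge(cross_scan(x), 14, 14)   # each selective-scan branch would run between these two calls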
2401.10150 Report Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation Changgu Chen, Junwei Shu, Lianggangxu Chen, Gaoqi He, Changbo Wang, Yang Li Recent large-scale pre-trained diffusion models have demonstrated a powerful generative ability to produce high-quality videos from detailed text descriptions. However, exerting control over the motion of objects in videos generated by any video diffusion model is a challenging problem. In this paper, we propose a novel zero-shot moving object trajectory control framework, Motion-Zero, to enable a bounding-box-trajectories-controlled text-to-video diffusion model. To this end, an initial noise prior module is designed to provide a position-based prior to improve the stability of the appearance of the moving object and the accuracy of position. In addition, based on the attention map of the U-net, spatial constraints are directly applied to the denoising process of diffusion models, which further ensures the positional and spatial consistency of moving objects during the inference. Furthermore, temporal consistency is guaranteed with a proposed shift temporal attention mechanism. Our method can be flexibly applied to various state-of-the-art video diffusion models without any training process. Extensive experiments demonstrate our proposed method can control the motion trajectories of objects and generate high-quality videos. This paper introduces Motion-Zero, a zero-shot framework that allows for bounding-box-trajectory control of object motion within pre-trained video diffusion models. This addresses the challenge of precisely manipulating object trajectories in generated videos, enabling more control over video generation without extensive training or specialized datasets. The framework uses an Initial Noise Prior Module (INPM) for position-based prior, applies Spatial Constraints (SC) through attention maps for position accuracy, and employs a Shift Temporal Attention Mechanism (STAM) to maintain motion continuity. Motion-Zero allows for precise control over object trajectories using user-defined bounding boxes. The framework maintains the generative quality of the underlying pre-trained video diffusion models. Quantitative and qualitative results, including user studies, demonstrate that Motion-Zero outperforms baseline methods and rivals pre-trained models like MotionCtrl in control capabilities. The generative performance of Motion-Zero is limited by the capabilities of the underlying pre-trained video diffusion model. Currently, the trajectory control lacks semantic interaction with the generated video scene, requiring user-defined paths instead of automated navigation based on scene understanding and prompts. video diffusion models, motion trajectory control, zero-shot learning, text-to-video generation, controllable video synthesis
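One way to picture the spatial constraints is a loss on the U-Net cross-attention map that rewards attention mass inside the user's bounding box; this is a simplified stand-in, not the paper's exact constraint.

import torch

def bbox_attention_loss(attn_map, bbox):
    """Encourage the object token's cross-attention mass to concentrate inside the target box.
    attn_map: (H, W) attention of the object's text token at one denoising step;
    bbox: (x0, y0, x1, y1) in attention-map coordinates. Simplified illustration only."""
    x0, y0, x1, y1 = bbox
    inside = attn_map[y0:y1, x0:x1].sum()
    return 1.0 - inside / attn_map.sum().clamp(min=1e-8)   # 0 when all mass is inside the box

attn = torch.rand(32, 32)
loss = bbox_attention_loss(attn, (4, 4, 16, 16))   # its gradient would nudge the latents per step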
2401.10061 Report DiffusionGPT: LLM-Driven Text-to-Image Generation System Jie Qin, Jie Wu, Weifeng Chen, Yuxi Ren, Huixia Li, Hefeng Wu, Xuefeng Xiao, Rui Wang, Shilei Wen Diffusion models have opened up new avenues for the field of image generation, resulting in the proliferation of high-quality models shared on open-source platforms. However, a major challenge persists: current text-to-image systems are often unable to handle diverse inputs, or are limited to single-model results. Current unified attempts often fall into two orthogonal aspects: i) parse Diverse Prompts in input stage; ii) activate expert model to output. To combine the best of both worlds, we propose DiffusionGPT, which leverages Large Language Models (LLM) to offer a unified generation system capable of seamlessly accommodating various types of prompts and integrating domain-expert models. DiffusionGPT constructs domain-specific Trees for various generative models based on prior knowledge. When provided with an input, the LLM parses the prompt and employs the Trees-of-Thought to guide the selection of an appropriate model, thereby relaxing input constraints and ensuring exceptional performance across diverse domains. Moreover, we introduce Advantage Databases, where the Tree-of-Thought is enriched with human feedback, aligning the model selection process with human preferences. Through extensive experiments and comparisons, we demonstrate the effectiveness of DiffusionGPT, showcasing its potential for pushing the boundaries of image synthesis in diverse domains. DiffusionGPT, a novel unified image generation system that leverages Large Language Models (LLMs) to handle diverse prompts and integrate domain-expert models for superior image synthesis. Existing text-to-image systems struggle with diverse prompt types and often provide limited results due to reliance on single models with varying domain expertise. DiffusionGPT utilizes LLMs to parse prompts, build and search domain-specific model trees (Tree-of-Thought), select optimal models based on human feedback (Advantage Databases), and execute image generation with prompt extension. DiffusionGPT generates more realistic and semantically aligned images compared to baseline models like SD1.5 and SDXL. Quantitative evaluation using image-reward and aesthetic scores demonstrates significant improvements over baseline models. User studies confirm a strong preference for images generated by DiffusionGPT, highlighting its superior quality and alignment with user intent. Limited feedback incorporation in LLM optimization for prompt parsing and model selection. Dependence on a finite set of model candidates, limiting the diversity and quality of potential outputs. image generation, diffusion models, large language models, prompt engineering, human feedback
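The Tree-of-Thought model selection can be pictured as a guided walk down a category tree, with the LLM choosing a branch at each level; the tree contents and the ask_llm helper below are hypothetical placeholders, not DiffusionGPT's actual prompts or API.

# Hypothetical sketch of Tree-of-Thought model routing; `ask_llm` is a placeholder, not a real API.
MODEL_TREE = {
    "photorealistic": {"people": "realistic-portrait-model", "landscape": "realistic-scene-model"},
    "anime":          {"people": "anime-character-model",    "landscape": "anime-scene-model"},
}

def ask_llm(question: str, options: list) -> str:
    """Placeholder for an LLM call that picks exactly one of `options` for the given prompt."""
    raise NotImplementedError

def route_prompt(prompt: str) -> str:
    """Walk the category tree until a leaf (an expert diffusion model name) is reached."""
    node = MODEL_TREE
    while isinstance(node, dict):
        choice = ask_llm(f"Which category best matches: '{prompt}'?", list(node.keys()))
        node = node[choice]
    return node   # name of the expert model to invoke, optionally re-ranked by human-feedback scores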
2401.10039 Report GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition Guangzhao Dai, Xiangbo Shu, Wenhao Wu, Rui Yan, Jiachao Zhang Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in Zero-Shot Egocentric Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this paper, we introduce GPT4Ego, a straightforward yet remarkably potent VLM framework for ZS-EAR, designed to enhance the fine-grained alignment of concept and description between vision and language. Extensive experiments demonstrate GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4%), EGTEA (39.6%, +5.5%), and CharadesEgo (31.5%, +2.6%). This paper introduces GPT4Ego, a novel Vision-Language Model (VLM) framework for Zero-Shot Egocentric Action Recognition (ZS-EAR) that prioritizes fine-grained alignment between vision and language. Existing VLM-based ZS-EAR approaches treat the task as a coarse-grained global video-text matching, leading to suboptimal alignment and limiting performance. GPT4Ego leverages two key components: 1) Ego-oriented Text Prompting (EgoTP) enhances text-contextual semantics by using ChatGPT to generate diverse textual descriptions from class names. 2) Ego-oriented Visual Parsing (EgoVP) utilizes SAM to parse refined visual concepts from video frames, enhancing vision-contextual semantics. GPT4Ego significantly outperforms state-of-the-art methods on EK100, EGTEA, and CharadesEgo benchmarks. Both EgoTP and EgoVP individually contribute to performance gains, with their combination leading to the most significant improvement. GPT4Ego effectively captures fine-grained semantic alignment between vision and language, as demonstrated by qualitative analysis. The current implementation relies on external models like ChatGPT and SAM, limiting its computational efficiency. Future work could explore joint training of the VLM with the text generation and visual parsing modules for improved synergy. egocentric action recognition, zero-shot learning, vision-language learning, chatgpt, segment anything model (sam)
2401.10005 Report Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation Kohei Uehara, Nabarun Goswami, Hanqin Wang, Toshiaki Baba, Kohtaro Tanaka, Tomohiro Hashimoto, Kai Wang, Rei Ito, Takagi Naoya, Ryo Umagami, Yingyi Wen, Tanachai Anakewat, Tatsuya Harada The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of Large Multi-Modal Models (LMMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to imbue an LMM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. Our method comprises the development of a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. We designed an LMM, which has high capabilities on region awareness to address the intricate requirements of image-text alignment. The model undergoes a three-stage training phase, starting with large-scale image-text alignment using a large-scale datasets, followed by instruction tuning, and fine-tuning with a focus on chain-of-thought reasoning. The results demonstrate a stride toward a more robust, accurate, and interpretable LMM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input. This paper introduces a novel approach to enhance Large Multi-Modal Models (LMMs) by integrating an explicit Chain-of-Reasoning (CoR) process and the ability to generate clarifying questions during reasoning, aiming for more reliable and interpretable visual content interpretation. Current LMMs often suffer from hallucination, producing outputs not aligned with the input, and lack the ability to explain their reasoning. This work aims to address these limitations by incorporating explicit reasoning and question-asking capabilities. The authors create a new dataset containing reasoning steps with uncertainty scores, prompting LLM-generated questions when uncertainty is high. They then develop an LMM with improved region awareness, trained in three stages: image-text alignment, instruction tuning, and CoR fine-tuning. The model successfully generates explicit reasoning steps and asks relevant questions when encountering uncertainty. Integrating question-asking significantly improves performance on knowledge-based VQA tasks like OK-VQA. Current LMMs still struggle with generating perfectly consistent and coherent long reasoning steps, highlighting an area for future research. The model's performance on long reasoning steps needs further improvement to match the accuracy of direct answer generation. Future work could explore alternative methods for acquiring external knowledge beyond relying solely on LLMs like GPT-4. large multi-modal models, chain-of-reasoning, question generation, visual reasoning, explainable ai
2401.09985 Report WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens Xiaofeng Wang, Zheng Zhu, Guan Huang, Boyuan Wang, Xinze Chen, Jiwen Lu World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, limiting their ability to capture the complexity of general world dynamic environments. Therefore, we introduce WorldDreamer, a pioneering world model to foster a comprehensive comprehension of general world physics and motions, which significantly enhances the capabilities of video generation. Drawing inspiration from the success of large language models, WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge. This is achieved by mapping visual inputs to discrete tokens and predicting the masked ones. During this process, we incorporate multi-modal prompts to facilitate interaction within the world model. Our experiments show that WorldDreamer excels in generating videos across different scenarios, including natural scenes and driving environments. WorldDreamer showcases versatility in executing tasks such as text-to-video conversion, image-to-video synthesis, and video editing. These results underscore WorldDreamer's effectiveness in capturing dynamic elements within diverse general world environments. Introduces WorldDreamer, the first general world model for video generation that effectively learns general world motion and physics from visual data. Existing world models are limited to specific scenarios, hindering their ability to capture the complexity of general world dynamics crucial for versatile video generation. WorldDreamer leverages VQGAN for visual tokenization and employs a novel Spatial Temporal Patchwise Transformer (STPT) to predict masked visual tokens. Multi-modal prompts, including text and action embeddings, guide the generation process. WorldDreamer excels in generating high-fidelity videos across diverse scenarios, including natural scenes and driving environments. It exhibits versatility in various tasks such as text-to-video conversion, image-to-video synthesis, video editing, and action-to-video generation. The model demonstrates significant speed advantages over diffusion-based methods, achieving video generation with considerably fewer iterations. The model currently operates at a resolution of 256x256, leaving room for improvement in generating higher-resolution videos. Further exploration of more intricate masking strategies could potentially enhance the model's ability to capture and generate complex motions. video generation, world models, vision transformer, multi-modal learning, generative ai
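The masked visual-token objective is essentially MaskGIT-style training over VQGAN token indices; the sketch below assumes a generic transformer callable and a random per-sample mask ratio, and is not the paper's STPT code.

import torch
import torch.nn.functional as F

def masked_token_step(transformer, visual_tokens, prompt_embed, mask_id):
    """One masked-visual-token prediction step (MaskGIT-style sketch, not the paper's STPT code).
    visual_tokens: (B, L) discrete indices produced by a VQGAN tokenizer;
    transformer(inputs, prompt_embed) is assumed to return per-position logits (B, L, vocab)."""
    B, L = visual_tokens.shape
    mask_ratio = torch.rand(B, 1, device=visual_tokens.device)          # random ratio per sample
    mask = torch.rand(B, L, device=visual_tokens.device) < mask_ratio   # True where tokens are hidden
    inputs = visual_tokens.masked_fill(mask, mask_id)
    logits = transformer(inputs, prompt_embed)
    return F.cross_entropy(logits[mask], visual_tokens[mask])           # loss only on masked positions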
2401.09865 Report Improving fine-grained understanding in image-text pre-training Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, Jovana Mitrović We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text embeddings to learn representations that simultaneously encode global and local information. We thoroughly evaluate our proposed method and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g. classification, as well as region-level tasks relying on fine-grained information, e.g. retrieval, object detection, and segmentation. Moreover, SPARC improves model faithfulness and captioning in foundational vision-language models. The paper proposes SPARC, a new objective for multimodal pre-training that improves fine-grained understanding in vision-language models. Existing methods for learning fine-grained visual representations are computationally expensive, unstable, and often rely on pre-trained models, making it difficult to isolate the benefits of fine-grained objectives. SPARC learns language-grouped vision embeddings by aggregating image patches corresponding to individual words in the caption using a sparse similarity metric. It combines a fine-grained contrastive loss on these embeddings with a global image-text contrastive loss. SPARC outperforms or matches competing methods on zero-shot image classification across ImageNet and its variants. SPARC achieves superior performance on zero-shot image-to-text and text-to-image retrieval on Flickr30k and MSCOCO datasets. SPARC shows significant improvements on fine-grained localization tasks such as open-vocabulary object detection and zero-shot semantic segmentation. Exploring different sparsification approaches and leveraging bounding boxes/segmentation masks for learning patch groupings could further improve performance. Further investigation is needed to evaluate SPARC encoders within multimodal foundational models like Flamingo, BLIP, and PALI. multimodal learning, contrastive learning, vision-language models, fine-grained understanding, image-text retrieval
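The language-grouped vision embedding at the heart of SPARC can be sketched as a sparsified token-to-patch similarity followed by a weighted average over patches; the mean-threshold sparsification used here is an illustrative choice rather than the paper's exact rule.

import torch
import torch.nn.functional as F

def language_grouped_vision(patch_emb, token_emb, threshold=None):
    """For every caption token, pool the image patches it is most similar to.
    Shapes: patch_emb (B, P, D), token_emb (B, T, D). Illustrative sparsification only."""
    sim = torch.einsum('btd,bpd->btp', token_emb, patch_emb)             # token-to-patch similarity
    if threshold is None:                                                # keep above-average patches
        threshold = sim.mean(dim=-1, keepdim=True)
    sparse = torch.where(sim >= threshold, sim, torch.zeros_like(sim))
    weights = sparse / sparse.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    grouped = torch.einsum('btp,bpd->btd', weights, patch_emb)           # one vision vector per token
    return F.normalize(grouped, dim=-1)

# These per-token vision embeddings are then contrasted against the token embeddings of the same
# sample (a sequence-wise loss), alongside the usual global image-text contrastive loss.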
2401.09861 Report Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models Li Sun, Liuan Wang, Jun Sun, Takayuki Okatani Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced the comprehension of multimedia content, bringing together diverse modalities such as text, images, and videos. However, a critical challenge faced by these models, especially when processing video inputs, is the occurrence of hallucinations - erroneous perceptions or interpretations, particularly at the event level. This study introduces an innovative method to address event-level hallucinations in MLLMs, focusing on specific temporal understanding in video content. Our approach leverages a novel framework that extracts and utilizes event-specific information from both the event query and the provided video to refine MLLMs' response. We propose a unique mechanism that decomposes on-demand event queries into iconic actions. Subsequently, we employ models like CLIP and BLIP2 to predict specific timestamps for event occurrences. Our evaluation, conducted using the Charades-STA dataset, demonstrates a significant reduction in temporal hallucinations and an improvement in the quality of event-related responses. This research not only provides a new perspective in addressing a critical limitation of MLLMs but also contributes a quantitatively measurable method for evaluating MLLMs in the context of temporal-related questions. This paper introduces a novel framework to mitigate event-level temporal hallucinations in Multimodal Large Language Models (MLLMs) when processing video inputs, improving accuracy in answering temporal event queries. MLLMs, while proficient in understanding multimedia content, often suffer from hallucinations, particularly in accurately perceiving event timings and sequences in videos, leading to erroneous interpretations. The proposed method decomposes event queries into iconic actions, uses CLIP and BLIP2 models to predict specific timestamps for these actions, and corrects the MLLM's responses using these timestamps as factual evidence. Significantly reduces temporal hallucinations in MLLMs' responses to event-related questions. Demonstrates superior performance in predicting event occurrence timestamps compared to baseline MLLMs and random predictions. Shows substantial improvement in predicting the order of multiple events within a video. The current evaluation is limited to the Charades-STA dataset, potentially limiting the generalizability of the findings. Future work can explore incorporating more sophisticated temporal reasoning mechanisms to further enhance the accuracy of event sequencing. multimodal large language models, temporal hallucination, video understanding, event sequencing, clip, blip2
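Timestamp prediction for a decomposed iconic action reduces to scoring sampled frames against the action text; the sketch below uses the Hugging Face CLIP checkpoint openai/clip-vit-base-patch32 as an assumed stand-in for whichever CLIP/BLIP2 models the paper employs.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def locate_action(frames, frame_times, action_text):
    """frames: list of PIL.Image sampled from the video; frame_times: seconds for each frame.
    Returns the time of the frame that best matches the action description."""
    inputs = processor(text=[action_text], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)   # one similarity score per frame
    return frame_times[int(scores.argmax())]

# The predicted timestamps per action are then fed back as factual evidence to correct the MLLM's answer.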
2401.09794 Report Wavelet-Guided Acceleration of Text Inversion in Diffusion-Based Image Editing Gwanhyeong Koo, Sunjae Yoon, Chang D. Yoo In the field of image editing, Null-text Inversion (NTI) enables fine-grained editing while preserving the structure of the original image by optimizing null embeddings during the DDIM sampling process. However, the NTI process is time-consuming, taking more than two minutes per image. To address this, we introduce an innovative method that maintains the principles of the NTI while accelerating the image editing process. We propose the WaveOpt-Estimator, which determines the text optimization endpoint based on frequency characteristics. Utilizing wavelet transform analysis to identify the image's frequency characteristics, we can limit text optimization to specific timesteps during the DDIM sampling process. By adopting the Negative-Prompt Inversion (NPI) concept, a target prompt representing the original image serves as the initial text value for optimization. This approach maintains performance comparable to NTI while reducing the average editing time by over 80% compared to the NTI method. Our method presents a promising approach for efficient, high-quality image editing based on diffusion models. Presents WaveOpt-Estimator, a novel method to accelerate Null-Text Inversion (NTI) for efficient image editing with diffusion models. NTI enables fine-grained image editing while preserving the original structure but suffers from long processing times. Analyzes the relationship between image frequency components and NTI optimization endpoints using wavelet transform. Employs this analysis to train WaveOpt-Estimator which predicts optimal stopping points for NTI optimization, significantly reducing processing time. Images with different frequency characteristics exhibit varying optimal NTI optimization endpoints. WaveOpt-Estimator accurately predicts optimization endpoints with a Mean Absolute Error (MAE) of 2.9 timesteps. Applying WaveOpt-Estimator to NTI achieves an 80% reduction in processing time compared to standard NTI while maintaining high image quality (PSNR ratio > 0.9). The current implementation primarily focuses on image reconstruction without extensive evaluation on diverse editing prompts. Exploration of other frequency analysis techniques beyond wavelet transform could further enhance the WaveOpt-Estimator's performance. image editing, diffusion models, null-text inversion, wavelet transform, optimization
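The frequency statistic such an estimator could condition on can be sketched with a single-level 2D wavelet transform; the Haar wavelet and the energy-ratio feature below are assumptions, not the paper's exact inputs.

import numpy as np
import pywt

def high_frequency_ratio(image_gray):
    """Share of energy in the detail (high-frequency) sub-bands of a single-level 2D DWT.
    image_gray: 2-D numpy array. Illustrates the kind of frequency feature the WaveOpt-Estimator
    could condition on; the paper's exact features and wavelet are assumptions (Haar here)."""
    cA, (cH, cV, cD) = pywt.dwt2(image_gray.astype(np.float64), 'haar')
    detail = sum(np.sum(c ** 2) for c in (cH, cV, cD))
    return detail / (detail + np.sum(cA ** 2))

# Images with a larger high-frequency share would be assigned a later text-optimization endpoint
# (more NTI-style steps); smoother images can stop earlier, which is where the speed-up comes from.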
2401.09742 Report Image Translation as Diffusion Visual Programmers Cheng Han, James C. Liang, Qifan Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Ying Nian Wu, Dongfang Liu We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, which span RoI identification, style transfer, and position manipulation, facilitating transparent and controllable image translation processes. Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts. This success can be attributed to several key features of DVP: First, DVP achieves condition-flexible translation via instance normalization, enabling the model to eliminate sensitivity caused by the manual guidance and optimally focus on textual descriptions for high-quality content generation. Second, the framework enhances in-context reasoning by deciphering intricate high-dimensional concepts in feature spaces into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]), allowing for localized, context-free editing while maintaining overall coherence. Last but not least, DVP improves systemic controllability and explainability by offering explicit symbolic representations at each programming stage, empowering users to intuitively interpret and modify results. Our research marks a substantial step towards harmonizing artificial image translation processes with cognitive intelligence, promising broader applications. This paper introduces Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework that combines a condition-flexible diffusion model with the GPT architecture for controllable and explainable image manipulation. Existing diffusion-based image translation methods suffer from limitations such as condition-rigid learning, context-free incompetence, and system opacity. This paper addresses these limitations by enabling more flexible and interpretable image translation. DVP leverages GPT to generate visual programs consisting of computer vision models for RoI identification, style transfer, and position manipulation. It utilizes instance normalization to enhance condition-flexibility and decomposes complex concepts into symbols for in-context reasoning. DVP achieves state-of-the-art performance on image translation benchmarks, demonstrating high fidelity and quality. Instance normalization guidance in DVP's diffusion model enhances robustness and eliminates the need for manual guidance scale parameter tuning. The visual programming paradigm enables context-free editing, allowing for specific RoI modifications while preserving overall image coherence. DVP struggles with image translation in challenging situations like poor photometric conditions and occluded objects, suggesting a need for specialized datasets and improved object segmentation. While instance normalization guidance is effective for text-guided diffusion, its application in broader image generation tasks requires further exploration. image translation, diffusion models, visual programming, neuro-symbolic ai, explainable ai
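The condition-flexible guidance rests on instance normalization of intermediate features; the sketch below is plain per-sample, per-channel normalization, shown only to make the operation concrete.

import torch

def instance_normalize(features, eps=1e-5):
    """Per-sample, per-channel normalization of intermediate diffusion features, the kind of
    statistic-based operation the entry credits for removing the manual guidance-scale knob.
    features: (B, C, H, W)."""
    mean = features.mean(dim=(2, 3), keepdim=True)
    std = features.std(dim=(2, 3), keepdim=True)
    return (features - mean) / (std + eps)

x = torch.randn(2, 320, 32, 32)
print(instance_normalize(x).shape)   # torch.Size([2, 320, 32, 32])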
2401.09732 Report Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation Zesen Cheng, Kehan Li, Hao Li, Peng Jin, Chang Liu, Xiawu Zheng, Rongrong Ji, Jie Chen Temporally locating objects with arbitrary class texts is the primary pursuit of open-vocabulary Video Instance Segmentation (VIS). Because of the insufficient vocabulary of video data, previous methods leverage image-text pretraining model for recognizing object instances by separately aligning each frame and class texts, ignoring the correlation between frames. As a result, the separation breaks the instance movement context of videos, causing inferior alignment between video and text. To tackle this issue, we propose to link frame-level instance representations as a Brownian Bridge to model instance dynamics and align bridge-level instance representation to class texts for more precisely open-vocabulary VIS (BriVIS). Specifically, we build our system upon a frozen video segmentor to generate frame-level instance queries, and design Temporal Instance Resampler (TIR) to generate queries with temporal context from frame queries. To mold instance queries to follow Brownian bridge and accomplish alignment with class texts, we design Bridge-Text Alignment (BTA) to learn discriminative bridge-level representations of instances via contrastive objectives. Setting MinVIS as the basic video segmentor, BriVIS surpasses the Open-vocabulary SOTA (OV2Seg) by a clear margin. For example, on the challenging large-vocabulary VIS dataset (BURST), BriVIS achieves 7.43 mAP and exhibits 49.49% improvement compared to OV2Seg (4.97 mAP). This paper proposes BriVIS, an open-vocabulary video instance segmentation method that leverages instance dynamics by modeling instance features as a Brownian Bridge and aligning the bridge center with class text embeddings. Existing open-vocabulary VIS methods rely on aligning individual frames with class texts, neglecting the crucial temporal context of instance movement in videos. BriVIS utilizes a frozen video segmentor to generate frame-level instance queries and employs a Temporal Instance Resampler (TIR) to capture temporal context. A Bridge-Text Alignment (BTA) module then links these features as a Brownian Bridge, aligning the bridge center with corresponding class texts via contrastive learning. BriVIS significantly outperforms previous open-vocabulary VIS methods, achieving a 49.49% improvement on the BURST dataset. Analysis shows BriVIS effectively handles instances spanning long durations, indicating robust temporal modeling. BriVIS demonstrates competitive performance against close-vocabulary VIS methods, highlighting its strong vocabulary generalization ability. The reliance on offline processing due to the Brownian Bridge modeling poses challenges for long videos or video streams. The implicit modeling of temporal context within the CLIP visual space limits its applicability to complex video tasks demanding profound temporal reasoning. open-vocabulary video instance segmentation, brownian bridge, temporal context modeling, contrastive learning, vision-language pretraining
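The Brownian-bridge prior can be sketched as a penalty that keeps each frame's instance feature close to the linear interpolation of the clip's endpoint features; this is an illustrative loss, not BriVIS's exact objective.

import torch
import torch.nn.functional as F

def brownian_bridge_loss(frame_feats):
    """Encourage per-frame instance features to follow a Brownian bridge between the first and
    last frame: the expected feature at normalized time t is (1 - t) * z_0 + t * z_T.
    frame_feats: (T, D) features of one instance across T frames. Illustrative sketch only."""
    T = frame_feats.shape[0]
    t = torch.linspace(0, 1, T, device=frame_feats.device).unsqueeze(1)   # (T, 1)
    expected = (1 - t) * frame_feats[0] + t * frame_feats[-1]
    return F.mse_loss(frame_feats, expected)

# A bridge-level representation for text alignment can then be taken from the bridge itself,
# e.g. the midpoint 0.5 * (frame_feats[0] + frame_feats[-1]), contrasted against class-text embeddings.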
2401.09720 Report GaussianBody: Clothed Human Reconstruction via 3d Gaussian Splatting Mengtian Li, Shengxiang Yao, Zhifeng Xie, Keyu Chen In this work, we propose a novel clothed human reconstruction method called GaussianBody, based on 3D Gaussian Splatting. Compared with the costly neural radiance based models, 3D Gaussian Splatting has recently demonstrated great performance in terms of training time and rendering quality. However, applying the static 3D Gaussian Splatting model to the dynamic human reconstruction problem is non-trivial due to complicated non-rigid deformations and rich cloth details. To address these challenges, our method considers explicit pose-guided deformation to associate dynamic Gaussians across the canonical space and the observation space, introducing a physically-based prior with regularized transformations helps mitigate ambiguity between the two spaces. During the training process, we further propose a pose refinement strategy to update the pose regression for compensating the inaccurate initial estimation and a split-with-scale mechanism to enhance the density of regressed point clouds. The experiments validate that our method can achieve state-of-the-art photorealistic novel-view rendering results with high-quality details for dynamic clothed human bodies, along with explicit geometry reconstruction. This paper introduces GaussianBody, a novel method for clothed human reconstruction from monocular RGB videos, leveraging 3D Gaussian Splatting (3D-GS) for efficient high-fidelity reconstruction. Existing methods struggle to balance high-fidelity reconstruction with fast training and rendering. This work addresses this by adapting 3D-GS for dynamic human reconstruction, enabling fast, detailed, and animatable human modeling. The method utilizes SMPL for pose-guided deformation of canonical Gaussians. A physically-based prior regularizes Gaussian transformations, ensuring geometric consistency. A split-with-scale strategy enhances point cloud density and pose refinement improves SMPL parameter accuracy. GaussianBody achieves state-of-the-art results in novel view synthesis on PeopleSnapshot dataset, outperforming baselines in PSNR, SSIM, and LPIPS metrics. The method generates high-quality point clouds that capture intricate clothing and body details, enabling accurate representation of non-rigid deformations. Ablation studies validate the contribution of the physically-based prior, pose refinement, and split-with-scale strategies to the reconstruction quality. The current implementation faces challenges in novel pose synthesis due to sparse Gaussians and limitations in capturing complex non-rigid cloth deformations. Further investigation is needed to improve the integration of deformation MLPs for more robust and accurate non-rigid deformation handling. 3d human reconstruction, gaussian splatting, novel view synthesis, monocular reconstruction, physically-based priors
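The pose-guided deformation of canonical Gaussians follows the usual linear-blend-skinning form x_obs = sum_k w_k (R_k x + t_k); the sketch below assumes SMPL supplies the per-joint transforms and uses random skinning weights purely for shape-checking.

import torch

def lbs_deform_gaussian_centers(x_canonical, skin_weights, joint_R, joint_t):
    """Linear-blend-skinning sketch for moving canonical Gaussian centers into observation space.
    Shapes: x_canonical (N, 3), skin_weights (N, K), joint_R (K, 3, 3), joint_t (K, 3)."""
    rotated = torch.einsum('kij,nj->nki', joint_R, x_canonical) + joint_t   # (N, K, 3)
    return torch.einsum('nk,nki->ni', skin_weights, rotated)

N, K = 10_000, 24
x = torch.randn(N, 3)
w = torch.softmax(torch.randn(N, K), dim=-1)
R = torch.eye(3).expand(K, 3, 3)
t = torch.zeros(K, 3)
x_obs = lbs_deform_gaussian_centers(x, w, R, t)   # identity pose leaves centers unchanged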
2401.09673 Report Artwork Protection Against Neural Style Transfer Using Locally Adaptive Adversarial Color Attack Zhongliang Guo, Junhao Dong, Yifei Qian, Kaixuan Wang, Weiye Li, Ziheng Guo, Yuheng Wang, Yanli Li, Ognjen Arandjelović, Lei Fang Neural style transfer (NST) generates new images by combining the style of one image with the content of another. However, unauthorized NST can exploit artwork, raising concerns about artists' rights and motivating the development of proactive protection methods. We propose Locally Adaptive Adversarial Color Attack (LAACA), empowering artists to protect their artwork from unauthorized style transfer by processing before public release. By delving into the intricacies of human visual perception and the role of different frequency components, our method strategically introduces frequency-adaptive perturbations in the image. These perturbations significantly degrade the generation quality of NST while maintaining an acceptable level of visual change in the original image, ensuring that potential infringers are discouraged from using the protected artworks, because of its bad NST generation quality. Additionally, existing metrics often overlook the importance of color fidelity in evaluating color-mattered tasks, such as the quality of NST-generated images, which is crucial in the context of artistic works. To comprehensively assess the color-mattered tasks, we propose the Adversarial Color Distance Metric (ACDM), designed to quantify the color difference of images pre- and post-manipulations. Experimental results confirm that attacking NST using LAACA results in visually inferior style transfer, and the ACDM can efficiently measure color-mattered tasks. By providing artists with a tool to safeguard their intellectual property, our work relieves the socio-technical challenges posed by the misuse of NST in the art community. This paper introduces LAACA, a novel method to protect artwork from unauthorized neural style transfer by subtly perturbing style images to disrupt style transfer quality while maintaining visual fidelity. Unauthorized neural style transfer poses risks to artists' rights, demanding proactive protection methods for digital artworks. LAACA leverages frequency domain analysis to strategically embed perturbations in high-frequency areas of style images, maximizing disruption to style transfer while minimizing perceptual changes to the original artwork. Additionally, a new metric, ACDM, is proposed to quantify color differences in images pre- and post-manipulation, addressing the limitations of existing metrics in evaluating color-sensitive tasks. LAACA effectively disrupts the quality of style transfer across five different NST methods, leading to visually inferior results. LAACA preserves the visual integrity of the original artwork, with minimal perceptible changes introduced by the adversarial perturbations. ACDM demonstrates superior performance compared to existing metrics like SSIMc and LPIPS in capturing color differences relevant for evaluating color-mattered tasks such as NST. The current implementation primarily focuses on color disruption, and future work could explore incorporating texture-based disruptions for enhanced protection. Further research could investigate the generalization of LAACA to other domains beyond artistic style transfer where content-style separation is relevant. adversarial attack, neural style transfer, copyright protection, image processing, computer vision
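A frequency-adaptive perturbation can be pictured as a bounded sign-gradient update confined to high-frequency regions; the blur-based mask and generic PGD step below are stand-ins and do not reproduce LAACA's wavelet analysis or its NST-degradation loss.

import torch
import torch.nn.functional as F

def high_freq_mask(image, kernel_size=5):
    """Crude high-frequency mask: |image - blur(image)| thresholded, restricting where the
    perturbation may live (a stand-in for the paper's frequency analysis). image: (B, C, H, W) in [0, 1]."""
    blur = F.avg_pool2d(image, kernel_size, stride=1, padding=kernel_size // 2)
    detail = (image - blur).abs().mean(dim=1, keepdim=True)
    return (detail > detail.mean()).float()

def masked_pgd_step(image, grad, mask, image_orig, step_size=2 / 255, epsilon=8 / 255):
    """One sign-gradient ascent step on some style-disruption loss, confined to high-frequency areas
    and kept within an L-infinity budget around the original artwork."""
    adv = image + step_size * grad.sign() * mask
    adv = image_orig + (adv - image_orig).clamp(-epsilon, epsilon)
    return adv.clamp(0, 1)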
2401.09603 Report Rethinking FID: Towards a Better Evaluation Metric for Image Generation Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, Sanjiv Kumar As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Frechet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images, and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, it does not reflect gradual improvement of iterative text-to-image models, it does not capture distortion levels, and that it produces inconsistent results when varying the sample size. We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality. This paper argues that the commonly used Fréchet Inception Distance (FID) for evaluating image generation models has significant limitations and proposes an alternative metric called CMMD (CLIP-MMD). FID, despite being widely adopted, shows discrepancies with human perception and fails to accurately capture improvements in iterative image generation models or under complex image distortions. The paper analyzes FID's limitations, especially its reliance on normality assumptions for Inception embeddings which are often violated. It then proposes CMMD, which utilizes CLIP embeddings and the Maximum Mean Discrepancy (MMD) distance for a more robust and reliable evaluation. Human evaluation shows that FID contradicts human perception of image quality, while CMMD aligns better with human judgment. CMMD accurately reflects the gradual improvement in iterative image generation models like Muse and Stable Diffusion, unlike FID which shows inconsistent behavior. CMMD effectively captures image quality degradation under complex distortions in the latent space where FID fails. The paper acknowledges that the bandwidth parameter for the Gaussian RBF kernel in CMMD, while empirically observed to have insignificant impact, is fixed at 10 for consistency and proposes further investigation. Future work includes exploring other kernels for MMD and conducting more comprehensive human evaluations. image generation, evaluation metrics, fréchet inception distance (fid), clip embeddings, maximum mean discrepancy (mmd)
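CMMD's core computation, an MMD estimate with a Gaussian RBF kernel over image embeddings, is short enough to sketch directly; whether the paper's reported bandwidth of 10 corresponds to the sigma convention used here, and which CLIP checkpoint supplies the embeddings, are assumptions.

import torch

def cmmd_sketch(x, y, sigma=10.0):
    """Unbiased MMD^2 estimate with a Gaussian RBF kernel between two embedding sets
    (e.g. CLIP features of real vs. generated images). x: (m, d), y: (n, d).
    The kernel parametrization exp(-||a - b||^2 / (2 sigma^2)) is one common convention."""
    def rbf(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    m, n = x.shape[0], y.shape[0]
    k_xx = (rbf(x, x).sum() - m) / (m * (m - 1))   # drop the diagonal (k(x, x) = 1)
    k_yy = (rbf(y, y).sum() - n) / (n * (n - 1))
    k_xy = rbf(x, y).mean()
    return k_xx + k_yy - 2 * k_xy

real = torch.randn(2048, 768)
fake = torch.randn(2048, 768)
print(cmmd_sketch(real, fake))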
2401.09419 Report GARField: Group Anything with Radiance Fields Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, Angjoo Kanazawa Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find it effectively extracts groups at many levels: clusters of objects, objects, and various subparts. GARField inherently represents multi-view consistent groupings and produces higher fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. See the project website at https://www.garfield.studio/ Presents GARField, a method that decomposes 3D scenes into a hierarchy of semantically meaningful groups from posed images by optimizing a scale-conditioned 3D affinity feature field. Grouping is inherently ambiguous due to multiple levels of granularity; GARField addresses this by using physical scale to consolidate groups into a hierarchy. Distills 2D segmentation masks from SAM into a 3D volumetric scale-conditioned affinity field, using contrastive loss and containment auxiliary loss to ensure transitivity and containment properties. Hierarchical decomposition is achieved via recursive clustering at descending scales. Effectively extracts groups at multiple levels (clusters of objects, objects, subparts). Produces consistent 3D groupings, often improving upon the quality of input 2D segmentation masks. Enables applications like 3D asset extraction and interactive segmentation. Limited by the quality and coverage of input 2D masks. Current tree generation is naive and can lead to spurious small groups. 3d scene understanding, hierarchical grouping, scale-conditioned affinity field, nerf, segmentation
2401.09417 Report Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang Recently, the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile, building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim. This paper proposes a novel vision backbone named Vision Mamba (Vim) built upon bidirectional Mamba blocks, marking a departure from self-attention reliance in visual representation learning. This approach aims to address the challenges faced by state space models (SSMs) in visual data representation, particularly the position-sensitivity of visual data and the need for global context. It holds promise for efficient and generic vision backbones based on SSMs. Vim utilizes position embeddings to encode spatial information within image sequences and leverages bidirectional state space models to compress visual representations. Vim outperforms established vision transformers like DeiT in ImageNet classification, COCO object detection, and ADE20k semantic segmentation. Vim demonstrates superior computational and memory efficiency compared to DeiT, especially with high-resolution images. For instance, Vim is 2.8 times faster than DeiT and saves 86.8% GPU memory during batch inference on images at 1248×1248 resolution. The paper does not explicitly mention limitations. Future work could focus on exploring the applicability of Vim in other vision tasks and datasets beyond those investigated in the paper. vision transformer, state space model, mamba, vision backbone, efficient deep learning
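The bidirectional design described above can be shown abstractly: run a sequence block over the patch tokens left-to-right and right-to-left, then merge the two passes with a residual connection. In the sketch below a GRU stands in for the hardware-aware Mamba SSM block, so only the bidirectional pattern, not Vim's actual block, is being demonstrated.

```python
# Sketch of bidirectional sequence modeling over patch tokens. A GRU is only a
# stand-in for the Mamba SSM block; the forward/backward combination is the point.
import torch
import torch.nn as nn

class BidirectionalBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # placeholder for a forward scan
        self.bwd = nn.GRU(dim, dim, batch_first=True)   # placeholder for a backward scan
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                          # tokens: [B, L, D]
        residual = tokens
        x = self.norm(tokens)
        out_fwd, _ = self.fwd(x)                        # scan left-to-right
        out_bwd, _ = self.bwd(torch.flip(x, dims=[1]))  # scan right-to-left
        out_bwd = torch.flip(out_bwd, dims=[1])         # re-align to the original order
        return residual + out_fwd + out_bwd             # merge both directions

if __name__ == "__main__":
    patches = torch.randn(2, 196, 192)                  # 14x14 patches, embedding dim 192
    pos = torch.randn(1, 196, 192)                      # stand-in for position embeddings
    block = BidirectionalBlock(192)
    print(block(patches + pos).shape)                   # torch.Size([2, 196, 192])
```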
2401.09416 Report TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion Yu-Ying Yeh, Jia-Bin Huang, Changil Kim, Lei Xiao, Thu Nguyen-Phuoc, Numair Khan, Cheng Zhang, Manmohan Chandraker, Carl S Marshall, Zhao Dong, Zhengqin Li We present TextureDreamer, a novel image-guided texture synthesis method to transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. Texture creation is a pivotal challenge in vision and graphics. Industrial companies hire experienced artists to manually craft textures for 3D assets. Classical methods require densely sampled views and accurately aligned geometry, while learning-based methods are confined to category-specific shapes within the dataset. In contrast, TextureDreamer can transfer highly detailed, intricate textures from real-world environments to arbitrary objects with only a few casually captured images, potentially significantly democratizing texture creation. Our core idea, personalized geometry-aware score distillation (PGSD), draws inspiration from recent advancements in diffusion models, including personalized modeling for texture information extraction, variational score distillation for detailed appearance synthesis, and explicit geometry guidance with ControlNet. Our integration and several essential modifications substantially improve the texture quality. Experiments on real images spanning different categories show that TextureDreamer can successfully transfer highly realistic, semantically meaningful textures to arbitrary objects, surpassing the visual quality of previous state-of-the-art. TextureDreamer, a novel image-guided texture synthesis method that transfers relightable textures from a few input images (3-5) to target 3D shapes. Texture creation is crucial for realistic 3D content, but existing methods require dense views or are category-specific. This method offers a more accessible approach for diverse objects. The method combines personalized Dreambooth fine-tuning for texture extraction, variational score distillation (VSD) for realistic appearance, and ControlNet for geometry-aware generation (PGSD). Transfers highly detailed textures from real-world images to arbitrary objects. Generates semantically meaningful textures that align with target geometry. Outperforms state-of-the-art methods in qualitative and quantitative evaluations. May bake in lighting from input images into textures. Struggles with transferring special and non-repeated textures. texture synthesis, diffusion models, image-guided, 3d content creation, neural rendering
2401.09414 Report Vlogger: Make Your Dream A Vlog Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang In this work, we present Vlogger, a generic AI system for generating a minute-level video blog (i.e., vlog) of user descriptions. Different from short videos with a few seconds, vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, our Vlogger smartly leverages Large Language Model (LLM) as Director and decomposes a long video generation task of vlog into four key stages, where we invoke various foundation models to play the critical roles of vlog professionals, including (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings, our Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover, we introduce a novel video diffusion model, ShowMaker, which serves as a videographer in our Vlogger for generating the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts, it can effectively enhance spatial-temporal coherence in the snippet. Besides, we design a concise mixed training paradigm for ShowMaker, boosting its capacity for both T2V generation and prediction. Finally, the extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor. The code and model is all available at https://github.com/zhuangshaobin/Vlogger. This paper introduces Vlogger, an AI system that uses large language models (LLMs) and foundation models to automatically generate minute-long, coherent video blogs (vlogs) from user stories. Existing video generation methods struggle to create long, coherent videos with diverse scenes and complex storylines, which are characteristic of vlogs. Vlogger addresses these limitations. Vlogger decomposes vlog generation into four stages: (1) Script creation with LLM as director, (2) Actor design with a character designer, (3) Video shooting with a novel video diffusion model (ShowMaker), and (4) Voiceover using a text-to-speech model. Vlogger achieves state-of-the-art performance on zero-shot text-to-video generation and prediction tasks. It generates vlogs longer than 5 minutes from open-world descriptions, maintaining coherence in script and actor portrayal. The novel ShowMaker component demonstrates effectiveness in generating controllable-duration video snippets with strong spatial-temporal coherence. The current implementation relies on multiple foundation models, which can be computationally expensive. Future work will explore generating higher-resolution vlogs and incorporating more sophisticated editing techniques. video generation, large language models, video diffusion models, vlog generation, foundation models
2401.09413 Report POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: Zero-shot 3D semantic segmentation using existing datasets; 3D grounding and retrieval of free-form language queries, using a small dataset that we propose as an extension of nuScenes. You can find the project page here https://vobecant.github.io/POP3D. This paper proposes POP3D, a novel method for open-vocabulary 3D semantic occupancy prediction from 2D images, enabling 3D grounding, segmentation, and retrieval of objects based on free-form language queries. This approach addresses the limitations of traditional 3D semantic occupancy prediction methods that rely on manually annotated 3D data and are restricted to a predefined set of object classes. POP3D employs a tri-modal self-supervised learning algorithm, leveraging images, LiDAR point clouds, and a pre-trained image-language network (MaskCLIP+). The architecture consists of a 2D-3D encoder, an occupancy prediction head, and a 3D-language head. POP3D achieves superior occupancy prediction compared to a fully supervised counterpart, demonstrating the effectiveness of the proposed tri-modal self-supervised learning approach. The method demonstrates strong performance on zero-shot 3D semantic segmentation, showcasing its open-vocabulary capabilities. POP3D exhibits promising results for language-driven 3D grounding and retrieval tasks, enabling interaction with 3D scenes using natural language queries. The model's performance is limited by the resolution of the voxel grid, hindering its ability to detect small objects. The architecture does not natively support image sequences, which could be beneficial for reasoning about occluded objects and dynamic scenes. Future work could explore incorporating temporal information into the model. 3d semantic occupancy prediction, open-vocabulary learning, tri-modal self-supervised learning, language-driven 3d grounding, zero-shot semantic segmentation
2401.09340 Report SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io. This work presents SceneVerse, the first million-scale 3D vision-language dataset for grounded scene understanding, and GPS, a unified pre-training framework based on multi-level contrastive alignment. Grounding language in 3D scenes is crucial for embodied agents but faces challenges due to scene complexity, data scarcity, and lack of unified learning frameworks. SceneVerse is created by combining existing 3D scene data with automatically generated scene-language pairs using scene graphs and LLMs. GPS leverages this data with multi-level contrastive alignment for object-level, scene-level, and referral-object-level grounding. GPS achieves state-of-the-art results on all existing 3D visual grounding benchmarks. Pre-trained GPS shows strong zero-shot generalization capabilities for grounded scene understanding. Scaling data, especially with realistic scenes, significantly benefits 3D visual grounding and other 3D understanding tasks like semantic segmentation. The domain gap between real and synthetic scenes poses challenges for generalization. Future work should focus on collecting more diverse, realistic, and large-scale 3D scenes. 3d vision-language grounding, 3d scene understanding, million-scale dataset, contrastive learning, zero-shot transfer
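The multi-level contrastive alignment mentioned above reduces, at each level, to a symmetric InfoNCE loss between matched 3D and text embeddings. Below is a generic sketch of one such level with random tensors standing in for the 3D and language encoders; a full setup would sum this term over object-, scene-, and referral-object-level pairs, which is not shown here.

```python
# Generic symmetric InfoNCE loss between matched 3D-object and text embeddings,
# as one level of a multi-level contrastive alignment. Encoders are stand-ins.
import torch
import torch.nn.functional as F

def info_nce(feat_3d, feat_text, temperature=0.07):
    """Symmetric contrastive loss; row i of each batch is a matched 3D/text pair."""
    feat_3d = F.normalize(feat_3d, dim=-1)
    feat_text = F.normalize(feat_text, dim=-1)
    logits = feat_3d @ feat_text.t() / temperature      # [B, B] similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    loss_3d_to_text = F.cross_entropy(logits, targets)
    loss_text_to_3d = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_3d_to_text + loss_text_to_3d)

if __name__ == "__main__":
    obj_emb = torch.randn(32, 256)   # stand-in for pooled 3D object features
    txt_emb = torch.randn(32, 256)   # stand-in for language features of their captions
    print(info_nce(obj_emb, txt_emb).item())
```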
2401.09084 Report UniVG: Towards UNIfied-modal Video Generation Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao Diffusion based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the pure random Gaussian Noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit https://univg-baidu.github.io. This paper presents UniVG, a unified video generation system that handles multiple video generation tasks (e.g., text-to-video, image-to-video) within a single framework. Current video generation models are limited to single-objective or single-task pipelines, lacking flexibility to meet diverse user needs who might input text and image conditions in various combinations. UniVG categorizes video generation tasks by "generative freedom": high-freedom (e.g., text/image-to-video) uses Multi-condition Cross Attention, and low-freedom (e.g., image animation, super-resolution) employs Biased Gaussian Noise for better content preservation. UniVG achieves the lowest FVD on the MSR-VTT benchmark, surpassing open-source methods. Human evaluations show UniVG is on par with the closed-source Gen2 and outperforms other open-source methods. Ablation studies confirm the effectiveness of Multi-condition Cross Attention and Biased Gaussian Noise for their respective categories. The current model struggles to generate videos with a large amount of motion, potentially due to limitations in training data. Future work will explore alternative solutions for Biased Gaussian Noise and extend its application to other low-freedom video generation tasks. video generation, diffusion models, multi-modal generation, image animation, video super-resolution
2401.09050 Report Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation, but are vulnerable to geometry collapse and poor textures yet. To solve this issue, we first deeply analyze the SDS and find that its distillation sampling process indeed corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample which then serves as a guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to a diverse and unpredictable sample which is not always less noisy, and thus is not a consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE, there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the desired target point as the SDE, we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically, at each training iteration, given a rendered image by a 3D model, we first estimate its desired 3D score function by a pre-trained 2D diffusion model, and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss which samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide another more noisy one for distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes, as shown in Fig. 1. The codes are available at https://github.com/sail-sg/Consistent3D. Consistent3D, a novel text-to-3D generation method using deterministic sampling priors to address the geometry collapse and poor texture issues in Score Distillation Sampling (SDS). SDS, while effective, suffers from geometry collapse and poor textures due to the inherent randomness in its SDE-based sampling process. Consistent3D leverages the deterministic nature of ODE trajectories, proposing a Consistency Distillation Sampling (CDS) loss to distill deterministic priors into a 3D model. It utilizes a fixed noise perturbation and samples adjacent points on the ODE trajectory to guide 3D model optimization. Generates high-fidelity and diverse 3D objects and large-scale scenes. Outperforms existing methods like DreamFusion, Magic3D, and ProlificDreamer in both qualitative and quantitative evaluations. Effectively addresses the randomness issue in SDS, providing more consistent and reliable guidance for 3D model optimization. Reliance on pre-trained diffusion models without 3D priors might limit performance in complex scenarios. Potential bias transfer from pre-trained models needs further investigation. text-to-3d generation, score distillation sampling, diffusion models, ordinary differential equations, deterministic sampling
2401.09048 Report Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/ The paper introduces Compose and Conquer (CnC), a text-conditional diffusion model that integrates control over 3D object placement with disentangled global stylistic semantics from multiple exemplar images. Existing text-conditional diffusion models struggle with accurate 3D object placement and localizing global semantics from multiple sources. CnC addresses these limitations. CnC utilizes two novel techniques: Depth Disentanglement Training (DDT) for 3D object placement and 'soft guidance' for localizing global semantics. DDT leverages synthetic image triplets to teach the model relative depth, while soft guidance selectively masks cross-attention layers to inject regional semantics. CnC outperforms baseline models in generating images with accurate 3D object placement and localized global semantics. The model demonstrates strong reconstruction ability, faithfully recreating objects in varying depths. Soft guidance effectively prevents concept bleeding, ensuring global semantics are applied to targeted regions without unintended overlaps. The current framework limits the number of conditions and disentangled spatial grounds. Future work includes decomposing images into finer depth primitives and exploring the 'middle ground'. diffusion models, 3d object placement, global semantic localization, depth disentanglement training, soft guidance
2401.09047 Report VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition. This paper presents a method to train high-quality video diffusion models without relying on large-scale, high-quality video datasets. Existing research on text-to-video generation struggles to produce high-quality videos due to the reliance on low-quality datasets like WebVid-10M, while commercial models use private high-quality data inaccessible to the public. The authors analyze the connection between spatial and temporal modules in video diffusion models and leverage it to overcome data limitations. They propose a two-stage pipeline: 1) Fully train a video model with low-quality videos. 2) Fine-tune the spatial modules of the trained model using synthesized high-quality images. The method generates videos with comparable visual quality to commercial models trained on high-quality videos. The proposed training strategy maintains good motion consistency without significant degradation. Using synthesized images with complex concepts for finetuning improves the model's concept composition ability. The motion quality, while improved, is not yet on par with models trained on large-scale, high-quality video data. The research focuses on a specific video diffusion model architecture based on Stable Diffusion; generalizability to other architectures needs further exploration. text-to-video generation, video diffusion models, data limitations, stable diffusion, concept composition
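The second training stage above boils down to updating only the spatial layers of the video diffusion model on high-quality images while keeping the temporal layers frozen. The PyTorch sketch below shows that parameter split; the toy module and the `temporal`-substring naming convention are assumptions for illustration, not the project's actual code.

```python
# Sketch: finetune spatial modules of a video model while freezing temporal ones.
# The name-based selection ("temporal" substring) is an illustrative assumption.
import torch
import torch.nn as nn

class ToyVideoBlock(nn.Module):
    """Stand-in for one U-Net block with spatial and temporal sub-layers."""
    def __init__(self, dim=64):
        super().__init__()
        self.spatial_conv = nn.Conv2d(dim, dim, 3, padding=1)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

def split_parameters(model):
    spatial, temporal = [], []
    for name, param in model.named_parameters():
        (temporal if "temporal" in name else spatial).append((name, param))
    return spatial, temporal

if __name__ == "__main__":
    model = nn.Sequential(ToyVideoBlock(), ToyVideoBlock())
    spatial, temporal = split_parameters(model)
    for _, p in temporal:
        p.requires_grad_(False)                      # keep motion modules fixed
    optimizer = torch.optim.AdamW(
        [p for _, p in spatial], lr=1e-5             # update spatial layers on HQ images only
    )
    print(f"trainable: {len(spatial)} tensors, frozen: {len(temporal)} tensors")
```

The design choice the paper highlights is that full training couples the two module groups strongly enough that this image-only spatial finetuning shifts visual quality without degrading motion.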
2401.08973 Report OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality Aditya Sharma, Luke Yoffe, Tobias Höllerer One key challenge in Augmented Reality is the placement of virtual content in natural locations. Most existing automated techniques can only work with a closed-vocabulary, fixed set of objects. In this paper, we introduce and evaluate several methods for automatic object placement using recent advances in open-vocabulary vision-language models. Through a multifaceted evaluation, we identify a new state-of-the-art method, OCTO+. We also introduce a benchmark for automatically evaluating the placement of virtual objects in augmented reality, alleviating the need for costly user studies. Through this, in addition to human evaluations, we find that OCTO+ places objects in a valid region over 70% of the time, outperforming other methods on a range of metrics. This paper introduces OCTO+, a novel pipeline for automatically placing virtual objects in augmented reality scenes using open-vocabulary vision-language models, and PEARL, a benchmark for evaluating these placements. Automatic and natural object placement is crucial for AR applications but challenging due to the need for open-vocabulary understanding and reasoning about object relationships. OCTO+ uses a 3-stage approach: 1) Image Understanding (RAM++ with Grounding DINO filtering), 2) Reasoning (GPT-4 to select the most natural placement surface), and 3) Locating (Grounded-Segment-Anything to determine 2D coordinates). OCTO+ outperforms previous methods, including GPT-4V and OCTOPUS, in placing objects naturally. The proposed PEARL-Score metric, based on placement within valid regions and distance from edges, aligns with human judgment. Human evaluation confirms that OCTO+ achieves natural placements comparable to expert annotations. Current pipeline is slow, taking up to 10 seconds per placement. Placement logic doesn't consider object-specific orientations or more complex spatial relationships. object placement, open-vocabulary, benchmark, mixed reality, vision-language models
2401.08937 Report ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization Weiyao Wang, Pierre Gleize, Hao Tang, Xingyu Chen, Kevin J Liang, Matt Feiszli Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However, NeRF training requires accurate camera pose for each input view, typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint, but they still often rely on decent initial poses which they can refine. Here we aim at removing the requirement for pose initialization. We present Incremental CONfidence (ICON), an optimization procedure for training NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate an initial guess for poses. Further, ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF, and high-confidence 3D structure (as encoded by NeRF) to learn poses. We show that ICON, without prior pose initialization, achieves superior performance in both CO3D and HO3D versus methods which use SfM pose. This paper proposes ICON (Incremental CONfidence), an optimization procedure for training NeRFs from 2D video frames without requiring pose initialization by leveraging smooth camera motion and confidence-guided optimization. Existing methods for 3D object reconstruction from monocular video either rely on depth information or accurate camera poses, limiting their applicability in real-world scenarios where depth is unavailable and pose estimation is challenging. ICON incrementally registers video frames by leveraging motion smoothness and introduces a Neural Confidence Field to measure confidence in pose and 3D structure, using it to reweight gradients during optimization and escape local minima. ICON achieves superior performance on CO3D compared to methods requiring SfM pose initialization, even surpassing NeRF trained with COLMAP poses. On HO3D, ICON achieves comparable pose tracking accuracy to state-of-the-art RGB-D methods while using only RGB input. Ablation studies demonstrate the importance of incremental registration, confidence-based optimization, and restarts for handling challenging scenarios. ICON heavily relies on photometric consistency across viewpoints, limiting its performance in scenes with significant lighting variations, reflections, or transparency. The reliance on gradient-based optimization through NeRF makes training computationally expensive. neural radiance fields, pose estimation, 3d reconstruction, confidence-based optimization, incremental registration
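The "confidence" in the entry above acts as a per-frame weight that scales how strongly each frame's photometric error drives the NeRF and pose updates. The sketch below shows a drastically simplified version of that reweighting; the confidence update rule (a smoothed inverse of the residual) is an assumption for illustration, not ICON's Neural Confidence Field.

```python
# Simplified sketch: reweight per-frame photometric losses by a confidence score so
# that low-confidence frames contribute weaker gradients. The EMA-style confidence
# update from residuals is an illustrative assumption.
import torch

def confidence_weighted_loss(pred_rgb, gt_rgb, confidence):
    """pred_rgb, gt_rgb: [F, N, 3] per-frame ray colors; confidence: [F] in (0, 1]."""
    per_frame = ((pred_rgb - gt_rgb) ** 2).mean(dim=(1, 2))     # photometric error per frame
    return (confidence.detach() * per_frame).mean(), per_frame

def update_confidence(confidence, per_frame_error, momentum=0.9):
    """Higher residual -> lower confidence; smoothed across iterations."""
    new_conf = 1.0 / (1.0 + per_frame_error.detach())
    return momentum * confidence + (1.0 - momentum) * new_conf

if __name__ == "__main__":
    frames, rays = 8, 1024
    confidence = torch.ones(frames)
    pred = torch.rand(frames, rays, 3, requires_grad=True)      # stand-in for NeRF renders
    gt = torch.rand(frames, rays, 3)
    loss, err = confidence_weighted_loss(pred, gt, confidence)
    loss.backward()                                             # gradients scaled by confidence
    confidence = update_confidence(confidence, err)
    print(loss.item(), confidence)
```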
2401.08930 Report 3D Human Pose Analysis via Diffusion Synthesis Haorui Ji, Hongdong Li Diffusion models have demonstrated remarkable success in generative modeling. In this paper, we propose PADS (Pose Analysis by Diffusion Synthesis), a novel framework designed to address various challenges in 3D human pose analysis through a unified pipeline. Central to PADS are two distinctive strategies: i) learning a task-agnostic pose prior using a diffusion synthesis process to effectively capture the kinematic constraints in human pose data, and ii) unifying multiple pose analysis tasks like estimation, completion, denoising, etc, as instances of inverse problems. The learned pose prior will be treated as a regularization imposing on task-specific constraints, guiding the optimization process through a series of conditional denoising steps. PADS represents the first diffusion-based framework for tackling general 3D human pose analysis within the inverse problem framework. Its performance has been validated on different benchmarks, signaling the adaptability and robustness of this pipeline. PADS: a novel framework that tackles various 3D human pose analysis problems in a unified diffusion-based pipeline, formulating them as instances of inverse problems. Addresses limitations of current 3D human pose analysis methods that rely on large paired datasets and are limited in application scope. 1) Learns a task-agnostic pose prior using a diffusion synthesis process to capture kinematic constraints in human pose data. 2) Unifies multiple pose analysis tasks as inverse problems, using the learned pose prior as regularization during optimization through conditional denoising steps. Achieves state-of-the-art performance on H36M for 3D human pose estimation, outperforming existing unsupervised methods. Demonstrates effective pose denoising capabilities, handling various noise types and intensities. Shows promising results in pose completion, successfully reconstructing missing parts of human skeletons. Currently validated only on pose-based representations, with potential for generalization to mesh or implicit function representations. Utilizes an image domain inverse problem solver (DPS); a tailored solver for human pose analysis could enhance performance. 3d human pose analysis, diffusion models, inverse problems, pose prior, zero-shot learning
2401.08815 Report Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points). The paper introduces adversarial supervision and multistep unrolling strategy for layout-to-image (L2I) diffusion models to improve layout faithfulness without sacrificing text editability. Current L2I models struggle to balance adherence to layout conditions and flexibility in text-based editing, limiting their practical applications. The authors employ a segmentation-based discriminator to provide explicit feedback on layout alignment and introduce multistep unrolling to enforce consistent adherence to the layout over sampling steps. Adversarial supervision and multistep unrolling consistently improve layout faithfulness across different L2I diffusion models. The proposed ALDM model achieves a balance between layout faithfulness (high mIoU) and text editability (high TIFA score). Synthetic data augmentation using ALDM significantly improves domain generalization performance for semantic segmentation. Attribute editing can leak to unintended objects. Perfect alignment with the layout map is not always achieved, especially with rare text prompts. layout-to-image synthesis, diffusion models, adversarial training, text controllability, domain generalization
2401.08742 Report Fast Dynamic 3D Object Generation from a Single-view Video Zijie Pan, Zeyu Yang, Xiatian Zhu, Li Zhang Generating dynamic 3D object from a single-view video is challenging due to the lack of 4D labeled data. Extending image-to-3D pipelines by transferring off-the-shelf image generation models such as score distillation sampling, existing methods tend to be slow and expensive to scale due to the need for back-propagating the information-limited supervision signals through a large pretrained model. To address this, we propose an efficient video-to-4D object generation framework called Efficient4D. It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data to directly train a novel 4D Gaussian splatting model with explicit point cloud geometry, enabling real-time rendering under continuous camera trajectories. Extensive experiments on synthetic and real videos show that Efficient4D offers a remarkable 20-fold increase in speed when compared to prior art alternatives while preserving the quality of novel view synthesis. For example, Efficient4D takes only 6 mins to model a dynamic object, vs 120 mins by Consistent4D. This paper proposes "Efficient4D", an efficient two-staged pipeline for generating dynamic 3D objects from single-view videos. Existing methods for 4D object generation are computationally expensive and slow, hindering practical applications. Efficient4D addresses this efficiency challenge while maintaining high-quality novel view synthesis. The first stage generates temporally consistent multi-view images using a modified SyncDreamer with time-synchronous spatial volumes and frame interpolation. The second stage reconstructs the dynamic object using a novel 4D Gaussian splatting model optimized with a confidence-aware loss. Efficient4D achieves a 20x speedup compared to prior art (Consistent4D). It demonstrates superior novel view synthesis quality both qualitatively and quantitatively. The method is robust even with sparse input, generating smooth dynamics from as few as two frames. The local smoothing approach in image generation struggles with long videos. Future work could explore learnable attention layers for global receptive fields. Handling long videos might require significant GPU memory. Utilizing multi-GPU or CPU solutions could alleviate this issue at the cost of processing time. 4d generation, gaussian splatting, video, efficiency, novel view synthesis
2401.08741 Report Fixed Point Diffusion Models Xingjian Bai, Luke Melas-Kyriazi We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to image generation that integrates the concept of fixed point solving into the framework of diffusion-based generative modeling. Our approach embeds an implicit fixed point solving layer into the denoising network of a diffusion model, transforming the diffusion process into a sequence of closely-related fixed point problems. Combined with a new stochastic training method, this approach significantly reduces model size, reduces memory usage, and accelerates training. Moreover, it enables the development of two new techniques to improve sampling efficiency: reallocating computation across timesteps and reusing fixed point solutions between timesteps. We conduct extensive experiments with state-of-the-art models on ImageNet, FFHQ, CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in performance and efficiency. Compared to the state-of-the-art DiT model, FPDM contains 87% fewer parameters, consumes 60% less memory during training, and improves image generation quality in situations where sampling computation or time is limited. Our code and pretrained models are available at https://lukemelas.github.io/fixed-point-diffusion-models. The paper introduces Fixed Point Diffusion Model (FPDM), a novel image generation approach integrating fixed point solving into diffusion models for reduced model size, memory usage, and improved sampling efficiency. Diffusion models are computationally expensive, posing challenges for deployment, especially on resource-constrained devices. FPDM addresses this by significantly reducing resource requirements while maintaining or enhancing image generation quality. FPDM incorporates an implicit fixed point layer within a denoising diffusion model, transforming the diffusion process into a sequence of fixed point problems. This allows for flexible computation allocation across timesteps and reuse of solutions between timesteps. A new training method, Stochastic Jacobian-Free Backpropagation (S-JFB), enables efficient training of the implicit layer. FPDM achieves superior image generation quality compared to DiT with significantly fewer parameters (87% reduction) and lower memory usage (60% reduction) when sampling computation is limited. Smoothing computation across multiple timesteps in FPDM proves more effective than using fewer timesteps with more iterations per step, as is necessary in standard diffusion models. Reusing fixed point solutions from previous timesteps significantly accelerates convergence, especially when the number of iterations per timestep is limited. FPDM's performance is slightly worse than DiT when sampling computation is not constrained, suggesting further optimization is needed for high-compute regimes. The paper focuses on image generation, leaving exploration of FPDM's applicability to other domains, such as video or audio generation, for future work. diffusion models, image generation, implicit neural networks, fixed point solving, efficient sampling
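The core mechanism above, an implicit fixed-point layer plus reuse of the previous timestep's solution as initialization, can be sketched generically. The update map below is a toy network, the stopping rule is a plain relative tolerance, and the stochastic Jacobian-free backward pass (S-JFB) is omitted, so this is a schematic of the idea rather than the FPDM code.

```python
# Schematic sketch of an implicit fixed-point layer: iterate z <- f(z, x) until a
# relative tolerance is met, and warm-start from the previous timestep's solution
# to reduce the number of iterations. The update map is a toy network.
import torch
import torch.nn as nn

class FixedPointLayer(nn.Module):
    def __init__(self, dim, max_iters=20, tol=1e-4):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())  # toy update map f(z, x)
        self.max_iters, self.tol = max_iters, tol

    def forward(self, x, z_init=None):
        z = torch.zeros_like(x) if z_init is None else z_init
        for _ in range(self.max_iters):
            z_next = self.f(torch.cat([z, x], dim=-1))
            if (z_next - z).norm() < self.tol * z.norm().clamp_min(1e-8):
                return z_next                        # converged early
            z = z_next
        return z

if __name__ == "__main__":
    layer = FixedPointLayer(dim=32)
    x_t = torch.randn(4, 32)                         # stand-in for token features at one timestep
    z_prev = None
    with torch.no_grad():
        for t in range(5):                           # mimic consecutive sampling timesteps
            z_prev = layer(x_t + 0.01 * t, z_init=z_prev)   # reuse the last solution as init
    print(z_prev.shape)
```

Warm-starting is what lets the model trade iterations between timesteps: nearby timesteps have nearby fixed points, so fewer iterations per step are needed.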
2401.08740 Report SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06. Presents Scalable Interpolant Transformers (SiT), a family of generative models built on Diffusion Transformers (DiT) that surpasses DiT's performance by leveraging a flexible interpolant framework. To explore design choices impacting generative models built on dynamical transport and to simplify the learning problem for improved performance. Gradually transitions from a denoising diffusion model to an interpolant model, exploring choices of: discrete vs. continuous time learning, predicting velocity vs. score, various interpolants, and deterministic or stochastic sampling. SiT surpasses DiT across all model sizes on ImageNet 256x256 benchmark using identical backbones and training compute. Learning a velocity model with a weighted score objective significantly improves performance over learning a score model. SDE sampling generally outperforms ODE sampling, with tunable diffusion coefficients further enhancing results. The performance comparison between DDIM and Heun samplers is not directly comparable due to different orders of discretization. Exploration of higher-order solvers did not yield performance improvements. generative models, diffusion models, transformers, image generation, stochastic interpolants
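In the simplest linear case, the interpolant view above amounts to regressing the velocity of x_t = (1 - t) * x0 + t * noise and then integrating the learned ODE at sampling time. The toy sketch below does exactly that on 2D data with a small MLP and a plain Euler sampler; the direction convention (t = 0 data, t = 1 noise), the architecture, and all hyperparameters are illustrative assumptions rather than SiT's configuration.

```python
# Toy sketch of interpolant-based training: learn the velocity of the linear path
# x_t = (1 - t) * x0 + t * noise, then sample by Euler-integrating the learned ODE
# from noise (t = 1) back to data (t = 0). Data is a 2D toy distribution, not images.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):                         # t: [B, 1]
        return self.net(torch.cat([x, t], dim=-1))

def sample_data(n):                                  # two Gaussian blobs as stand-in "data"
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])[torch.randint(0, 2, (n,))]
    return centers + 0.3 * torch.randn(n, 2)

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):                             # velocity-matching training loop
    x0, noise = sample_data(256), torch.randn(256, 2)
    t = torch.rand(256, 1)
    x_t = (1 - t) * x0 + t * noise                   # linear interpolant between data and noise
    target_v = noise - x0                            # time derivative of the interpolant
    loss = ((model(x_t, t) - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                # deterministic Euler ODE sampler
    x, steps = torch.randn(512, 2), 100
    for i in range(steps):
        t = torch.full((512, 1), 1.0 - i / steps)
        x = x - (1.0 / steps) * model(x, t)          # step from noise (t=1) toward data (t=0)
print(x.mean(dim=0), x.std(dim=0))
```

Swapping the Euler loop for an SDE sampler with a chosen diffusion coefficient is the tunable-at-inference axis the entry refers to; it is not shown here.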
2401.08725 Report Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks Chenyu Zhang, Lanjun Wang, Anan Liu Recent developments in text-to-image models, particularly Stable Diffusion, have marked significant achievements in various applications. With these advancements, there are growing safety concerns about the vulnerability of the model that malicious entities exploit to generate targeted harmful images. However, the existing methods in the vulnerability of the model mainly evaluate the alignment between the prompt and generated images, but fall short in revealing the vulnerability associated with targeted image generation. In this study, we formulate the problem of targeted adversarial attack on Stable Diffusion and propose a framework to generate adversarial prompts. Specifically, we design a gradient-based embedding optimization method to craft reliable adversarial prompts that guide stable diffusion to generate specific images. Furthermore, after obtaining successful adversarial prompts, we reveal the mechanisms that cause the vulnerability of the model. Extensive experiments on two targeted attack tasks demonstrate the effectiveness of our method in targeted attacks. The code can be obtained in https://github.com/datar001/Revealing-Vulnerabilities-in-Stable-Diffusion-via-Targeted-Attacks. This paper proposes a targeted adversarial attack framework for Stable Diffusion to generate images of specific categories (objects or styles) from seemingly unrelated prompts. This work addresses the growing safety concerns regarding the vulnerability of text-to-image models like Stable Diffusion to malicious manipulation for generating harmful content. The framework uses two perturbation strategies (word substitution, suffix addition), a gradient-based embedding optimization method, and leverages image-text matching similarity to guide adversarial prompt generation. It also includes techniques to enhance prompt stealthiness and maintain semantic consistency. The proposed method significantly outperforms existing attack methods in terms of attack success rate and generated image quality. The study reveals that verbs and prepositions play a crucial role in manipulating image generation, while longer suffixes increase the likelihood of successful attacks. The analysis of successful attacks reveals vulnerabilities in Stable Diffusion related to the use of culturally diverse lexicon, hidden semantic connections, and the influence of early denoising steps on image generation. The negative correlation between attack success rate and semantic consistency in style attacks needs further investigation. Future work will focus on exploring more sophisticated perturbation strategies and delving deeper into the model's vulnerability mechanisms. adversarial attack, stable diffusion, text-to-image generation, model vulnerability, prompt engineering
2401.08570 Report RoHM: Robust Human Motion Reconstruction via Diffusion Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo We propose RoHM, an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. The former do not recover globally coherent motion and fail under occlusions; the latter are time-consuming, prone to local minima, and require manual tuning. To overcome these shortcomings, we exploit the iterative, denoising nature of diffusion models. RoHM is a novel diffusion-based motion model that, conditioned on noisy and occluded input data, reconstructs complete, plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models, one for global trajectory and one for local motion. To capture the correlations between the two, we then introduce a novel conditioning module, combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html. This paper introduces RoHM, a novel diffusion-based approach for robust 3D human motion reconstruction from monocular RGB(-D) videos, effectively handling noise and occlusions. Reconstructing plausible 3D human motion from monocular videos is crucial for various applications, but existing methods often struggle with noise and occlusions, especially over extended periods. RoHM addresses these challenges by leveraging the iterative and generative nature of diffusion models. RoHM employs two diffusion models: TrajNet for global trajectory reconstruction and PoseNet for local pose estimation. It introduces TrajControl, a flexible conditioning module, to capture correlations between global and local motion. The model is trained on a large-scale motion capture dataset with synthetic noise and occlusions, and employs an iterative inference scheme for motion refinement. To further enhance realism, score-guided sampling is used with physics-based and image-based scores. RoHM significantly outperforms state-of-the-art optimization-based methods in terms of accuracy and physical plausibility, as evidenced by experiments on AMASS, PROX, and EgoBody datasets. The method exhibits robustness to high levels of noise and varying occlusion patterns, demonstrating its ability to recover realistic motion dynamics even from significantly corrupted input. RoHM achieves a substantial speedup of 30 times compared to optimization-based counterparts during inference, making it a promising approach for real-time applications. One limitation is its current offline processing nature, hindering real-time performance. The method relies on 3D scene information and 2D joint detections for occlusion handling, posing challenges when these inputs are unreliable or unavailable. 3d human motion reconstruction, diffusion models, motion denoising and in-filling, robust motion estimation, monocular rgb(-d) videos
2401.08559 Report Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation Mathis Petrovich, Or Litany, Umar Iqbal, Michael J. Black, Gül Varol, Xue Bin Peng, Davis Rempe Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts. Our code and models are publicly available at https://mathis.petrovich.fr/stmc. This paper introduces "multi-track timeline control" for text-driven 3D human motion synthesis, allowing users to specify complex actions with precise timing using a timeline interface. Current text-to-motion synthesis methods lack fine-grained control, making it difficult to compose multiple actions and define precise durations. This new method offers an intuitive solution for animators. The authors propose "Spatio-Temporal Motion Collage" (STMC), a test-time denoising method that leverages pre-trained motion diffusion models. STMC processes individual timeline intervals independently, stitching them together spatially and temporally for coherent motion. STMC outperforms baselines adapted from existing methods in both semantic correctness and realism, as demonstrated by quantitative metrics and perceptual studies. The method effectively handles spatial and temporal composition, accurately reflecting the semantics and timing of text prompts within the timeline. The authors also introduce an improved motion diffusion model with SMPL support, resulting in faster sampling and direct SMPL pose generation. The method's performance depends on the underlying pre-trained diffusion model, inheriting its limitations in handling complex compositions. Current implementation restricts overlapping motions to compatible body parts, similar to the SINC method. 3d human motion synthesis, text-driven animation, timeline control, motion diffusion models, spatio-temporal motion collage
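The test-time composition above can be pictured as follows: at each denoising step, every timeline interval is denoised with its own prompt, and each prediction is written back only onto the frames and body parts that interval controls. The sketch below fakes the denoiser with a placeholder function and uses a simplified "last writer wins" aggregation, so it conveys the bookkeeping rather than STMC's actual stitching and diffusion model.

```python
# Schematic sketch of composing one denoising step from a multi-track timeline:
# denoise each (prompt, frame range, body parts) interval separately, then write the
# prediction back onto only the frames/joints that interval controls. The denoiser is
# a placeholder and the aggregation is a simplified "last writer wins" rule.
import numpy as np

NUM_FRAMES, NUM_JOINTS, FEAT = 120, 22, 6            # motion tensor: [frames, joints, features]
BODY_PARTS = {"legs": [0, 1, 2, 3, 4], "arms": [14, 15, 16, 17], "torso": [5, 6, 7, 8]}

def denoise_step(x_noisy, prompt):
    """Placeholder for one step of a pretrained text-to-motion diffusion model."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return 0.9 * x_noisy + 0.1 * rng.normal(size=x_noisy.shape)

def compose_step(x_noisy, timeline):
    x_out = x_noisy.copy()
    for prompt, (start, end), parts in timeline:
        crop = x_noisy[start:end]                    # denoise this interval with its own prompt
        pred = denoise_step(crop, prompt)
        joints = sum((BODY_PARTS[p] for p in parts), [])
        x_out[start:end, joints] = pred[:, joints]   # write back only the controlled body parts
    return x_out

if __name__ == "__main__":
    timeline = [
        ("walk in a circle", (0, 120), ["legs", "torso"]),
        ("wave the right hand", (30, 80), ["arms"]),
    ]
    x = np.random.default_rng(0).normal(size=(NUM_FRAMES, NUM_JOINTS, FEAT))
    x = compose_step(x, timeline)                    # one of many denoising steps
    print(x.shape)
```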
2401.08541 Report Scalable Pre-training of Large Autoregressive Image Models Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scale with both the model capacity and the quantity of data, (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implication of these findings by pre-training a 7 billion parameter AIM on 2 billion images, that achieves 84.0% on ImageNet-1k with a frozen trunk. Interestingly, even at this scale, we observe no sign of saturation in performance, suggesting that AIM potentially represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to the pre-training of LLMs, and does not require any image-specific strategy to stabilize the training at scale. The paper introduces Autoregressive Image Models (AIM), a collection of vision models pre-trained with an autoregressive objective, achieving competitive performance with scaling properties similar to Large Language Models (LLMs). The paper explores the generalization of LLM's success in scaling transformers with an autoregressive objective to the vision domain. The paper utilizes a prefix attention mechanism, heavily parameterized token-level prediction head, and pixel-level regression loss to train ViT models on a large dataset of uncurated web images (DFN-2B) with an autoregressive objective. AIM performance scales with both model capacity and data quantity. The autoregressive objective function value correlates with downstream task performance. A 7 billion parameter AIM pre-trained on 2 billion images achieves 84.0% accuracy on ImageNet-1k with a frozen trunk, outperforming prior generative methods and nearing joint embedding method performance. Other methods like MAE show higher sample efficiency and lower risk of overfitting with smaller datasets. Contrastive methods achieve better performance for a given model size but face scalability and loss tractability challenges. autoregressive models, vision transformers, self-supervised learning, pre-training at scale, generative pre-training
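Two of the ingredients named above, the prefix attention mask and the pixel-level regression objective, are easy to sketch in PyTorch. The tiny tensors and the prefix length are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def prefix_attention_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # True = attention allowed. Causal lower-triangular mask, with the prefix
    # block made fully bidirectional.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:prefix_len, :prefix_len] = True
    return mask

def autoregressive_pixel_loss(pred_patches: torch.Tensor,
                              target_patches: torch.Tensor,
                              prefix_len: int) -> torch.Tensor:
    # Predict patch t+1 from positions <= t; prefix positions are excluded
    # from the loss because they attend to future tokens.
    pred = pred_patches[:, prefix_len:-1]
    target = target_patches[:, prefix_len + 1:]
    return F.mse_loss(pred, target)

B, N, P = 2, 16, 48                      # batch, patches per image, pixels per patch
patches = torch.randn(B, N, P)
mask = prefix_attention_mask(N, prefix_len=4)
loss = autoregressive_pixel_loss(torch.randn(B, N, P), patches, prefix_len=4)
print(mask.shape, loss.item())
```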
2401.08472 Report Instilling Multi-round Thinking to Text-guided Image Generation Lidong Zeng, Zhedong Zheng, Yinwei Wei, Tat-seng Chua This paper delves into the text-guided image editing task, focusing on modifying a reference image according to user-specified textual feedback to embody specific attributes. Despite recent advancements, a persistent challenge remains that the single-round generation often overlooks crucial details, particularly in the realm of fine-grained changes like shoes or sleeves. This issue compounds over multiple rounds of interaction, severely limiting customization quality. In an attempt to address this challenge, we introduce a new self-supervised regularization, i.e., multi-round regularization, which is compatible with existing methods. Specifically, the multi-round regularization encourages the model to maintain consistency across different modification orders. It builds upon the observation that the modification order generally should not affect the final result. Different from traditional one-round generation, the mechanism underpinning the proposed method is the error amplification of initially minor inaccuracies in capturing intricate details. Qualitative and quantitative experiments affirm that the proposed method achieves high-fidelity editing quality, especially the local modification, in both single-round and multiple-round generation, while also showcasing robust generalization to irregular text inputs. The effectiveness of our semantic alignment with textual feedback is further substantiated by the retrieval improvements on FashionIQ and Fashion200k. This paper proposes a novel self-supervised regularization method for text-guided image editing, enhancing the consistency and accuracy of multi-round generation, particularly for fine-grained details. Existing single-round generation methods often miss crucial details, especially in multi-round interactions, limiting the quality of customization. The proposed method encourages consistency across different modification orders by optimizing error accumulation through a novel multi-round regularization loss. The approach leverages a pre-trained diffusion model and CLIP encoders for text and image representations. The method demonstrates superior performance on FashionIQ and Fashion200k datasets in terms of both visual quality (FID) and semantic alignment (CLIP Score, Recall@K). It exhibits robust generalization to ill-formed text inputs, including swapped sentence order, rotated word order, and masked words. The proposed approach effectively captures fine-grained details and maintains consistency across multiple rounds of generation. The model relies on pre-trained encoders and diffusion models, potentially limiting its flexibility in handling novel concepts. The current work primarily focuses on two-round consistency; future work could explore extending it to a greater number of rounds. image editing, text guidance, multi-round thinking, self-supervised learning, fine-grained generation
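The order-consistency idea can be sketched directly: apply two edits in both possible orders and penalize the gap between the outcomes. `edit` below is a hypothetical stand-in for one round of a text-guided editor; the toy editor only makes the sketch runnable.

```python
import numpy as np

def multi_round_regularizer(image, text_a, text_b, edit):
    # The final result should not depend on the order of the two modifications.
    out_ab = edit(edit(image, text_a), text_b)
    out_ba = edit(edit(image, text_b), text_a)
    return np.mean((out_ab - out_ba) ** 2)

# Toy editor: adds a deterministic perturbation derived from the text.
toy_edit = lambda img, txt: img + (len(txt) % 7) * 0.01
reg = multi_round_regularizer(np.zeros((8, 8, 3)), "red sleeves", "white shoes", toy_edit)
print(reg)  # 0.0 for this commutative toy editor; non-zero gaps are what get minimized
```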
2401.08392 Report DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang Recent LLM-driven visual agents mainly focus on solving image-based tasks, which limits their ability to understand dynamic scenes, making it far from real-life applications like guiding students in laboratory experiments and identifying their mistakes. Hence, this paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Considering the video modality better reflects the ever-changing nature of real-world scenarios, we exemplify DoraemonGPT as a video agent. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. This structured representation allows for spatial-temporal querying and reasoning by well-designed sub-task tools, resulting in concise intermediate results. Recognizing that LLMs have limited internal knowledge when it comes to specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to assess external knowledge and address tasks across different domains. Moreover, a novel LLM-driven planner based on Monte Carlo Tree Search is introduced to explore the large planning space for scheduling various tools. The planner iteratively finds feasible solutions by backpropagating the result's reward, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios. The code will be released at https://github.com/z-x-yang/DoraemonGPT. This paper presents DoraemonGPT, an LLM-driven system for understanding dynamic scenes, exemplified as a video agent that can understand and reason about videos. Understanding dynamic scenes is crucial for real-life applications of AI, such as guiding students in lab experiments or analyzing surveillance footage, which current image-based LLM agents struggle with. DoraemonGPT converts videos into a symbolic memory (space-dominant and time-dominant), utilizes sub-task tools for spatial-temporal reasoning, incorporates external knowledge tools, and employs an LLM-driven MCTS planner to explore solutions. DoraemonGPT outperforms state-of-the-art LLM-driven agents on video question answering (NExT-QA, TVQA+) and referring object segmentation (Ref-YouTube-VOS). The MCTS planner effectively explores the solution space, leading to more accurate and comprehensive answers compared to greedy search methods. DoraemonGPT demonstrates its ability to handle complex, in-the-wild scenarios, including checking experimental operations, video understanding, and video editing. The current design of memory types relies on heuristics and lacks an automated approach. The performance of DoraemonGPT is inherently tied to the capabilities and limitations of the foundation models it employs. large language models, video understanding, dynamic scene understanding, visual reasoning, monte carlo tree search
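The MCTS-style planning over tool sequences can be illustrated with a compact, generic implementation. The tool names, fixed plan length, and reward function below are illustrative assumptions, not the paper's planner.

```python
import math, random

TOOLS = ["detect", "track", "caption", "knowledge", "answer"]
MAX_LEN = 3

def mcts(evaluate, iters=200, c=1.4):
    # stats[plan_prefix] = [visits, total_reward]
    stats = {(): [0, 0.0]}
    for _ in range(iters):
        plan = ()
        # Selection/expansion: walk down the tree, preferring high-UCB children.
        while len(plan) < MAX_LEN:
            children = [plan + (t,) for t in TOOLS]
            unvisited = [ch for ch in children if ch not in stats]
            if unvisited:
                plan = random.choice(unvisited)
                stats[plan] = [0, 0.0]
                break
            total = sum(stats[ch][0] for ch in children)
            plan = max(children, key=lambda ch: stats[ch][1] / stats[ch][0]
                       + c * math.sqrt(math.log(total) / stats[ch][0]))
        # Rollout: complete the plan randomly, then score it.
        rollout = list(plan) + random.choices(TOOLS, k=MAX_LEN - len(plan))
        reward = evaluate(rollout)
        # Backpropagation along the selected prefix.
        for i in range(len(plan) + 1):
            node = stats[plan[:i]]
            node[0] += 1
            node[1] += reward
    best = max((p for p in stats if len(p) == MAX_LEN),
               key=lambda p: stats[p][1] / stats[p][0])
    return list(best)

# Toy reward: prefer plans that gather visual evidence before answering.
toy_eval = lambda plan: 1.0 if plan[-1] == "answer" and "detect" in plan else 0.1
print(mcts(toy_eval))
```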
2401.08276 Report AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception Yipo Huang, Quan Yuan, Xiangfei Sheng, Zhichao Yang, Haoning Wu, Pengfei Chen, Yuzhe Yang, Leida Li, Weisi Lin With collective endeavors, multimodal large language models (MLLMs) are undergoing a flourishing development. However, their performances on image aesthetics perception remain indeterminate, which is highly desired in real-world applications. An obvious obstacle lies in the absence of a specific benchmark to evaluate the effectiveness of MLLMs on aesthetic perception. This blind groping may impede the further development of more advanced MLLMs with aesthetic perception capacity. To address this dilemma, we propose AesBench, an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs through elaborate design across dual facets. (1) We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts. (2) We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI). Extensive experimental results underscore that the current MLLMs only possess rudimentary aesthetic perception ability, and there is still a significant gap between MLLMs and humans. We hope this work can inspire the community to engage in deeper explorations on the aesthetic potentials of MLLMs. Source data will be available at https://github.com/yipoh/AesBench. This paper introduces AesBench, an expert-designed benchmark to comprehensively evaluate the aesthetic perception abilities of Multimodal Large Language Models (MLLMs). The effectiveness of MLLMs on image aesthetics perception, a crucial aspect in various real-world applications, remains underexplored. This benchmark aims to systematically evaluate and potentially guide the development of MLLMs with enhanced aesthetic perception capabilities. AesBench encompasses two key components: (1) EAPD, a high-quality dataset with diverse images and expert annotations covering aesthetic attributes, emotional responses, quality assessments, and interpretations. (2) A four-dimensional evaluation framework based on Perception, Empathy, Assessment, and Interpretation, with criteria designed to assess MLLMs' understanding and reasoning about image aesthetics. Current MLLMs demonstrate limited aesthetic perception abilities, showing a significant gap compared to human performance. AesBench effectively differentiates the aesthetic perception capabilities across various MLLMs, with Q-Instruct, GPT-4V, and Gemini Pro Vision showcasing relatively better performance. MLLMs struggle particularly with aesthetic interpretation, often exhibiting hallucinations and lacking precision in their reasoning. The study is limited by the reliance on GPT-assisted evaluation for certain tasks due to the open-ended nature of responses. Future work can explore expanding the dataset with more diverse image styles and cultural contexts to enhance the generalizability of the benchmark. multimodal large language models, image aesthetics perception, benchmarking, expert annotations, aesthetic interpretation
2401.08100 Report KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain Anh-Cuong Pham, Van-Quang Nguyen, Thi-Hong Vuong, Quang-Thuy Ha Image captioning is a crucial task with applications in a wide range of domains, including healthcare and education. Despite extensive research on English image captioning datasets, the availability of such datasets for Vietnamese remains limited, with only two existing datasets. In this study, we introduce KTVIC, a comprehensive Vietnamese Image Captioning dataset focused on the life domain, covering a wide range of daily activities. This dataset comprises 4,327 images and 21,635 Vietnamese captions, serving as a valuable resource for advancing image captioning in the Vietnamese language. We conduct experiments using various deep neural networks as the baselines on our dataset, evaluating them using the standard image captioning metrics, including BLEU, METEOR, CIDEr, and ROUGE. Our findings underscore the effectiveness of the proposed dataset and its potential contributions to the field of image captioning in the Vietnamese context. This paper introduces KTVIC, a novel Vietnamese image captioning dataset focused on daily life activities, featuring 4,327 images with 5 captions each (totaling 21,635 captions). Existing Vietnamese image captioning datasets are limited, hindering research in this domain. KTVIC addresses this gap by providing a comprehensive resource for advancing Vietnamese image captioning. KTVIC leverages images from the UIT-EVJVQA dataset, annotating each with 5 captions following established guidelines. Three baseline models (CNN-LSTM, ViT-Transformer, GRIT) are evaluated on the dataset. KTVIC proves effective, enabling all baseline models to generate meaningful Vietnamese captions. Transformer-based models (ViT-Transformer, GRIT) outperform the CNN-LSTM model, highlighting the strength of Transformers in this task. GRIT, utilizing both grid and region features, achieves the best performance, demonstrating the effectiveness of this approach for Vietnamese image captioning. All baselines are fine-tuned using cross-entropy loss without further optimization techniques like CIDEr-D. Future work can explore more advanced architectures and optimization strategies to further enhance Vietnamese image captioning performance. vietnamese image captioning, image captioning dataset, deep neural networks, computer vision, natural language processing
2401.08053 Report SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation Zhixuan Liu, Peter Schaldenbrand, Beverley-Claire Okogwu, Wenxuan Peng, Youngsik Yun, Andrew Hundt, Jihie Kim, Jean Oh Accurate representation in media is known to improve the well-being of the people who consume it. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with harmful stereotypes and misrepresentations of cultures. We improve inclusive representation in generated images by (1) engaging with communities to collect a culturally representative dataset that we call the Cross-Cultural Understanding Benchmark (CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT) method that leverages the model's known biases to self-improve. SCoFT is designed to prevent overfitting on small datasets, encode only high-level information from the data, and shift the generated distribution away from misrepresentations encoded in a pretrained model. Our user study conducted on 51 participants from 5 different countries based on their self-selected national cultural affiliation shows that fine-tuning on CCUB consistently generates images with higher cultural relevance and fewer stereotypes when compared to the Stable Diffusion baseline, which is further improved with our SCoFT technique. This paper introduces SCoFT, a novel fine-tuning method for pre-trained text-to-image models to improve the cultural representation in generated images and reduce harmful stereotypes. Accurate representation in media is crucial for well-being and understanding of diverse cultures. Existing models trained on large, unfiltered datasets often perpetuate harmful stereotypes and misrepresent cultures. The authors collected CCUB, a culturally representative dataset with images and captions. They then proposed SCoFT, a self-contrastive fine-tuning technique that leverages pre-trained model's biases by using its generated images as negative examples and CCUB images as positive examples during training. Fine-tuning on CCUB significantly reduces offensiveness and increases cultural relevance in generated images compared to the baseline Stable Diffusion model. SCoFT further enhances these improvements by leveraging perceptual loss and a novel self-contrastive approach. User studies with participants from diverse cultures confirm SCoFT’s effectiveness in generating more culturally representative and less stereotypical images. The current approach primarily focuses on generating accurate images within a specific cultural context, with future work exploring the generation of diverse images for generic prompts. While CCUB was curated by cultural experts, more rigorous verification methods could be employed to further enhance the dataset's quality. culturally-aware image synthesis, text-to-image generation, stereotype mitigation, fine-tuning, contrastive learning
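The self-contrastive idea, treating the pretrained model's own generations as negatives and curated CCUB images as positives, can be sketched as a triplet-style loss over image features. The feature extractor, margin, and exact loss form are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def self_contrastive_loss(gen_feat, positive_feat, negative_feat, margin=0.2):
    # Be closer to the curated positive than to the frozen base model's
    # (potentially stereotyped) negative, in cosine-similarity terms.
    pos = F.cosine_similarity(gen_feat, positive_feat, dim=-1)
    neg = F.cosine_similarity(gen_feat, negative_feat, dim=-1)
    return F.relu(neg - pos + margin).mean()

feat = lambda: torch.randn(4, 512)   # stand-in for perceptual features of images
loss = self_contrastive_loss(feat(), feat(), feat())
print(loss.item())
```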
2401.07781 Report Towards A Better Metric for Text-to-Video Generation Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore on offering a better metric for text-to-video generation. This paper introduces T2VScore, a novel automatic evaluation metric for text-to-video generation, assessing both text-video alignment and video quality. Existing automated metrics for evaluating text-to-video models fall short in capturing temporal aspects and often misalign with human perception. T2VScore comprises two metrics: T2VScore-A, using vision-language models for text-video alignment assessment via question answering, and T2VScore-Q, employing a mix-of-experts approach combining technical and semantic quality evaluation for video quality assessment. The authors further present the TVGE dataset with human judgments on alignment and quality for benchmarking. T2VScore demonstrates superior correlation with human judgments compared to baseline metrics on the TVGE dataset. Auxiliary trajectory information significantly enhances temporal understanding for evaluating text-video alignment. The proposed adaptation strategy effectively generalizes T2VScore-Q to unseen text-to-video models. T2VScore-A's performance relies on the capabilities of multimodal large language models, which are still under development. The TVGE dataset will be continuously expanded with more open-source text-to-video models. text-to-video generation, evaluation metric, video quality assessment, text-video alignment, multimodal large language models
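At a high level, a two-part score of this kind combines an alignment term (for example, the fraction of prompt-derived questions a VQA model answers correctly on the video) with a quality term. The equal weighting below is an assumption for illustration, not the paper's exact formulation.

```python
def t2v_score(qa_results, quality_score, w_align=0.5, w_quality=0.5):
    # qa_results: list of 0/1 outcomes from question answering on the video;
    # quality_score: a video-quality score already normalized to [0, 1].
    alignment = sum(qa_results) / max(len(qa_results), 1)
    return w_align * alignment + w_quality * quality_score

print(t2v_score([1, 1, 0, 1], quality_score=0.8))  # 0.775
```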
2401.07770 Report Seeing the Unseen: Visual Common Sense for Semantic Placement Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao Zeng, Luca Weihs Computer vision tasks typically involve describing what is present in an image (e.g. classification, detection, segmentation, and captioning). We study a visual common sense task that requires understanding what is not present. Specifically, given an image (e.g. of a living room) and name of an object ("cushion"), a vision system is asked to predict semantically-meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans (e.g. on the sofa). We call this task: Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house), and AR devices (automatically rendering an object in the user's space). Studying the invisible is hard. Datasets for image description are typically constructed by curating relevant images and asking humans to annotate the contents of the image; neither of those two steps is straightforward for objects not present in the image. We overcome this challenge by operating in the opposite direction: we start with an image of an object in context from the web, and then remove that object from the image via inpainting. This automated pipeline converts unstructured web data into a dataset comprising pairs of images with/without the object. Using this, we collect a novel dataset, with ~1.3M images across 9 object categories, and train a SP prediction model called CLIP-UNet. CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors with object detectors on real-world and simulated images. In our user studies, we find that the SP masks predicted by CLIP-UNet are favored 43.7% and 31.3% of the time when compared against the 4 SP baselines on real and simulated images. In addition, we demonstrate leveraging SP mask predictions from CLIP-UNet enables downstream applications like building tidying robots in indoor environments. This paper introduces Semantic Placement (SP), a novel task where a vision system predicts a binary mask highlighting semantically meaningful regions for placing a given object in an image. SP is crucial for applications like assistive robots, AR devices, and visually-grounded chatbots, requiring common-sense visual understanding of plausible object placements. The authors propose an automated data pipeline leveraging inpainting and object detection to generate a large-scale dataset of images with and without objects. They then train a CLIP-UNet model, combining a CLIP backbone with a language-conditioned UNet decoder, to predict SP masks. CLIP-UNet outperforms baselines combining LLMs with object detectors and VLM baselines (LLaVa, GPT4V) on SP prediction. Human studies show strong preference for CLIP-UNet's SP mask predictions over other baselines. The predicted SP masks enable a robot to perform an Embodied Semantic Placement (ESP) task in a simulated environment, demonstrating downstream applicability. The approach is limited by the performance of open-vocabulary detectors, segmentation models, and inpainting models used in data generation, which can introduce biases. Zero-shot deployment for tasks like ESP can result in predictions not feasible for the robot's physical capabilities, necessitating embodiment-aware finetuning. semantic placement, computer vision, vision and language, robotics, common sense reasoning
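The inpainting-based data pipeline can be sketched as a small function: detect the object, erase it, and keep the vacated region as the supervision mask. `detect_object` and `inpaint` are hypothetical placeholders for an open-vocabulary segmenter and an inpainting model; the toy stand-ins only make the sketch runnable.

```python
import numpy as np

def make_semantic_placement_pair(image, category, detect_object, inpaint):
    mask = detect_object(image, category)        # where the object currently is
    if mask is None:
        return None                              # skip images without this object
    image_without = inpaint(image, mask)         # erase the object from the scene
    # Training input: the object-free image plus the category name;
    # training target: the vacated region as the placement mask.
    return {"image": image_without, "category": category, "target_mask": mask}

# Toy stand-ins (real pipeline: open-vocabulary segmenter + inpainting model).
toy_detect = lambda img, cat: img.sum(axis=-1) > 1.5          # "bright" pixels as object
toy_inpaint = lambda img, m: np.where(m[..., None], img.mean(axis=(0, 1)), img)
sample = make_semantic_placement_pair(np.random.rand(64, 64, 3), "cushion",
                                      toy_detect, toy_inpaint)
print(sample["target_mask"].shape)
```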
2401.07727 Report HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation Antoine Mercier, Ramin Nakhli, Mahesh Reddy, Rajeev Yasarla, Hong Cai, Fatih Porikli, Guillaume Berger Despite the latest remarkable advances in generative modeling, efficient generation of high-quality 3D assets from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach which harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict 6 orthographic projections and the corresponding latent triplane. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization, and can infer high-quality and diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs when comparing to existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions. HexaGen3D is a novel text-to-3D model that generates textured meshes from text prompts in 7 seconds, leveraging pretrained text-to-image diffusion models. Efficient generation of high-quality 3D assets from text is crucial for various industries but remains challenging due to data scarcity. Existing methods are either slow or lack quality/diversity. HexaGen3D finetunes a pretrained text-to-image model to predict six orthographic projections (hexaview), then maps these to a triplanar latent representation, finally decoded into a textured mesh. It introduces 'orthographic hexaview guidance' for 3D consistency and uses a novel layout converter for hexaview-to-triplane mapping. HexaGen3D achieves competitive quality to state-of-the-art methods while being significantly faster (7 seconds vs. 20 minutes to 3 hours). It demonstrates superior diversity across generated samples compared to methods like DreamFusion and MVDream. The approach shows strong generalization to unseen objects and compositions. Generated meshes can occasionally exhibit box artifacts or struggle with intricate structures. Future work will focus on refining the VAE pipeline and exploring the impact of larger 3D datasets. text-to-3d, diffusion models, generative models, 3d asset creation, multi-view synthesis
2401.07709 Report Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks Siyu Zou, Jiji Tang, Yiyi Zhou, Jing He, Chaoyi Zhao, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun Diffusion-based Image Editing (DIE) is an emerging research hot-spot, which often applies a semantic mask to control the target area for diffusion-based editing. However, most existing solutions obtain these masks via manual operations or off-line processing, greatly reducing their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of attention maps and realize full automation, we equip InstDiffEdit with a training-free refinement scheme to adaptively aggregate the attention distributions for automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen, and compare it with a number of SOTA methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., +5 to +6 times. This paper proposes InstDiffEdit, a novel and efficient image editing method for text-to-image diffusion models that uses cross-modal attention for instant mask guidance during diffusion steps. Existing diffusion-based image editing methods often rely on manual or offline mask generation, limiting their efficiency. InstDiffEdit aims to automate this process and improve speed. InstDiffEdit leverages the cross-modal attention maps within diffusion models to generate masks instantly. It incorporates a training-free refinement scheme to reduce noise and adaptively aggregate attention distributions for accurate mask generation. Finally, it uses the generated mask for inpainting-based editing, ensuring global semantic consistency. InstDiffEdit achieves state-of-the-art performance on ImageNet and Imagen datasets, demonstrating a superior trade-off between computation efficiency and generation quality. Compared to the current leading method, DiffEdit, InstDiffEdit achieves 5 to 6 times faster inference speed while producing better masks and editing results. A new benchmark called Editing-Mask, containing 200 images with human-labeled masks, is introduced to evaluate the local editing ability and mask accuracy of different methods, further confirming the superiority of InstDiffEdit in background preservation. The performance of InstDiffEdit might be affected by the complexity and quality of input images and text prompts. Further exploration of more sophisticated refinement techniques for attention maps could potentially lead to even better mask accuracy and editing results. Future work could investigate extending InstDiffEdit to other diffusion model architectures beyond Stable Diffusion. image editing, diffusion models, text-to-image synthesis, cross-modal attention, semantic image manipulation
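Turning a cross-attention map into an editing mask can be illustrated with a simple normalize-and-threshold step. The fixed threshold is an assumption for illustration; the method itself uses a training-free, adaptive refinement of the attention distributions.

```python
import numpy as np

def attention_to_mask(attn_map, threshold=0.5):
    """attn_map: (H, W) cross-attention weights for the target text token."""
    a = attn_map - attn_map.min()
    a = a / (a.max() + 1e-8)          # normalize to [0, 1]
    return (a >= threshold).astype(np.float32)

mask = attention_to_mask(np.random.rand(16, 16))
print(mask.mean())                    # fraction of the latent grid that will be edited
```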
2401.07519 Report InstantID: Zero-shot Identity-Preserving Generation in Seconds Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, Yao Hu There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID. Introduces InstantID, a plug-and-play module for pre-trained text-to-image diffusion models, enabling zero-shot identity-preserving image generation using a single facial image. Addresses limitations of existing methods that are either resource-intensive, require multiple reference images, or lack fidelity in preserving facial details. Combines an ID embedding from a pre-trained face model with a lightweight image adapter and a novel IdentityNet for encoding detailed facial features with spatial control. Achieves high fidelity in ID preservation with a single reference image, surpassing or matching the performance of training-based methods. Maintains text editing capabilities of the original diffusion model, allowing for style variations while preserving identity. Demonstrates compatibility with existing ControlNet models for spatial control and seamlessly integrates with pre-trained models like SD1.5 and SDXL. Facial attribute features are highly coupled in the ID embedding, limiting flexibility in face editing. Potential biases from the pre-trained face model might impact the generated images. image synthesis, identity preservation, diffusion models, zero-shot learning, image customization
2401.06994 Report UniVision: A Unified Framework for Vision-Centric 3D Perception Yu Hong, Qian Liu, Huayuan Cheng, Danjiao Ma, Hang Dai, Yu Wang, Guangzhi Cao, Yong Ding The past few years have witnessed the rapid development of vision-centric 3D perception in autonomous driving. Although the 3D perception models share many structural and conceptual similarities, there still exist gaps in their feature representations, data formats, and objectives, posing challenges for unified and efficient 3D perception framework design. In this paper, we present UniVision, a simple and efficient framework that unifies two major tasks in vision-centric 3D perception, i.e., occupancy prediction and object detection. Specifically, we propose an explicit-implicit view transform module for complementary 2D-3D feature transformation. We propose a local-global feature extraction and fusion module for efficient and adaptive voxel and BEV feature extraction, enhancement, and interaction. Further, we propose a joint occupancy-detection data augmentation strategy and a progressive loss weight adjustment strategy which enable efficient and stable multi-task framework training. We conduct extensive experiments for different perception tasks on four public benchmarks, including nuScenes LiDAR segmentation, nuScenes detection, OpenOccupancy, and Occ3D. UniVision achieves state-of-the-art results with +1.5 mIoU, +1.8 NDS, +1.5 mIoU, and +1.8 mIoU gains on each benchmark, respectively. We believe that the UniVision framework can serve as a high-performance baseline for the unified vision-centric 3D perception task. The code will be available at https://github.com/Cc-Hy/UniVision. UniVision, a simple and efficient framework unifying 3D object detection and occupancy prediction for vision-centric autonomous driving. Existing 3D perception models have gaps in feature representations, data formats, and objectives, making unified framework design challenging. UniVision utilizes an explicit-implicit view transform module, local-global feature extraction and fusion, joint occupancy-detection augmentation, and progressive loss weight adjustment. Achieves state-of-the-art results on nuScenes LiDAR segmentation, surpassing previous best by +1.5 mIoU. Outperforms state-of-the-art methods on nuScenes detection, with a significant +1.8 NDS gain. Sets new records on OpenOccupancy and Occ3D benchmarks with +1.5 mIoU and +1.8 mIoU gains respectively. Current version of UniVision does not incorporate temporal information. Joint augmentation strategy currently relies on sampling and interpolation, which might introduce artifacts. 3d object detection, occupancy prediction, autonomous driving, vision-centric perception, multi-task learning
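A progressive loss-weight schedule of the kind mentioned above can be as simple as a linear ramp on one task's weight over the first epochs, so multi-task training starts stable and gradually balances both objectives. The ramp shape and the specific weights below are illustrative assumptions.

```python
def progressive_weight(epoch, warmup_epochs=6, final_weight=1.0):
    # Linearly ramp the weight from 0 to final_weight over the warmup epochs.
    return final_weight * min(1.0, epoch / warmup_epochs)

def total_loss(loss_det, loss_occ, epoch):
    # Detection loss at full weight from the start; occupancy loss ramped in.
    return loss_det + progressive_weight(epoch) * loss_occ

print([round(progressive_weight(e), 2) for e in range(9)])
print(total_loss(loss_det=1.0, loss_occ=2.0, epoch=3))   # 2.0 at the ramp's midpoint
```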
2401.06805 Report Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning. This paper presents a comprehensive survey of Multimodal Large Language Models (MLLMs) focusing on their reasoning capabilities. Reasoning is a key aspect of intelligence and crucial for developing AGI, making the investigation of MLLMs' reasoning abilities essential. The paper reviews definitions and protocols for evaluating reasoning, different types of reasoning tasks, MLLM architectures, and the role of instruction tuning in facilitating reasoning. It analyzes the performance of various MLLMs on established benchmarks. MLLMs struggle with complex reasoning tasks requiring multi-step inference and domain knowledge. Instruction tuning significantly improves MLLMs' reasoning capabilities but faces challenges in multimodal prompting. Top-performing open-source MLLMs often employ three-stage training, leverage multi-task supervised learning, and benefit from improved visual representations. The analysis primarily focuses on top-performing models and a subset of reasoning-focused benchmarks, limiting the generalizability of findings. Future research should address limitations in MLLM architectures, develop efficient training methods, and create more comprehensive evaluation benchmarks, particularly for long-context and multi-round conversational scenarios. multimodal reasoning, multimodal large language models, instruction tuning, benchmark analysis, future directions
2401.06704 Report Scalable 3D Panoptic Segmentation As Superpoint Graph Clustering Damien Robert, Hugo Raguet, Loic Landrieu We introduce a highly efficient method for panoptic segmentation of large 3D point clouds by redefining this task as a scalable graph clustering problem. This approach can be trained using only local auxiliary tasks, thereby eliminating the resource-intensive instance-matching step during training. Moreover, our formulation can easily be adapted to the superpoint paradigm, further increasing its efficiency. This allows our model to process scenes with millions of points and thousands of objects in a single inference. Our method, called SuperCluster, achieves a new state-of-the-art panoptic segmentation performance for two indoor scanning datasets: 50.1 PQ (+7.8) for S3DIS Area 5, and 58.7 PQ (+25.2) for ScanNetV2. We also set the first state-of-the-art for two large-scale mobile mapping benchmarks: KITTI-360 and DALES. With only 209k parameters, our model is over 30 times smaller than the best-competing method and trains up to 15 times faster. Our code and pretrained models are available at https://github.com/drprojects/superpoint_transformer. SuperCluster, a novel method for efficient and scalable 3D panoptic segmentation of large point clouds, redefines the task as a graph clustering problem. Large-scale 3D environment understanding is crucial for various applications like 'digital twins' and city digitization, requiring scalable models to process massive point clouds and identify objects. The method uses a neural network to predict semantic classes and object agreement for points (or superpoints). These predictions are used as parameters in a graph clustering problem, grouping points into object instances. Crucially, the model is trained using only local auxiliary tasks, eliminating computationally expensive instance matching during training. SuperCluster achieves state-of-the-art panoptic segmentation on S3DIS and ScanNet, significantly outperforming previous methods. It sets the first panoptic segmentation benchmark for large-scale datasets KITTI-360 and DALES. SuperCluster is extremely efficient, using a small network and training up to 15 times faster than competitors. The graph clustering function is non-differentiable, preventing end-to-end learning. Superpoint partitioning can be sensitive to low point density in sparse scans. 3d panoptic segmentation, graph clustering, point cloud processing, large-scale 3d, superpoints
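The clustering view can be illustrated with a greedy union-find toy: merge adjacent superpoints when they share a semantic class and the predicted "same object" agreement on the connecting edge is high. The real method solves a scalable graph-optimization problem; this sketch only shows the data flow.

```python
def panoptic_clusters(labels, edges, agreement, thresh=0.5):
    parent = list(range(len(labels)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for (i, j), score in zip(edges, agreement):
        if labels[i] == labels[j] and score > thresh:
            parent[find(i)] = find(j)       # merge the two superpoints
    return [find(i) for i in range(len(labels))]

labels = [0, 0, 0, 1, 1]                    # semantic class per superpoint
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]    # adjacency in the superpoint graph
agreement = [0.9, 0.8, 0.9, 0.2]            # predicted same-instance scores per edge
print(panoptic_clusters(labels, edges, agreement))   # [2, 2, 2, 3, 4]
```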
2401.06637 Report Adversarial Examples are Misaligned in Diffusion Model Manifolds Peter Lorenz, Ricard Durall, Janis Keuper In recent years, diffusion models (DMs) have drawn significant attention for their success in approximating data distributions, yielding state-of-the-art generative results. Nevertheless, the versatility of these models extends beyond their generative capabilities to encompass various vision applications, such as image inpainting, segmentation, adversarial robustness, among others. This study is dedicated to the investigation of adversarial attacks through the lens of diffusion models. However, our objective does not involve enhancing the adversarial robustness of image classifiers. Instead, our focus lies in utilizing the diffusion model to detect and analyze the anomalies introduced by these attacks on images. To that end, we systematically examine the alignment of the distributions of adversarial examples when subjected to the process of transformation using diffusion models. The efficacy of this approach is assessed across CIFAR-10 and ImageNet datasets, including varying image sizes in the latter. The results demonstrate a notable capacity to discriminate effectively between benign and attacked images, providing compelling evidence that adversarial instances do not align with the learned manifold of the DMs. This paper presents a novel method for detecting adversarial examples in images using diffusion models (DMs). The key idea is to leverage the DM's ability to learn the manifold of natural images and exploit the fact that adversarial examples often lie outside this manifold. This results in distinct patterns in transformed adversarial images, which can be learned by a simple binary classifier. Detecting adversarial examples is crucial for deploying deep learning models in security-sensitive applications, as these examples can lead to misclassifications and system vulnerabilities. Existing defense mechanisms often struggle with high-resolution images and adaptive attacks. This method offers a new approach to address this challenge by using the transformative capabilities of DMs. The method involves the following steps: 1) Apply a pre-trained DM to transform both benign and adversarial images using the inversion and reverse process. 2) Train a binary classifier (ResNet-50 or ResNet-18) on the transformed images to distinguish between adversarial and benign samples. The method achieves high detection accuracy (AUC, ACC > 95%) across various white-box and black-box attacks on CIFAR-10 and ImageNet datasets, including high-resolution images (512x512 pixels). Analysis of the transformed images suggests that adversarial perturbations introduce detectable patterns in the DM's reverse process, even after multiple transformations. While the method shows promising results, its transferability to unseen attacks is limited, suggesting it acts as a complementary defense mechanism rather than a standalone solution. The method's reliance on pre-trained DMs limits its effectiveness against adaptive attacks that modify their strategies during test time. The transferability of the learned patterns to unseen attacks is limited, necessitating further research on improving generalization. adversarial examples, diffusion models, adversarial detection, image classification, deep learning
2401.06614 Report Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking Wei Cao, Chang Luo, Biao Zhang, Matthias Nießner, Jiapeng Tang We introduce Motion2VecSets, a 4D diffusion model for dynamic surface reconstruction from point cloud sequences. While existing state-of-the-art methods have demonstrated success in reconstructing non-rigid objects using neural field representations, conventional feed-forward networks encounter challenges with ambiguous observations from noisy, partial, or sparse point clouds. To address these challenges, we introduce a diffusion model that explicitly learns the shape and motion distribution of non-rigid objects through an iterative denoising process of compressed latent representations. The diffusion-based priors enable more plausible and probabilistic reconstructions when handling ambiguous inputs. We parameterize 4D dynamics with latent sets instead of using global latent codes. This novel 4D representation allows us to learn local shape and deformation patterns, leading to more accurate non-linear motion capture and significantly improving generalizability to unseen motions and identities. For more temporally-coherent object tracking, we synchronously denoise deformation latent sets and exchange information across multiple frames. To avoid computational overhead, we designed an interleaved space and time attention block to alternately aggregate deformation latents along spatial and temporal domains. Extensive comparisons against state-of-the-art methods demonstrate the superiority of our Motion2VecSets in 4D reconstruction from various imperfect observations. More detailed information can be found at https://vveicao.github.io/projects/Motion2VecSets/. Motion2VecSets, a 4D diffusion model for dynamic surface reconstruction from sparse, noisy, or partial point cloud sequences. Existing feed-forward networks struggle with ambiguous observations from imperfect point clouds, and conventional 4D representations fail to capture accurate shape and motion priors. The model leverages a two-stage approach, first learning shape and deformation priors with autoencoders and then utilizing these priors in a diffusion model to reconstruct 4D surfaces. It employs latent sets for shape and deformation, enabling local representation, and an interleaved spatio-temporal attention mechanism for efficient and temporally consistent diffusion. Motion2VecSets reconstructs more plausible and accurate surfaces compared to previous state-of-the-art methods, especially in challenging scenarios with sparse or partial inputs. The model exhibits superior generalization ability to unseen motions and object identities, thanks to the local representation power of latent sets. Synchronized diffusion of deformation latent sets, facilitated by the interleaved spatio-temporal attention mechanism, ensures robust temporal coherence in reconstructed 4D surfaces. The current implementation has a relatively long inference time, limiting real-time applications. The model's focus on single-view reconstruction could be extended to multi-view scenarios for more comprehensive 4D capture. 4d reconstruction, diffusion model, dynamic surface, point cloud sequences, latent sets
2401.06578 Report 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model Qian Wang, Weiqi Li, Chong Mou, Xinhua Cheng, Jian Zhang Panorama video recently attracts more interest in both study and application, courtesy of its immersive experience. Due to the expensive cost of capturing 360-degree panoramic videos, generating desirable panorama videos by prompts is urgently required. Lately, the emerging text-to-video (T2V) diffusion methods demonstrate notable effectiveness in standard video generation. However, due to the significant gap in content and motion patterns between panoramic and standard videos, these methods encounter challenges in yielding satisfactory 360-degree panoramic videos. In this paper, we propose a pipeline named 360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic videos based on the given prompts and motion conditions. Specifically, we introduce a lightweight 360-Adapter accompanied by 360 Enhancement Techniques to transform pre-trained T2V models for panorama video generation. We further propose a new panorama dataset named WEB360 consisting of panoramic video-text pairs for training 360DVD, addressing the absence of captioned panoramic video datasets. Extensive experiments demonstrate the superiority and effectiveness of 360DVD for panorama video generation. Our project page is at https://akaneqwq.github.io/360DVD/. Introduces 360DVD, a controllable 360-degree panorama video generation diffusion model, by adapting a standard T2V model with a lightweight 360-Adapter. Existing text-to-video (T2V) diffusion models struggle to generate satisfactory 360-degree panoramic videos due to the distinct content and motion patterns compared to standard videos. This necessitates a dedicated approach. Leverages a pre-trained denoising U-Net with a trainable 360-Adapter to capture panoramic characteristics. Employs 360 Enhancement Techniques, including a latitude-aware loss and mechanisms for continuity, to enhance quality. Introduces WEB360, a new dataset of panoramic videos with detailed captions using a GPT-based 360 Text Fusion module. Generates text-aligned and coherent 360-degree panorama videos with high quality and diverse styles. Successfully incorporates motion guidance, enabling control over video dynamics. Outperforms baseline methods in terms of graphics quality, frame consistency, and adherence to panoramic video characteristics based on user studies. Performance relies on the underlying T2V model, limiting capabilities due to frozen parameters during training. Reliance on predicted motion conditions from a panoramic optical flow estimator introduces limitations due to the estimator's performance. Future work includes exploring the use of other motion conditions such as depth maps and expanding control beyond optical flow. panorama video generation, text-to-video synthesis, diffusion models, 360-degree videos, motion guidance
2401.06442 Report RotationDrag: Point-based Image Editing with Rotated Diffusion Features Minxing Luo, Wentao Cheng, Jian Yang A precise and user-friendly manipulation of image content while preserving image fidelity has always been crucial to the field of image editing. Thanks to the power of generative models, recent point-based image editing methods allow users to interactively change the image content with high generalizability by clicking several control points. But the above-mentioned editing process is usually based on the assumption that features stay constant in the motion supervision step from initial to target points. In this work, we conduct a comprehensive investigation in the feature space of diffusion models, and find that features change acutely under in-plane rotation. Based on this, we propose a novel approach named RotationDrag, which significantly improves point-based image editing performance when users intend to in-plane rotate the image content. Our method tracks handle points more precisely by utilizing the feature map of the rotated images, thus ensuring precise optimization and high image fidelity. Furthermore, we build an in-plane-rotation-focused benchmark called RotateBench, the first benchmark to evaluate the performance of point-based image editing methods under the in-plane rotation scenario on both real images and generated images. A thorough user study demonstrates the superior capability in accomplishing in-plane rotation that users intend to achieve, compared to the DragDiffusion baseline and other existing diffusion-based methods. See the project page https://github.com/Tony-Lowe/RotationDrag for code and experiment results. RotationDrag, a novel point-based image editing method leveraging rotated diffusion features to enhance accuracy in image manipulation, particularly under in-plane rotation scenarios. Existing point-based editing methods assume feature constancy during motion supervision, leading to inaccurate edits, especially during rotations which are common in user edits. RotationDrag addresses this by using features from rotated images for precise point tracking. RotationDrag calculates rotation angles between initial and current handle points during optimization. It then rotates the input image accordingly and utilizes the feature map of this rotated image for accurate point tracking and motion supervision. RotationDrag demonstrates superior performance in rotating and dragging image content compared to DragDiffusion, FreeDrag, and SDE-Drag. A user study confirms RotationDrag's significantly better performance in achieving desired in-plane rotations. The paper introduces RotationBench, a new benchmark dataset focused on evaluating in-plane rotation in image editing. RotationDrag's reliance on repeated inversions during point tracking impacts its speed compared to DragDiffusion. Future work will explore improving Stable Diffusion's handling of rotations to potentially enhance speed. point-based image editing, stable diffusion, diffusion models, rotation invariance, image manipulation
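The rotation-aware tracking step can be sketched as: estimate the in-plane angle implied by the handle point's motion about a center, rotate the image by that angle, and look up features in the rotated copy rather than the original. `extract_features` is a hypothetical placeholder for the diffusion UNet feature map; the geometry and angle convention are illustrative.

```python
import numpy as np
from scipy.ndimage import rotate

def rotation_angle(center, initial_pt, current_pt):
    # Signed angle (degrees) swept by the handle point around the center.
    v0 = np.array(initial_pt) - np.array(center)
    v1 = np.array(current_pt) - np.array(center)
    return np.degrees(np.arctan2(v1[1], v1[0]) - np.arctan2(v0[1], v0[0]))

def rotated_feature_lookup(image, center, initial_pt, current_pt, extract_features):
    angle = rotation_angle(center, initial_pt, current_pt)
    rotated = rotate(image, angle, reshape=False)   # rotate about the image center
    return extract_features(rotated)

img = np.random.rand(64, 64)
feats = rotated_feature_lookup(img, (32, 32), (40, 32), (32, 40), lambda x: x)
print(round(rotation_angle((32, 32), (40, 32), (32, 40)), 1))   # 90.0 degrees
```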
2401.06345 Report Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering Chang Yu, Junran Peng, Xiangyu Zhu, Zhaoxiang Zhang, Qi Tian, Zhen Lei The text-to-image synthesis by diffusion models has recently shown remarkable performance in generating high-quality images. Although performs well for simple texts, the models may get confused when faced with complex texts that contain multiple objects or spatial relationships. To get the desired images, a feasible way is to manually adjust the textual descriptions, i.e., narrating the texts or adding some words, which is labor-consuming. In this paper, we propose a framework to learn the proper textual descriptions for diffusion models through prompt learning. By utilizing the quality guidance and the semantic guidance derived from the pre-trained diffusion model, our method can effectively learn the prompts to improve the matches between the input text and the generated images. Extensive experiments and analyses have validated the effectiveness of the proposed method. This paper introduces a novel framework leveraging prompt engineering to enhance text-to-image synthesis in diffusion models, specifically targeting improved accuracy for complex textual descriptions. Existing diffusion models often struggle to accurately synthesize images from complex text descriptions containing multiple objects or spatial relationships. This work addresses this limitation by learning appropriate textual prompts that guide the model to generate more accurate images. The proposed two-stage framework utilizes a pre-trained diffusion model. It first generates coarse and fine images from the input text. It then learns text-specific prompts guided by minimizing the difference between text/image embeddings and promoting consistency between the generated images and the input text, as well as sparsity in the learned prompts. The method effectively learns prompts that improve text-image matching and reduce artifacts in synthesized images. It outperforms existing methods like Composable Diffusion and Structure Diffusion in synthesizing images from both composable and relational text descriptions. Visualizations of cross-attention maps demonstrate that the learned prompts help the model focus on previously neglected objects or relationships in the text, leading to more accurate image generation. The method relies on pre-trained diffusion models and doesn't involve fine-tuning the model itself, which could potentially limit the extent of improvement. The paper primarily focuses on generating images from complex texts, and further investigation is needed to evaluate its efficacy in other text-to-image synthesis scenarios. text-to-image synthesis, diffusion models, prompt engineering, composable diffusion, relational text
2401.06341 Report AffordanceLLM: Grounding Affordance from Vision Language Models Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires the comprehensive understanding of a scene in multiple aspects including detection, localization, and recognition of objects with their parts, of geo-spatial configuration/layout of the scene, of 3D shapes and physics, as well as of the functionality and potential interaction of the objects and humans. Much of the knowledge is hidden and beyond the image content with the supervised labels from a limited training set. In this paper, we make an attempt to improve the generalization capability of the current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge from pretrained large-scale vision language models. Under the AGD20K benchmark, our proposed model demonstrates a significant performance gain over the competing methods for in-the-wild object affordance grounding. We further demonstrate it can ground affordance for objects from random Internet images, even if both objects and actions are unseen during training. Project site: https://jasonqsy.github.io/AffordanceLLM/ Introduces AffordanceLLM, a novel affordance grounding approach leveraging world knowledge from pretrained Vision Language Models (VLMs) to improve generalization to unseen objects. Affordance grounding, crucial for embodied AI tasks like robot manipulation, struggles to generalize to novel objects unseen during training. Extends a VLM (LLaVA) with a mask decoder to predict affordance maps from images and text prompts. Incorporates pseudodepth maps as input to enhance 3D understanding. Significantly outperforms state-of-the-art baselines on AGD20K, especially on a newly proposed 'Hard' split designed to test generalization. Demonstrates successful affordance grounding on random Internet images with novel objects and actions. Shows the importance of both appropriate text prompts and the visual grounding capability of the image encoder. Can struggle with ambiguous situations or scenes with multiple objects. Reliance on pretrained VLMs introduces potential biases and limitations. affordance grounding, vision language models, generalization, 3d understanding, robot manipulation
2401.06310 Report ViSAGe: A Global-Scale Analysis of Visual Stereotypes in Text-to-Image Generation Akshita Jha, Vinodkumar Prabhakaran, Remi Denton, Sarah Laszlo, Shachi Dave, Rida Qadri, Chandan K. Reddy, Sunipa Dev Recent studies have shown that Text-to-Image (T2I) model generations can reflect social stereotypes present in the real world. However, existing approaches for evaluating stereotypes have a noticeable lack of coverage of global identity groups and their associated stereotypes. To address this gap, we introduce the ViSAGe (Visual Stereotypes Around the Globe) dataset to enable the evaluation of known nationality-based stereotypes in T2I models, across 135 nationalities. We enrich an existing textual stereotype resource by distinguishing between stereotypical associations that are more likely to have visual depictions, such as `sombrero', from those that are less visually concrete, such as 'attractive'. We demonstrate ViSAGe's utility through a multi-faceted evaluation of T2I generations. First, we show that stereotypical attributes in ViSAGe are thrice as likely to be present in generated images of corresponding identities as compared to other attributes, and that the offensiveness of these depictions is especially higher for identities from Africa, South America, and South East Asia. Second, we assess the stereotypical pull of visual depictions of identity groups, which reveals how the 'default' representations of all identity groups in ViSAGe have a pull towards stereotypical depictions, and that this pull is even more prominent for identity groups from the Global South. CONTENT WARNING: Some examples contain offensive stereotypes. This paper introduces ViSAGe, a dataset for evaluating nationality-based stereotypes in Text-to-Image models, covering 135 nationalities, by distinguishing visually depicted stereotypes from those less visually concrete. Existing approaches lack global coverage in evaluating social stereotypes in T2I models, making it crucial to develop methods to assess and mitigate potential harm, particularly for marginalized groups. The authors enriched an existing textual stereotype resource by identifying visually depictable stereotypes. They conducted large-scale human annotations and explored automated methods using CLIP to detect stereotypes in images generated by Stable Diffusion. Stereotypical attributes are three times more likely to be present in generated images compared to non-stereotypical attributes. Offensive depictions are particularly high for identities from Africa, South America, and Southeast Asia. T2I models exhibit a 'stereotypical pull', generating images aligning with stereotypes even when prompted otherwise, especially for Global South identities. Annotation of visual stereotypes can be subjective, potentially missing nuances. Evaluation is limited by stereotypes present in the initial textual resource (SeeGULL), necessitating inclusion from other sources. stereotype evaluation, text-to-image models, visage dataset, global stereotypes, bias in ai
2401.06209 Report Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems. This paper introduces the Multimodal Visual Patterns (MMVP) benchmark to expose the systematic visual shortcomings of Multimodal Large Language Models (MLLMs) like GPT-4V, particularly in visual grounding. Despite advancements in MLLMs, their visual component often relies on instance-level contrastive learning (e.g., CLIP), leading to fundamental visual understanding errors. The authors identify "CLIP-blind pairs" - images visually different but perceived as similar by CLIP. Using these pairs, they construct the MMVP benchmark with straightforward VQA questions targeting these visual discrepancies. They also analyze systematic failure patterns in CLIP across various model scales and correlate them to MLLM errors. Finally, they propose Mixture-of-Features (MoF) approaches to enhance MLLM visual grounding. Human evaluation confirms the MMVP benchmark questions are straightforward, achieving 95.7% accuracy, while MLLMs, even GPT-4V, struggle significantly. Scaling up CLIP model size and data only marginally improves performance on two out of nine identified visual patterns. A strong correlation exists between CLIP's visual pattern recognition errors and the performance of MLLMs, indicating CLIP as a bottleneck. The study primarily focuses on CLIP-based MLLMs, potentially limiting generalizability to other architectures. While MoF shows promise, further exploration is needed to optimize feature integration and balance visual grounding with other capabilities. multimodal learning, visual grounding, large language models, benchmarking, visual representation learning
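A minimal sketch of the CLIP-blind pair mining step summarized above, assuming precomputed CLIP and vision-only (e.g., DINO) embeddings for the same image pool; the thresholds and the brute-force pairwise loop are illustrative choices, not the paper's implementation:

```python
import numpy as np

def find_clip_blind_pairs(clip_feats, dino_feats, clip_thresh=0.95, dino_thresh=0.6):
    """Return index pairs (i, j) that CLIP sees as near-identical while a
    vision-only self-supervised encoder sees as clearly different.

    clip_feats, dino_feats: (N, D) arrays of per-image embeddings.
    Threshold values are assumptions, not the paper's exact settings.
    """
    def cosine_matrix(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    clip_sim, dino_sim = cosine_matrix(clip_feats), cosine_matrix(dino_feats)
    pairs = []
    n = len(clip_feats)
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j))
    return pairs
```

Each returned pair would then be turned into a simple VQA question probing the visual difference that CLIP fails to encode.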
2401.06197 Report Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power and 2. optimizing memory access to minimize redundant operations for speedup. These improvements result in a significantly faster convergence compared to DCNv3 and a substantial increase in processing speed, with DCNv4 achieving more than three times the forward speed. DCNv4 demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation. When integrated into generative models like U-Net in the latent diffusion model, DCNv4 outperforms its baseline, underscoring its potential to enhance generative models. In practical applications, replacing DCNv3 with DCNv4 in the InternImage model to create FlashInternImage results in up to an 80% speed increase and further performance improvement without further modifications. The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models. This paper introduces Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. Despite the advantages of DCN, it is not the go-to solution for vision backbone models due to its slow speed and counter-intuitive slower convergence compared to global attention at the initial backbone training phase. This work aims to address these limitations. The authors derive DCNv4 from DCNv3 by 1) removing the softmax normalization in spatial aggregation and 2) optimizing memory access to minimize redundant operations for speedup. DCNv4 converges significantly faster than DCNv3 (its predecessor). It accelerates forward speed by more than 3 times. DCNv4 achieves performance improvement in various tasks, including image classification, instance and semantic segmentation, and image generation. The header parts in some experiments (e.g., BEVFormer v2 for 3D object detection) are underoptimized. The architecture/hyperparameters might not be optimal for DCNv4 in some cases. deformable convolution, vision backbones, operator optimization, image classification, object detection, image generation
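To make the first enhancement concrete, here is a toy sketch contrasting softmax-bounded spatial aggregation (DCNv3-style) with the unbounded, softmax-free aggregation that DCNv4 adopts; the tensor shapes and the `use_softmax` flag are illustrative only and unrelated to the paper's fused CUDA kernels:

```python
import torch

def aggregate(values, weights, use_softmax):
    """Dynamic spatial aggregation over K sampled values per output location.

    values:  (B, L, K, C)  features sampled at K (deformable) locations
    weights: (B, L, K)     dynamic aggregation weights predicted per location
    use_softmax=True mimics DCNv3 (weights bounded, summing to 1);
    use_softmax=False mimics DCNv4 (unbounded weights, more expressive).
    """
    if use_softmax:
        weights = weights.softmax(dim=-1)
    return (weights.unsqueeze(-1) * values).sum(dim=2)  # (B, L, C)

# toy usage
v = torch.randn(2, 16, 9, 32)
w = torch.randn(2, 16, 9)
out_v3, out_v4 = aggregate(v, w, True), aggregate(v, w, False)
```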
2401.06191 Report TriNeRFLet: A Wavelet Based Multiscale Triplane NeRF Representation Rajaei Khatib, Raja Giryes In recent years, the neural radiance field (NeRF) model has gained popularity due to its ability to recover complex 3D scenes. Following its success, many approaches proposed different NeRF representations in order to further improve both runtime and performance. One such example is Triplane, in which NeRF is represented using three 2D feature planes. This enables easily using existing 2D neural networks in this framework, e.g., to generate the three planes. Despite its advantage, the triplane representation lagged behind in its 3D recovery quality compared to NeRF solutions. In this work, we propose TriNeRFLet, a 2D wavelet-based multiscale triplane representation for NeRF, which closes the 3D recovery performance gap and is competitive with current state-of-the-art methods. Building upon the triplane framework, we also propose a novel super-resolution (SR) technique that combines a diffusion model with TriNeRFLet for improving NeRF resolution. This paper introduces TriNeRFLet, a novel NeRF representation based on a 2D wavelet multiscale triplane structure. It also proposes a super-resolution (SR) technique that combines a diffusion model with TriNeRFLet for improving NeRF resolution. Triplane, while efficient due to its 2D structure, lagged in 3D recovery quality compared to other methods. TriNeRFLet aims to close this gap. TriNeRFLet represents NeRF using multiscale 2D wavelet features, regularizing them to be sparse. It utilizes a coarse-to-fine training strategy. For SR, it leverages the multiscale structure to combine a pre-trained diffusion model with a low-resolution TriNeRFLet to generate high-resolution novel views. TriNeRFLet closes the performance gap of Triplane, achieving competitive 3D reconstruction quality compared to state-of-the-art methods like INGP and 3D Gaussian Splatting. The proposed SR technique outperforms other 2D supervised NeRF SR methods in most experiments on the Blender dataset. For LLFF dataset, TriNeRFLet SR achieves comparable or better results than state-of-the-art methods, demonstrating its effectiveness on real-world scenes. Training TriNeRFLet is more time-consuming than INGP due to the wavelet reconstruction step. The diffusion-based SR model currently used only supports specific upscale factors, requiring padding or cropping for other resolutions. nerf, neural radiance field, triplane, wavelet, super-resolution
2401.06129 Report Distilling Vision-Language Models on Millions of Videos Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset to date. This paper proposes a method for adapting image-based vision-language models (VLMs) to the video domain and uses the adapted VLM to generate high-quality captions for a large-scale video dataset. This is important because there is a lack of high-quality, large-scale video-text data which is crucial for training effective video-language models. The adaptation is done in two stages: (1) visual adaptation by fine-tuning the visual encoder on video captioning data, and (2) language adaptation by fine-tuning the language model on instruction-following data. The adapted VLM is then used to generate captions for a large-scale web-scraped video dataset. The adapted VLM achieves state-of-the-art zero-shot performance on various video-language benchmarks, including video question answering and captioning. The generated captions are of high quality and lead to significant improvements when used to train a video-language dual-encoder model. The approach demonstrates a positive scaling effect, with performance increasing as more pseudo-captioned video data is used. One limitation is the reliance on existing video-text datasets for adaptation, which are still limited in scale and diversity compared to image-text datasets. Further improvements might be achieved by exploring alternative methods for generating instruction-following data and by developing more sophisticated techniques for self-training with pseudo-captioned videos. video-language models, captioning, instruction tuning, pseudo-labeling, zero-shot learning
2401.06105 Report PALP: Prompt Aligned Personalization of Text-to-Image Models Moab Arar, Andrey Voynov, Amir Hertz, Omri Avrahami, Shlomi Fruchter, Yael Pritch, Daniel Cohen-Or, Ariel Shamir Content creators often aim to create personalized images using personal subjects that go beyond the capabilities of conventional text-to-image models. Additionally, they may want the resulting image to encompass a specific location, style, ambiance, and more. Existing personalization methods may compromise personalization ability or the alignment to complex textual prompts. This trade-off can impede the fulfillment of user prompts and subject fidelity. We propose a new approach focusing on personalization methods for a single prompt to address this issue. We term our approach prompt-aligned personalization. While this may seem restrictive, our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts, which may pose a challenge for current techniques. In particular, our method keeps the personalized model aligned with a target prompt using an additional score distillation sampling term. We demonstrate the versatility of our method in multi- and single-shot settings and further show that it can compose multiple subjects or use inspiration from reference images, such as artworks. We compare our approach quantitatively and qualitatively with existing baselines and state-of-the-art techniques. The paper introduces PALP, a novel personalization method for text-to-image diffusion models that excels in aligning generated images with complex user prompts. Existing personalization methods often struggle to balance subject fidelity and adherence to intricate prompts, limiting their ability to fulfill user demands for creative image generation. PALP employs a two-pronged approach: fine-tuning a pre-trained model to learn the subject's unique features and using score distillation sampling to guide the model's noise predictions towards the target prompt. PALP demonstrates superior prompt alignment compared to existing methods while preserving high subject fidelity. The method proves effective in both multi-shot and single-shot settings, enabling personalization even with a single reference image. PALP allows for multi-subject personalization, enabling the creation of coherent scenes with multiple subjects or artistic compositions inspired by a single artwork. The current approach requires personalization for each specific prompt, limiting its real-time applicability. Future work could explore prompt-aligned adapters for instant personalization on specific prompts or extend the method to excel on subsets of prompts for specialized applications. text-to-image synthesis, personalization, prompt alignment, diffusion models, score distillation sampling
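A minimal sketch of a score-distillation-sampling term of the kind PALP adds to keep the personalized model aligned with the target prompt; the `denoiser` callable, the weighting choice, and the update rule are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def sds_grad(latents, denoiser, prompt_emb, alphas_cumprod, t):
    """One score-distillation step: nudge `latents` toward what the frozen
    diffusion model believes the prompt should look like at noise level t.
    `denoiser(z_t, t, prompt_emb)` is assumed to predict the added noise."""
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t]
    z_t = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise   # forward diffusion
    with torch.no_grad():
        eps_pred = denoiser(z_t, t, prompt_emb)
    w_t = 1 - a_t                                           # one common weighting choice
    return w_t * (eps_pred - noise)                         # gradient w.r.t. latents

# usage (sketch): latents = latents - lr * sds_grad(latents, unet, emb, alphas, t)
```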
2401.06104 Report Transformers are Multi-State RNNs Matanel Oren, Michael Hassid, Yossi Adi, Roy Schwartz Transformers are considered conceptually different compared to the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as infinite multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that pretrained transformers can be converted into $\textit{finite}$ multi-state RNNs by fixing the size of their hidden state. We observe that several existing transformers cache compression techniques can be framed as such conversion policies, and introduce a novel policy, TOVA, which is simpler compared to these policies. Our experiments with several long range tasks indicate that TOVA outperforms all other baseline policies, while being nearly on par with the full (infinite) model, and using in some cases only $\frac{1}{8}$ of the original cache size. Our results indicate that transformer decoder LLMs often behave in practice as RNNs. They also lay out the option of mitigating one of their most painful computational bottlenecks - the size of their cache memory. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA. This paper redefines decoder-only transformers as infinite multi-state RNNs and proposes a new compression method, TOVA, to convert them into finite multi-state RNNs. This work is important because it provides a new perspective on the relationship between transformers and RNNs, and proposes a practical method for reducing the memory footprint of LLMs during inference. The authors formally define multi-state RNNs and demonstrate how transformers can be conceptualized as a special case. They then propose TOVA, a compression policy that leverages attention scores to determine which tokens to keep in the multi-state. TOVA outperforms other compression policies on language modeling, achieving comparable perplexity to the full model using only 1/8 - 1/4 of the context. On long-range understanding tasks, TOVA consistently outperforms baselines and achieves near-topline performance with a reduced multi-state size. For text generation, TOVA enables using smaller multi-state sizes with minimal impact on story quality compared to the full model. Evaluating long text generation is computationally expensive and relies on GPT-4 for comparison, which has its own limitations. The evaluation is focused on English, and the findings might not directly transfer to languages with different word order characteristics. transformers, rnns, language models, memory compression, long-range dependencies
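A minimal sketch of the TOVA eviction policy described above: when the KV cache exceeds a fixed multi-state budget, drop the cached token that the newest query attends to least; integration with a real decoder and multi-head attention is omitted:

```python
import torch

def tova_evict(keys, values, attn_scores, budget):
    """Keep at most `budget` cached tokens by dropping those the current
    query attends to least (TOVA policy).

    keys, values: (T, d) cached key/value states
    attn_scores:  (T,)   attention weights of the newest query over the cache
    """
    if keys.size(0) <= budget:
        return keys, values
    keep = attn_scores.topk(budget).indices.sort().values   # preserve token order
    return keys[keep], values[keep]
```

Applied after every decoding step, this keeps the cache at a constant size, which is what lets the transformer behave like a finite multi-state RNN.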
2401.06071 Report GroundingGPT:Language Enhanced Multi-modal Grounding Model Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose GroundingGPT, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks demanding a detailed understanding of local information within the input. It demonstrates precise identification and localization of specific regions in images or moments in videos. To achieve this objective, we design a diversified dataset construction pipeline, resulting in a multi-modal, multi-granularity dataset for model training. The code, dataset, and demo of our model can be found at https: //github.com/lzw-lzw/GroundingGPT. This paper proposes GroundingGPT, an end-to-end multi-modal grounding model for fine-grained understanding and grounding tasks across image, video, and audio. Existing multi-modal large language models (MLLMs) often prioritize global information, neglecting fine-grained details crucial for grounding tasks. The paper uses modality-specific adapters to map features to LLM embedding space, represents coordinates/timestamps textually, and employs a three-stage coarse-to-fine training strategy with a diversified multi-granularity dataset. GroundingGPT achieves impressive results in multi-modal grounding tasks like referring expression comprehension and temporal video grounding. The model maintains or improves multi-modal understanding abilities, excelling in tasks like visual question answering and video question answering. GroundingGPT effectively suppresses object hallucination, indicating enhanced local detail comprehension. The sampling strategy for videos and audios might lead to information loss. Current training predominantly focuses on single-modal inputs, limiting performance on simultaneous multi-modal grounding tasks. multi-modal grounding, large language models, fine-grained understanding, coarse-to-fine training, object hallucination
2401.06035 Report RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies. To capture these dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a singular latent code to model an entire video sequence. Individual video frames are then synthesized from an intermediate tri-plane representation, which itself is derived from the primary latent code. This novel strategy reduces computational complexity by a factor of $2$ as measured in FLOPs. Consequently, our approach facilitates the efficient and temporally coherent generation of videos. Moreover, our joint frame modeling approach, in contrast to autoregressive methods, mitigates the generation of visual artifacts. We further enhance the model's capabilities by integrating an optical flow-based module within our Generative Adversarial Network (GAN) based generator architecture, thereby compensating for the constraints imposed by a smaller generator size. As a result, our model is capable of synthesizing high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps. The efficacy and versatility of our approach are empirically validated through qualitative and quantitative assessments across three different datasets comprising both synthetic and real video clips. Introduces a novel unconditional video generation model using a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks, enabling efficient and temporally coherent video generation. Addresses limitations of autoregressive models in unconditional video generation, particularly the accumulation of errors and challenges in capturing long-term spatial and temporal dependencies. Adapts the tri-plane representation from 3D object modeling to video data, organizing features into three planar grids aligned with spatial and temporal axes. Employs a StyleGAN-T backbone to generate tri-plane features and incorporates optical flow for explicit motion modeling, enhancing feature consistency over time. Utilizes double discrimination with separate discriminators for individual frames and the entire video to enhance training effectiveness. Generates high-fidelity video clips at 256x256 resolution with durations exceeding 5 seconds at 30 fps. Demonstrates superior performance in capturing long-range spatial and temporal dependencies compared to state-of-the-art GAN-based approaches (StyleGAN-V, MoCoGAN). Exhibits significant computational efficiency, requiring less than half the FLOPs of other SOTA models for generating a 160-frame video sample. Performance heavily reliant on the capacity of the generative backbone network, with limitations observed when using less expansive StyleGAN versions. Current implementation lacks explicit disentanglement of objects within generated scenes, limiting control over individual elements. video generation, tri-plane representation, optical flow, generative adversarial networks (gans), long-term temporal dependencies
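A small sketch of how a space-time point could be queried from a video tri-plane representation of the kind described above, with features gathered from the xy, xt, and yt planes and summed; the plane layout and the summation are assumptions based on the general tri-plane recipe, not RAVEN's exact architecture:

```python
import torch
import torch.nn.functional as F

def triplane_lookup(planes, coords):
    """planes: dict with 'xy', 'xt', 'yt' tensors, each of shape (1, C, H, W).
    coords: (N, 3) space-time points in [-1, 1] as (x, y, t).
    Returns (N, C) features, later decoded into a pixel/frame value."""
    x, y, t = coords.unbind(dim=-1)
    feats = 0
    for name, (u, v) in {"xy": (x, y), "xt": (x, t), "yt": (y, t)}.items():
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)   # (1, N, 1, 2)
        sampled = F.grid_sample(planes[name], grid, align_corners=True)
        feats = feats + sampled.squeeze(0).squeeze(-1).T        # (N, C)
    return feats
```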
2401.06003 Report TRIPS: Trilinear Point Splatting for Real-Time Radiance Field Rendering Linus Franke, Darius Rückert, Laura Fink, Marc Stamminger Point-based radiance field rendering has demonstrated impressive results for novel view synthesis, offering a compelling blend of rendering quality and computational efficiency. However, even the latest approaches in this domain are not without their shortcomings. 3D Gaussian Splatting [Kerbl and Kopanas et al. 2023] struggles when tasked with rendering highly detailed scenes, due to blurring and cloudy artifacts. On the other hand, ADOP [Rückert et al. 2022] can accommodate crisper images, but its neural reconstruction network decreases performance, grapples with temporal instability, and is unable to effectively address large gaps in the point cloud. In this paper, we present TRIPS (Trilinear Point Splatting), an approach that combines ideas from both Gaussian Splatting and ADOP. The fundamental concept behind our novel technique involves rasterizing points into a screen-space image pyramid, with the selection of the pyramid layer determined by the projected point size. This approach allows rendering arbitrarily large points using a single trilinear write. A lightweight neural network is then used to reconstruct a hole-free image including detail beyond splat resolution. Importantly, our render pipeline is entirely differentiable, allowing for automatic optimization of both point sizes and positions. Our evaluation demonstrates that TRIPS surpasses existing state-of-the-art methods in terms of rendering quality while maintaining a real-time frame rate of 60 frames per second on readily available hardware. This performance extends to challenging scenarios, such as scenes featuring intricate geometry, expansive landscapes, and auto-exposed footage. The project page is located at: https://lfranke.github.io/trips/ TRIPS, a novel point-based radiance field rendering method that uses trilinear splatting into an image pyramid to achieve real-time performance and high visual quality. Existing point-based radiance field rendering methods either struggle with details (3D Gaussian Splatting) or temporal instability and hole filling (ADOP). Points are splatted trilinearly into an image pyramid based on projected size. A lightweight neural network then reconstructs a hole-free, detailed image from the pyramid. TRIPS achieves superior visual quality compared to 3D Gaussian Splatting, particularly in detail rendering. Outperforms ADOP in filling large gaps and maintaining temporal consistency. Maintains real-time rendering (60 FPS) on a single RTX 4090, even with large point clouds. Requires a dense initial point cloud reconstruction. Lacks anisotropic splatting, leading to potential artifacts with thin structures. neural rendering, point-based rendering, radiance fields, novel view synthesis, real-time rendering
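A simplified, single-channel sketch of the core trilinear write: the pyramid level is chosen from the projected point size, and the point is splatted bilinearly into the two bracketing levels with linear level weights; depth testing, differentiability, and batching are omitted:

```python
import numpy as np

def splat_point(pyramid, x, y, size, value):
    """pyramid: list of 2D float arrays, where level l has half the resolution
    of level l-1. A point with projected `size` (in level-0 pixels) is written
    trilinearly: bilinear in space at the two levels bracketing log2(size)."""
    l = np.clip(np.log2(max(size, 1.0)), 0, len(pyramid) - 1)
    lo = int(np.floor(l))
    hi = min(lo + 1, len(pyramid) - 1)
    for level, w_level in [(lo, 1 - (l - lo)), (hi, l - lo)]:
        img = pyramid[level]
        px, py = x / 2 ** level, y / 2 ** level          # coordinates at this level
        x0, y0 = int(np.floor(px)), int(np.floor(py))
        fx, fy = px - x0, py - y0
        for dx, dy, w in [(0, 0, (1 - fx) * (1 - fy)), (1, 0, fx * (1 - fy)),
                          (0, 1, (1 - fx) * fy), (1, 1, fx * fy)]:
            if 0 <= x0 + dx < img.shape[1] and 0 <= y0 + dy < img.shape[0]:
                img[y0 + dy, x0 + dx] += w_level * w * value
```

The small reconstruction network would then read the whole pyramid and produce a hole-free image at full resolution.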
2401.05925 Report CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians with Dual Feature Fusion Bin Dou, Tianyu Zhang, Yongjia Ma, Zhaohui Wang, Zejian Yuan We propose Compact and Swift Segmenting 3D Gaussians(CoSSegGaussians), a method for compact 3D-consistent scene segmentation at fast rendering speed with only RGB images input. Previous NeRF-based segmentation methods have relied on time-consuming neural scene optimization. While recent 3D Gaussian Splatting has notably improved speed, existing Gaussian-based segmentation methods struggle to produce compact masks, especially in zero-shot segmentation. This issue probably stems from their straightforward assignment of learnable parameters to each Gaussian, resulting in a lack of robustness against cross-view inconsistent 2D machine-generated labels. Our method aims to address this problem by employing Dual Feature Fusion Network as Gaussians' segmentation field. Specifically, we first optimize 3D Gaussians under RGB supervision. After Gaussian Locating, DINO features extracted from images are applied through explicit unprojection, which are further incorporated with spatial features from the efficient point cloud processing network. Feature aggregation is utilized to fuse them in a global-to-local strategy for compact segmentation features. Experimental results show that our model outperforms baselines on both semantic and panoptic zero-shot segmentation task, meanwhile consumes less than 10% inference time compared to NeRF-based methods. Code and more results will be available at https://David-Dou.github.io/CoSSegGaussians This paper proposes CoSSegGaussians, a method for achieving compact and fast 3D scene segmentation using only RGB images as input. Existing Gaussian-based scene segmentation methods struggle to produce compact masks, especially in zero-shot scenarios due to inconsistencies in 2D machine-generated labels. The method leverages 3D Gaussian Splatting for scene representation and employs a Dual Feature Fusion Network. It unprojects multi-scale DINO features onto 3D Gaussians and combines them with spatial features extracted using RandLA-Net. A global-to-local aggregation module then generates compact segmentation logits. CoSSegGaussians outperforms baselines on both semantic and panoptic zero-shot segmentation tasks. It achieves significantly faster rendering speed than NeRF-based segmentation methods. The method produces more compact segmentation masks compared to previous Gaussian-based methods. High GPU occupancy during training due to the large number of Gaussian points. Lack of explicit structural constraints during training. scene segmentation, zero-shot learning, 3d gaussian splatting, dino features, spatial feature aggregation
2401.05907 Report Efficient Image Deblurring Networks based on Diffusion Models Kang Chen, Yuanjie Liu This article introduces a sliding window model for defocus deblurring that achieves the best performance to date with extremely low memory usage. Named Swintormer, the method utilizes a diffusion model to generate latent prior features that assist in restoring more detailed images. It also extends the sliding window strategy to specialized Transformer blocks for efficient inference. Additionally, we have further optimized Multiply-Accumulate operations (Macs). Compared to the currently top-performing GRL method, our Swintormer model drastically reduces computational complexity from 140.35 GMACs to 8.02 GMacs, while also improving the Signal-to-Noise Ratio (SNR) for defocus deblurring from 27.04 dB to 27.07 dB. This new method allows for the processing of higher resolution images on devices with limited memory, significantly expanding potential application scenarios. The article concludes with an ablation study that provides an in-depth analysis of the impact of each network module on final performance. The source code and model will be available at the following website: https://github.com/bnm6900030/swintormer. This paper introduces Swintormer, a sliding window Transformer model for image deblurring that integrates a diffusion model to generate latent prior features, improving deblurring quality with low memory usage. Existing supervised image deblurring methods require large labeled datasets and often lack generalization ability. This paper leverages the power of pre-trained diffusion models to address these limitations. The method employs a pre-trained diffusion model fine-tuned for the deblurring task to generate latent image features. These features are then used as input along with the blurry image to train a memory-efficient sliding window Transformer model. Swintormer achieves state-of-the-art performance on defocus deblurring benchmarks like DPDD, outperforming previous methods in PSNR and LPIPS. The use of latent features from the diffusion model significantly improves deblurring quality, particularly in challenging outdoor scenes. The proposed sliding window approach with shifted windows and mixed attention mechanism allows for efficient inference on high-resolution images with low computational complexity (MACs). The paper focuses primarily on defocus and motion deblurring; further exploration is needed for other deblurring types. Future work will investigate more efficient architectures and training strategies for the diffusion model to further reduce computational cost. image deblurring, diffusion models, transformer, sliding window, low memory
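A generic sketch of sliding-window (tiled) inference with overlap averaging, the general strategy that keeps memory bounded when processing high-resolution images; the tile size, overlap, and the assumption that `model` preserves spatial resolution are illustrative, not Swintormer's exact scheme:

```python
import torch

def sliding_window_infer(model, image, tile=256, overlap=32):
    """Run `model` on overlapping tiles of `image` (shape (1, C, H, W)) and
    average the overlapping predictions, bounding peak memory by the tile size.
    Assumes `model` returns an output with the same spatial size as its input."""
    _, _, H, W = image.shape
    out = torch.zeros_like(image)
    weight = torch.zeros(1, 1, H, W, device=image.device)
    stride = tile - overlap

    def starts(full):
        s = list(range(0, max(full - tile, 0) + 1, stride))
        if s[-1] != max(full - tile, 0):      # make sure the border is covered
            s.append(max(full - tile, 0))
        return s

    for top in starts(H):
        for left in starts(W):
            patch = image[:, :, top:top + tile, left:left + tile]
            with torch.no_grad():
                pred = model(patch)
            out[:, :, top:top + tile, left:left + tile] += pred
            weight[:, :, top:top + tile, left:left + tile] += 1
    return out / weight
```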
2401.05750 Report GO-NeRF: Generating Virtual Objects in Neural Radiance Fields Peng Dai, Feitong Tan, Xin Yu, Yinda Zhang, Xiaojuan Qi Despite advances in 3D generation, the direct creation of 3D objects within an existing 3D scene represented as NeRF remains underexplored. This process requires not only high-quality 3D object generation but also seamless composition of the generated 3D content into the existing NeRF. To this end, we propose a new method, GO-NeRF, capable of utilizing scene context for high-quality and harmonious 3D object generation within an existing NeRF. Our method employs a compositional rendering formulation that allows the generated 3D objects to be seamlessly composited into the scene utilizing learned 3D-aware opacity maps without introducing unintended scene modification. Moreover, we also develop tailored optimization objectives and training strategies to enhance the model's ability to exploit scene context and mitigate artifacts, such as floaters, originating from 3D object generation within a scene. Extensive experiments on both feed-forward and 360° scenes show the superior performance of our proposed GO-NeRF in generating objects harmoniously composited with surrounding scenes and synthesizing high-quality novel view images. Project page at https://daipengwa.github.io/GO-NeRF/. GO-NeRF, a novel pipeline that generates context-aware 3D virtual objects from text prompts and seamlessly integrates them into pre-trained NeRF scenes. Enables novel scene creation and editing by harmoniously compositing generated 3D objects into existing environments, enhancing immersion in applications like VR. Uses a compositional rendering formulation with a separate object NeRF and 3D-aware opacity maps for seamless composition. Employs context-aware learning objectives, including inpainting priors and saturation regularization, for high-quality, scene-consistent object generation. Generates high-quality, context-aware 3D objects within existing scenes, as demonstrated on feed-forward and 360° datasets. Preserves unchanged scene content beyond the designated editing region, ensuring minimal unintended modifications. Maintains compatibility with various NeRF representations, allowing for flexible integration with existing scene models. The method's ability to modify regions outside the user-specified 3D bounding box, such as reflections, is limited. Reliance on SDS loss may introduce limitations inherent to that technique, such as the Janus problem. neural radiance fields, 3d object generation, scene editing, compositional rendering, text-to-3d
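One standard way to composite two radiance fields along a ray, shown here as a hedged stand-in for the compositional rendering formulation described above: densities add and colors are density-weighted before alpha compositing (GO-NeRF's learned 3D-aware opacity maps may differ in detail):

```python
import torch

def composite_two_fields(sigma_scene, rgb_scene, sigma_obj, rgb_obj, deltas):
    """Volume-render scene + object samples taken at the same ray positions.

    sigma_*: (S,) densities, rgb_*: (S, 3) colors, deltas: (S,) segment lengths.
    Returns the composited (3,) pixel color for this ray.
    """
    sigma = sigma_scene + sigma_obj
    rgb = (sigma_scene.unsqueeze(-1) * rgb_scene +
           sigma_obj.unsqueeze(-1) * rgb_obj) / sigma.clamp(min=1e-8).unsqueeze(-1)
    alpha = 1 - torch.exp(-sigma * deltas)                       # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)
```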
2401.05735 Report Object-Centric Diffusion for Efficient Video Editing Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, coined as OCD, to further reduce latency by allocating computations more towards foreground edited regions that are arguably more important for perceptual quality. We achieve this by two novel proposals: i) Object-Centric Sampling, decoupling the diffusion steps spent on salient regions or background, allocating most of the model capacity to the former, and ii) Object-Centric 3D Token Merging, which reduces cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction up to 10x for a comparable synthesis quality. This paper introduces Object-Centric Diffusion (OCD), a set of techniques for speeding up diffusion-based video editing by focusing computation on edited foreground objects. Diffusion-based video editing models are computationally expensive, especially those using inversion or cross-frame attention for temporal consistency. The authors propose (1) Object-Centric Sampling: separating diffusion sampling for foreground and background, with fewer steps on the latter, and (2) Object-Centric 3D Token Merging: reducing cross-frame attention tokens by fusing redundant ones predominantly in background regions. OCD speeds up inversion-based editing by 10x and ControlNet-based editing by 6x without sacrificing fidelity. Object-Centric Sampling is especially effective for smaller objects, achieving up to 2x additional speed-up. OCD reduces memory consumption for attention maps by 17x. OCD is less effective for global video editing tasks. Hyperparameter tuning is still required per-sequence. video editing, diffusion models, efficiency, object-centric, token merging
2401.05675 Report Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang Recent works demonstrate that using reinforcement learning (RL) with quality rewards can enhance the quality of generated images in text-to-image (T2I) generation. However, a simple aggregation of multiple rewards may cause over-optimization in certain metrics and degradation in others, and it is challenging to manually find the optimal weights. An effective strategy to jointly optimize multiple rewards in RL for T2I generation is highly desirable. This paper introduces Parrot, a novel multi-reward RL framework for T2I generation. Through the use of the batch-wise Pareto optimal selection, Parrot automatically identifies the optimal trade-off among different rewards during the RL optimization of the T2I generation. Additionally, Parrot employs a joint optimization approach for the T2I model and the prompt expansion network, facilitating the generation of quality-aware text prompts, thus further enhancing the final image quality. To counteract the potential catastrophic forgetting of the original user prompt due to prompt expansion, we introduce original prompt centered guidance at inference time, ensuring that the generated image remains faithful to the user input. Extensive experiments and a user study demonstrate that Parrot outperforms several baseline methods across various quality criteria, including aesthetics, human preference, image sentiment, and text-image alignment. Presents Parrot, a novel framework for improving text-to-image generation using multi-reward reinforcement learning, enabling joint optimization of image quality and prompt expansion. Existing text-to-image models struggle to consistently produce high-quality images across various aspects, such as aesthetics, alignment with user input, and emotional impact. Manually balancing these factors is challenging, necessitating a more efficient optimization approach. Parrot employs batch-wise Pareto-optimal selection to identify and leverage samples that achieve the best trade-offs between multiple quality rewards, enabling simultaneous optimization across aesthetics, human preference, image sentiment, and text-image alignment. It also jointly optimizes the text-to-image model and a prompt expansion network for generating quality-aware prompts. To preserve faithfulness to the original user prompt, it incorporates original prompt-centered guidance during inference. Parrot consistently outperforms baseline methods in generating images with improved aesthetics, human preference alignment, image sentiment, and text-image alignment. Joint optimization of the prompt expansion network and the text-to-image model proves superior to individually fine-tuning either component. Original prompt-centered guidance effectively mitigates the risk of catastrophic forgetting, ensuring generated images remain faithful to the initial user input while incorporating expanded details. The effectiveness of Parrot relies on the quality and comprehensiveness of the image quality metrics used. Further exploration of additional rewards and improved quality metrics can enhance Parrot's capabilities. text-to-image generation, reinforcement learning, multi-objective optimization, prompt expansion, image quality assessment
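A minimal sketch of batch-wise Pareto-optimal selection over multiple reward scores: only samples not dominated by another sample in the batch are kept for the policy update; the reward names in the usage comment come from the summary, everything else is illustrative:

```python
import numpy as np

def pareto_front(rewards):
    """rewards: (N, R) array of per-sample reward scores, higher is better.
    Returns indices of samples not dominated by any other sample in the batch
    (sample j dominates i if it is >= everywhere and > in at least one reward)."""
    n = rewards.shape[0]
    keep = []
    for i in range(n):
        dominated = np.any(
            np.all(rewards >= rewards[i], axis=1) &
            np.any(rewards > rewards[i], axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep

# e.g. rewards = np.stack([aesthetic, preference, sentiment, alignment], axis=1)
```

Selecting the non-dominated subset avoids hand-tuning reward weights, which is the trade-off issue the paper targets.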
2401.05583 Report Diffusion Priors for Dynamic View Synthesis from Monocular Videos Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli Cao, Guocheng Qian, Hsin-Ying Lee, Sergey Tulyakov Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos. Existing methods struggle to distinguishing between motion and structure, particularly in scenarios where camera poses are either unknown or constrained compared to object motion. Furthermore, with information solely from reference images, it is extremely challenging to hallucinate unseen regions that are occluded or partially observed in the given videos. To address these issues, we first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique. Subsequently, we distill the knowledge from the finetuned model to a 4D representations encompassing both dynamic and static Neural Radiance Fields (NeRF) components. The proposed pipeline achieves geometric consistency while preserving the scene identity. We perform thorough experiments to evaluate the efficacy of the proposed method qualitatively and quantitatively. Our results demonstrate the robustness and utility of our approach in challenging cases, further advancing dynamic novel view synthesis. This paper presents a novel method for dynamic novel view synthesis from monocular videos, leveraging the power of 2D diffusion priors to address the limitations of hand-crafted priors used in previous works. Existing methods for dynamic novel view synthesis struggle to handle self-occlusions, out-of-view details, and complex motions, especially when relying solely on information from reference views. This work explores the use of 2D diffusion priors to overcome these limitations and improve the quality of dynamic scene reconstruction. The proposed method represents a 4D scene using separate NeRFs for static and dynamic regions. It employs a combination of reconstruction losses on existing views and an SDS loss with RGB-D diffusion priors for novel views. Additionally, Dreambooth fine-tuning is used to personalize the diffusion model and preserve scene identity. The method generates visually superior novel views compared to existing state-of-the-art methods, particularly in handling complex object motions and hallucinating unseen regions. Quantitative evaluation on the iPhone dataset demonstrates competitive performance in terms of mLPIPS and mSSIM scores, although these metrics are found to not fully capture the perceived visual quality. User studies confirm that the proposed method produces more realistic and visually pleasing results compared to baselines, highlighting the benefits of using 2D diffusion priors. The method is computationally expensive, requiring high-end GPUs for training and limiting the achievable rendering resolution. Future work could explore more efficient representations and lighter diffusion models. Temporal smoothness relies on the multi-level design of instant-NGP, which might be insufficient for complex scenarios. Exploring stronger video diffusion models for temporal consistency is an area for future research. novel view synthesis, dynamic scene reconstruction, diffusion models, neural radiance fields (nerf), dreambooth
2401.05516 Report FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields GeonU Kim, Kim Youwang, Tae-Hyun Oh We present FPRF, a feed-forward photorealistic style transfer method for large-scale 3D neural radiance fields. FPRF stylizes large-scale 3D scenes with arbitrary, multiple style reference images without additional optimization while preserving multi-view appearance consistency. Prior arts required tedious per-style/-scene optimization and were limited to small-scale 3D scenes. FPRF efficiently stylizes large-scale 3D scenes by introducing a style-decomposed 3D neural radiance field, which inherits AdaIN's feed-forward stylization machinery, supporting arbitrary style reference images. Furthermore, FPRF supports multi-reference stylization with the semantic correspondence matching and local AdaIN, which adds diverse user control for 3D scene styles. FPRF also preserves multi-view consistency by applying semantic matching and style transfer processes directly onto queried features in 3D space. In experiments, we demonstrate that FPRF achieves favorable photorealistic quality 3D scene stylization for large-scale scenes with diverse reference images. Project page: https://kim-geonu.github.io/FPRF/ Presents FPRF, a feed-forward photorealistic style transfer method for large-scale 3D neural radiance fields using adaptive instance normalization (AdaIN). Existing 3D scene style transfer methods are computationally expensive, requiring per-style or per-scene optimization, and don't scale well to large scenes. Trains a stylizable radiance field comprised of a scene content field for geometry and appearance and a scene semantic field for local style matching. Employs a pre-trained MLP color decoder for generalization. Uses a style dictionary of clustered style reference images for efficient semantic matching and local AdaIN style transfer. Achieves multi-view consistent style transfer on large-scale scenes, unlike 2D methods. Outperforms competing 3D style transfer methods on small-scale scenes in terms of quality and efficiency. Successfully transfers styles from multiple reference images based on semantic correspondence with the scene. Semantic matching performance is limited by the DINO semantic encoder. Future work includes exploring more advanced semantic encoders. style transfer, neural radiance fields, 3d scene stylization, adaptive instance normalization, semantic matching
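For reference, the standard AdaIN operation that FPRF's feed-forward stylization machinery builds on: content features are re-normalized to match the channel-wise statistics of the style features (shown here globally; the paper's local, semantically matched variant is not reproduced):

```python
import torch

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: align per-channel mean/std of the
    `content` features with those of the `style` features. Shapes: (B, C, H, W)."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```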
2401.05293 Report Score Distillation Sampling with Learned Manifold Corrective Thiemo Alldieck, Nikos Kolotouros, Cristian Sminchisescu Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects. Instead, we train a shallow network mimicking the timestep-dependent denoising deficiency of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through several qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis. This paper provides an in-depth analysis of the Score Distillation Sampling (SDS) loss function, identifying a noise issue and proposing a solution called LMC-SDS (Score Distillation Sampling with Learned Manifold Corrective) to provide better gradients for improved image fidelity. SDS, while popular for controlling optimization problems with text prompts, suffers from issues like blurry results at low guidance and artifacts at high guidance. This limits its effectiveness and applicability. The authors decompose the SDS loss, pinpoint the component responsible for noisy gradients, and introduce LMC-SDS. This involves training a shallow network to approximate the diffusion model's denoising deficiencies and using it to achieve cleaner gradients. LMC-SDS generates higher quality images with balanced colors compared to the original SDS, especially at lower guidance levels. It excels in optimization-based image editing, preserving image structure while effectively aligning with target prompts. LMC-SDS proves beneficial in training image-to-image translation networks and enhancing text-to-3D models like DreamFusion. LMC-SDS relies on the diffusion model's understanding of prompts, limiting its effectiveness for ambiguous prompts. The method may struggle with optimization states that deviate significantly from the natural image manifold. image diffusion models, score distillation sampling, image synthesis, image editing, text-to-3d synthesis
2401.05252 Report PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates the inference speed, enabling the production of high-quality images in just 2-4 steps. Notably, PIXART-δ achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images, marking a 7x improvement over PIXART-α. Additionally, PIXART-δ is designed to be efficiently trainable on 32GB V100 GPUs within a single day. With its 8-bit inference capability (von Platen et al., 2023), PIXART-δ can synthesize 1024px images within 8GB GPU memory constraints, greatly enhancing its usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a novel ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis. PIXART-δ is a novel text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and a novel ControlNet-Transformer architecture into the PIXART-α model to achieve accelerated inference speed and fine-grained controllability. This integration enables the generation of high-quality, controllable images at a 1024px resolution in a mere 0.5 seconds, a significant improvement over existing methods. The authors incorporate LCM into PIXART-δ for faster inference and propose a ControlNet-Transformer architecture tailored for the PIXART-α model to enhance controllability over the generated images. The model is trained on a 120K internal image-text dataset and ablations are conducted on the network architecture and training hyperparameters. PIXART-δ achieves a breakthrough 0.5 seconds for generating 1024 × 1024 pixel images, marking a 7× improvement over PIXART-α. The ControlNet-Transformer architecture effectively controls the generation process while maintaining high image quality. The model can be trained on a single 32GB V100 GPU within a day and performs inference with 8-bit precision using only 8GB of GPU memory. The ControlNet module is only explored with HED edge maps as a conditioning input. Exploration of larger batch sizes and alternative sampling methods is left for future work. text-to-image synthesis, latent consistency model, controlnet, transformer, fast inference
2401.05224 Report Do Vision and Language Encoders Represent the World Similarly? Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders since they fundamentally represent the same physical world? Analyzing the latent spaces structure of vision and language models on image-caption benchmarks using the Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods - a Fast Quadratic Assignment Problem optimization, and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this on several downstream tasks including cross-lingual, cross-domain caption matching and image classification. Code available at github.com/mayug/0-shot-llm-vision. This paper investigates the inherent alignment between unaligned vision and language encoders and leverages it for downstream cross-modal tasks in a training-free manner using Centered Kernel Alignment (CKA). Aligned text-image encoders are the standard for vision-language tasks, but require extensive training. This work explores whether this training is necessary due to the inherent alignment stemming from both modalities representing the same world. The authors leverage CKA similarity between vision and language encoders and propose two methods: 1) Quadratic Assignment Problem (QAP) optimization to maximize CKA for matching, and 2) A novel localized CKA metric for retrieval tasks. Unaligned encoders exhibit surprisingly high semantic similarity, comparable to aligned encoders, especially when trained on large, diverse datasets. The proposed QAP matching and localized CKA retrieval methods outperform baseline methods like linear regression and relative representations on cross-domain and cross-lingual caption matching/retrieval tasks. The methods are shown to be effective for image classification on ImageNet-100 and cross-lingual image retrieval using multilingual sentence transformers. The computational complexity of QAP matching and local CKA retrieval is higher than baseline methods, although the authors propose potential optimizations. The study primarily focuses on global image-caption alignment and could be extended to explore finer-grained alignments. vision-language models, zero-shot learning, centered kernel alignment, cross-modal retrieval, cross-lingual retrieval
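A small sketch of linear CKA, the similarity measure used to compare vision and language representation spaces; X and Y are paired feature matrices for the same image-caption set (the paper may also use other kernel variants):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representations X (N, D1) and
    Y (N, D2) of the same N examples. Returns a similarity in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)
```

A high CKA score between, say, a DINO image encoder and a sentence encoder over an image-caption benchmark is the kind of evidence the paper uses to argue the two spaces are semantically aligned.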
2401.05097 Report Any-Way Meta Learning Junhoo Lee, Yearim Kim, Hyunho Lee, Nojun Kwak Although meta-learning shows promising performance in the realm of rapid adaptability, it is constrained by fixed cardinality. When faced with tasks of varying cardinalities that were unseen during training, the model loses its effectiveness. In this paper, we address and resolve this challenge by harnessing "label equivalence", which emerges from stochastic numeric label assignments during episodic task sampling. Questioning what defines "true" meta-learning, we introduce the "any-way" learning paradigm, an innovative model training approach that liberates the model from fixed cardinality constraints. Surprisingly, this model not only matches but often outperforms traditional fixed-way models in terms of performance, convergence speed, and stability. This disrupts established notions about domain generalization. Furthermore, we argue that the inherent label equivalence naturally lacks semantic information. To bridge this semantic information gap arising from label equivalence, we further propose a mechanism for infusing semantic class information into the model. This would enhance the model's comprehension and functionality. Experiments conducted on renowned architectures like MAML and ProtoNet affirm the effectiveness of our method. This paper proposes "any-way" few-shot learning, overcoming the limitation of fixed cardinality in conventional meta-learning by leveraging "label equivalence". Current meta-learning models struggle to adapt to new tasks with varying numbers of classes (different "ways"), hindering their application in real-world scenarios requiring flexibility. The authors utilize the concept of "label equivalence" arising from stochastic numeric label assignments during episodic task sampling. They propose an "any-way" learning method, allowing the model to handle tasks with any cardinality. They further introduce a mechanism to inject semantic class information into the model. The proposed "any-way" meta-learning model matches or outperforms traditional fixed-way models in performance, convergence speed, and stability. The model exhibits strong domain generalization capabilities, adapting well to unseen task cardinalities and datasets. Injecting semantic class information improves performance, particularly for fine-grained datasets, and allows incorporating techniques like Mixup from supervised learning. Performance degradation can occur when applying Mixup in "same" scenarios due to the trade-off between generality and specificity. Further research on advanced algorithms to exploit label equivalence and enhance ensemble techniques is needed. meta-learning, few-shot learning, label equivalence, domain generalization, semantic class information
2401.05011 Report Dual-Perspective Knowledge Enrichment for Semi-Supervised 3D Object Detection Yucheng Han, Na Zhao, Weiling Chen, Keng Teck Ma, Hanwang Zhang Semi-supervised 3D object detection is a promising yet under-explored direction to reduce data annotation costs, especially for cluttered indoor scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this task by utilizing a teacher model to generate pseudo-labels for unlabeled samples. However, the availability of unlabeled samples in the 3D domain is relatively limited compared to its 2D counterpart due to the greater effort required to collect 3D data. Moreover, the loose consistency regularization in SESS and restricted pseudo-label selection strategy in 3DIoUMatch lead to either low-quality supervision or a limited amount of pseudo labels. To address these issues, we present a novel Dual-Perspective Knowledge Enrichment approach named DPKE for semi-supervised 3D object detection. Our DPKE enriches the knowledge of limited training data, particularly unlabeled data, from two perspectives: data-perspective and feature-perspective. Specifically, from the data-perspective, we propose a class-probabilistic data augmentation method that augments the input data with additional instances based on the varying distribution of class probabilities. Our DPKE achieves feature-perspective knowledge enrichment by designing a geometry-aware feature matching method that regularizes feature-level similarity between object proposals from the student and teacher models. Extensive experiments on the two benchmark datasets demonstrate that our DPKE achieves superior performance over existing state-of-the-art approaches under various label ratio conditions. The source code will be made available to the public. This paper proposes DPKE, a Dual-Perspective Knowledge Enrichment approach for semi-supervised 3D object detection, addressing the challenges of limited data diversity and effective pseudo-label utilization in cluttered indoor scenes. Annotating 3D data, especially for cluttered indoor scenes, is expensive and time-consuming. Semi-supervised methods aim to alleviate this by learning from both labeled and unlabeled data. However, existing methods suffer from limited data diversity and low-quality or low-recall pseudo labels. DPKE enriches knowledge from two perspectives: 1) **Data-perspective:** employs a class-probabilistic data augmentation method to diversify training data by inserting instances from a proposal bank into scenes based on class probabilities. 2) **Feature-perspective:** utilizes a geometry-aware feature matching method to regularize feature-level similarity between student and teacher model proposals, focusing on potential foreground proposals based on geometry similarity. DPKE achieves superior performance over existing state-of-the-art methods on ScanNet and SUN RGB-D datasets under various label ratios. Class-probabilistic data augmentation effectively handles limited diversity by increasing the presence of less-learned categories. Geometry-aware feature matching improves pseudo-label recall by leveraging feature-level similarity with geometry constraints. The improvement on SUN RGB-D is less significant than on ScanNet, potentially due to lower point cloud quality and ground truth proposal accuracy. Future work could explore alternative data augmentation techniques or other feature matching strategies for further performance improvement. semi-supervised learning, 3d object detection, data augmentation, feature matching, knowledge distillation
2401.04861 Report CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video Xingyu Miao, Yang Bai, Haoran Duan, Yawen Huang, Fan Wan, Yang Long, Yefeng Zheng The goal of our work is to generate high-quality novel views from monocular videos of complex and dynamic scenes. Prior methods, such as DynamicNeRF, have shown impressive performance by leveraging time-varying dynamic radiance fields. However, these methods have limitations when it comes to accurately modeling the motion of complex objects, which can lead to inaccurate and blurry renderings of details. To address this limitation, we propose a novel approach that builds upon a recent generalizable NeRF, which aggregates nearby views onto new viewpoints. However, such methods are typically only effective for static scenes. To overcome this challenge, we introduce a module that operates in both the time and frequency domains to aggregate the features of object motion. This allows us to learn the relationship between frames and generate higher-quality images. Our experiments demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets. Specifically, our approach outperforms existing methods in terms of both the accuracy and visual quality of the synthesized views. This paper proposes CTNeRF, a novel dynamic neural radiance field method for synthesizing high-quality novel views from monocular videos of dynamic scenes by aggregating multi-view features. Existing methods struggle to accurately model complex object motion in dynamic scenes, leading to blurry or inaccurate renderings. The method uses a ray-based cross-time (RBCT) aggregation module to capture temporal relationships between features and a global spatio-temporal filter (GSTF) to model motion in the frequency domain. CTNeRF achieves state-of-the-art results on the Nvidia Dynamic Scene Dataset, outperforming existing methods in most tested scenarios. The RBCT and GSTF modules are shown to be crucial for improving the quality of synthesized views, enhancing detail and reducing artifacts. The method shows comparable performance to existing techniques on the iPhone dataset and even surpasses them in some cases. The method may not perform optimally when rendering novel views for long-sequence videos due to limited aggregation view length. Fine details might be lost during feature aggregation, particularly in scenes with small non-rigid deformations. dynamic neural radiance field, monocular video, novel view synthesis, scene flow, transformer
2401.04728 Report Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multi-view-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly, this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks. The code for our project is publicly available. This paper introduces a novel morphable diffusion model for controllable, photorealistic human avatar creation from a single image, integrating a 3D morphable model with multi-view consistent diffusion for improved novel view synthesis and facial expression control. Existing methods for generating photorealistic avatars often require extensive visual input, lack controllability, or struggle with 3D consistency, limiting their use in creating animatable and realistic avatars from minimal input. The method combines a 3D morphable model with a multi-view diffusion model. A 3D morphable model unprojects noisy image features to 3D space, guiding the diffusion process for improved reconstruction and animation. A shuffled training scheme disentangles reconstruction and animation, enabling novel expression synthesis from a single image. The model outperforms baselines in novel view synthesis of faces and bodies, achieving higher scores on LPIPS, SSIM, FID, and PCK metrics. It effectively synthesizes novel facial expressions from a single image, demonstrating superior quality and controllability compared to existing methods. Quantitative and qualitative evaluations highlight the model's ability to generate high-fidelity, animatable avatars with improved 3D consistency. The model's generalizability is limited by the training data's ethnic and hairstyle diversity, primarily featuring Asian subjects with a specific cap. Generalization to out-of-distribution camera parameters remains a challenge, requiring external methods for comprehensive 3D reconstruction and free-view synthesis. diffusion models, 3d morphable models, avatar creation, novel view synthesis, facial expression control
2401.04716 Report Low-Resource Vision Challenges for Foundation Models Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for deep learning at scale. However, low-resource problems are under-explored in computer vision. In this paper, we address this gap and explore the challenges of low-resource image tasks with vision foundation models. We first collect a benchmark of genuinely low-resource image data, covering historic maps, circuit diagrams, and mechanical drawings. These low-resource settings all share three challenges: data scarcity, fine-grained differences, and the distribution shift from natural images to the specialized domain of interest. While existing foundation models have shown impressive generalizability, we find they cannot transfer well to our low-resource tasks. To begin to tackle the challenges of low-resource vision, we introduce one simple baseline per challenge. Specifically, we i) enlarge the data space by generative models, ii) adopt the best sub-kernels to encode local regions for fine-grained difference discovery and iii) learn attention for specialized domains. Experiments on our three low-resource tasks demonstrate our proposals already provide a better baseline than transfer learning, data augmentation, and fine-grained methods. This highlights the unique characteristics and challenges of low-resource vision for foundation models that warrant further investigation. Project page: https://xiaobai1217.github.io/Low-Resource-Vision/. This paper investigates the challenges of low-resource image recognition and presents a benchmark covering historic maps, circuit diagrams, and mechanical drawings. Low-resource scenarios are common in computer vision, but under-explored, making it crucial to understand these challenges and adapt existing methods. The authors collect a benchmark of low-resource image data, analyze its challenges, and propose three baselines to address data scarcity, fine-grained details, and specialized domain shift. Existing foundation models struggle to generalize to the specialized domains of low-resource vision tasks. Simple transformations and existing fine-grained recognition methods fail to handle the limited data and domain shift. The proposed baselines, especially generated data augmentation, improve performance over zero-shot transfer and existing transfer learning methods. The proposed baselines are an initial step and struggle to fully address the complex relationships and rare image styles in low-resource vision. Future work should explore more diverse generated data, consider inter-region relationships, and adapt foundation models to non-natural images. low-resource vision, foundation models, transfer learning, data augmentation, fine-grained recognition
2401.04651 Report Learning to Prompt Segment Anything Models Jiaxing Huang, Kai Jiang, Jingyi Zhang, Han Qiu, Lewei Lu, Shijian Lu, Eric Xing Segment Anything Models (SAMs) like SEEM and SAM have demonstrated great potential in learning to segment anything. The core design of SAMs lies with Promptable Segmentation, which takes a handcrafted prompt as input and returns the expected segmentation mask. SAMs work with two types of prompts including spatial prompts (e.g., points) and semantic prompts (e.g., texts), which work together to prompt SAMs to segment anything on downstream datasets. Despite the important role of prompts, how to acquire suitable prompts for SAMs is largely under-explored. In this work, we examine the architecture of SAMs and identify two challenges for learning effective prompts for SAMs. To this end, we propose spatial-semantic prompt learning (SSPrompt) that learns effective semantic and spatial prompts for better SAMs. Specifically, SSPrompt introduces spatial prompt learning and semantic prompt learning, which optimize spatial prompts and semantic prompts directly over the embedding space and selectively leverage the knowledge encoded in pre-trained prompt encoders. Extensive experiments show that SSPrompt achieves superior image segmentation performance consistently across multiple widely adopted datasets. This paper proposes spatial-semantic prompt learning (SSPrompt), a novel prompt learning technique for Segment Anything Models (SAMs) that enhances segmentation performance on downstream datasets by directly optimizing spatial and semantic prompts in the embedding space. Existing SAMs often underperform when using default prompts on downstream datasets. Optimizing prompts for these models is crucial to unlocking their full potential, especially in few-shot learning scenarios. SSPrompt leverages two key components: 1) spatial prompt learning (SpaPrompt), which optimizes spatial prompts in a high-dimensional embedding space to overcome limitations of the 2D coordinate system, and 2) semantic prompt learning (SemPrompt), which efficiently optimizes semantic prompts in the embedding space and selectively utilizes knowledge from pretrained text encoders. SSPrompt consistently outperforms state-of-the-art prompt learning methods across various image segmentation datasets, including Cityscapes, BDD100K, Mapillary, ADE20K, PASCAL Context, and ACDC. Ablation studies highlight the effectiveness of optimizing prompts in the embedding space and selectively leveraging knowledge from pretrained prompt encoders. SSPrompt demonstrates robustness to varying training data sizes, effectively improving performance even with limited data. The paper primarily focuses on SEEM due to the unavailability of open-sourced SAM versions with text prompt encoders, potentially limiting the generalizability of the findings to different SAM architectures. Future work could explore prompt learning techniques for other recently released semantic-aware SAMs, such as Semantic SAM and SAM-CLIP. prompt learning, segment anything model (sam), image segmentation, few-shot learning, computer vision
2401.04608 Report EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models Jingyuan Yang, Jiawei Feng, Hui Huang Recent years have witnessed remarkable progress in the image generation task, where users can create visually astonishing images with high quality. However, existing text-to-image diffusion models are proficient in generating concrete concepts (dogs) but encounter challenges with more abstract ones (emotions). Several efforts have been made to modify image emotions with color and style adjustments, but these face limitations in effectively conveying emotions with fixed image content. In this work, we introduce Emotional Image Content Generation (EICG), a new task to generate semantically clear and emotion-faithful images given emotion categories. Specifically, we propose an emotion space and construct a mapping network to align it with the powerful Contrastive Language-Image Pre-training (CLIP) space, providing a concrete interpretation of abstract emotions. Attribute loss and emotion confidence are further proposed to ensure the semantic diversity and emotion fidelity of the generated images. Our method outperforms the state-of-the-art text-to-image approaches both quantitatively and qualitatively, where we derive three custom metrics, i.e., emotion accuracy, semantic clarity and semantic diversity. In addition to generation, our method can help emotion understanding and inspire emotional art design. Introduces Emotional Image Content Generation (EICG), a novel task to generate images with clear semantics that evoke specific emotions, addressing the limitations of text-to-image models in handling abstract concepts like emotions. Current text-to-image models excel at concrete concepts but struggle with abstract ones like emotions. Existing emotion modification methods are limited by fixed image content. EICG aims to bridge this gap by generating images that are both semantically meaningful and emotionally evocative. Proposes a mapping network to align a learned emotion space with the CLIP space, leveraging an attribute loss based on EmoSet's annotations and an emotion confidence mechanism to ensure semantic clarity, diversity, and emotion fidelity. Outperforms state-of-the-art text-to-image generation methods in generating images with higher fidelity, diversity, semantic clarity, and emotion accuracy. User study confirms the superiority of the proposed method in terms of image fidelity, emotion faithfulness, and semantic diversity. Demonstrates potential applications in emotion decomposition, emotion transfer for image editing and design, and emotion fusion for creating complex emotional experiences. Current work focuses on content and could be enhanced by incorporating other visual elements like color and style. The emotion-content relationship is simplified as binary, neglecting the nuanced emotional associations of certain objects/scenes. emotion generation, text-to-image synthesis, visual emotion analysis, contrastive language-image pretraining (clip), diffusion models
2401.04468 Report MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation Weimin Wang, Jiawei Liu, Zhijie Lin, Jiangqiao Yan, Shuo Chen, Chetwin Low, Tuyen Hoang, Jie Wu, Jun Hao Liew, Hanshu Yan, Daquan Zhou, Jiashi Feng The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2 that integrates the text-to-image model, video motion generator, reference image embedding module and frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo-V2 can generate an aesthetically pleasing, high-resolution video with remarkable fidelity and smoothness. It demonstrates superior performance over leading Text-to-Video systems such as Runway, Pika 1.0, Morph, Moon Valley and Stable Video Diffusion model via user evaluation at large scale. MagicVideo-V2, a novel multi-stage Text-to-Video (T2V) framework that generates high-fidelity and smooth videos from text descriptions. Addresses the growing demand for high-fidelity video generation from textual descriptions. Integrates Text-to-Image (T2I), Image-to-Video (I2V), Video-to-Video (V2V), and Video Frame Interpolation (VFI) modules into an end-to-end pipeline. Generates aesthetically pleasing, high-resolution videos with remarkable fidelity and smoothness. Outperforms leading T2V systems like Runway, Pika 1.0, Morph, Moon Valley, and Stable Video Diffusion in large-scale user evaluations. Demonstrates superior performance in generating smooth and high-aesthetic videos through qualitative examples. Limited diversity and volume in video training datasets. Reliance on human evaluation for performance assessment. text-to-video generation, video generation, diffusion models, video frame interpolation, high-fidelity video synthesis
2401.04463 Report D3AD: Dynamic Denoising Diffusion Probabilistic Model for Anomaly Detection Justin Tebbe, Jawad Tayyub Diffusion models have found valuable applications in anomaly detection by capturing the nominal data distribution and identifying anomalies via reconstruction. Despite their merits, they struggle to localize anomalies of varying scales, especially larger anomalies like entire missing components. Addressing this, we present a novel framework that enhances the capability of diffusion models by extending the previously introduced implicit conditioning approach of Meng et al. (2022) in three significant ways. First, we incorporate a dynamic step size computation that allows for variable noising steps in the forward process guided by an initial anomaly prediction. Second, we demonstrate that denoising a merely scaled input, without any added noise, outperforms the conventional denoising process. Third, we project images into a latent space to abstract away from fine details that interfere with reconstruction of large missing components. Additionally, we propose a fine-tuning mechanism that helps the model effectively grasp the nuances of the target domain. Our method undergoes rigorous evaluation on two prominent anomaly detection datasets, VisA and BTAD, yielding state-of-the-art performance. Importantly, our framework effectively localizes anomalies regardless of their scale, marking a pivotal advancement in diffusion-based anomaly detection. This paper proposes D3AD, a novel diffusion model-based anomaly detection framework that enhances anomaly localization by introducing dynamic implicit conditioning, using a noiseless scaled input, and leveraging a latent diffusion model. Existing diffusion models struggle to localize anomalies of varying scales, especially large ones. This work aims to overcome this limitation and improve the accuracy of anomaly detection in industrial settings where accurate localization is crucial. The proposed D3AD method uses a dynamic implicit conditioning mechanism to determine the level of perturbation based on an initial anomaly estimate using KNN distances of domain-adapted features. It avoids initial noising and instead uses a scaled input for improved anomaly segmentation. A latent diffusion model is used to improve efficiency and handle large anomalies. D3AD achieves state-of-the-art anomaly segmentation performance on the VisA benchmark, outperforming previous methods by a significant margin (2.7% higher PRO score). The dynamic implicit conditioning mechanism effectively identifies large anomalies without compromising performance on smaller ones. Ablation studies confirm the individual contributions of domain adaptation, noiseless scaling, and dynamic implicit conditioning to D3AD's performance. The inference speed of D3AD is slower than some existing methods, requiring further optimization for real-time applications. Future work could explore precomputed features and more efficient approximations for anomaly severity to enhance inference speed. anomaly detection, diffusion models, dynamic implicit conditioning, unsupervised learning, computer vision
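The dynamic step-size idea can be illustrated with a small sketch: an initial anomaly estimate, taken from KNN distances to a bank of nominal features, is mapped onto a diffusion timestep so that larger suspected anomalies receive stronger implicit conditioning. The names below (memory_bank, max_step, the linear mapping) are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: KNN-based anomaly estimate -> dynamic diffusion step.
import torch

def knn_anomaly_score(feats: torch.Tensor, memory_bank: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Mean distance to the k nearest nominal features, per query feature."""
    dists = torch.cdist(feats, memory_bank)            # (n_query, n_bank)
    knn, _ = dists.topk(k, dim=1, largest=False)
    return knn.mean(dim=1)

def dynamic_step(score: torch.Tensor, score_min: float, score_max: float, max_step: int = 1000) -> torch.Tensor:
    """Map a normalized anomaly score onto a diffusion timestep in [1, max_step]."""
    norm = ((score - score_min) / (score_max - score_min + 1e-8)).clamp(0.0, 1.0)
    return (norm * (max_step - 1)).long() + 1

if __name__ == "__main__":
    bank = torch.randn(500, 256)                       # nominal (defect-free) features
    query = torch.randn(4, 256) * 2.0                  # test-image features
    score = knn_anomaly_score(query, bank)
    print(dynamic_step(score, score.min().item(), score.max().item()))
```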
2401.04339 Report Memory-Efficient Personalization using Quantized Diffusion Model Hyogon Ryu, Seohyun Lim, Hyunjung Shim The rise of billion-parameter diffusion models like Stable Diffusion XL, Imagen, and Dall-E3 markedly advances the field of generative AI. However, their large-scale nature poses challenges in fine-tuning and deployment due to high resource demands and slow inference speed. This paper ventures into the relatively unexplored yet promising realm of fine-tuning quantized diffusion models. We establish a strong baseline by customizing three models: PEQA for fine-tuning quantization parameters, Q-Diffusion for post-training quantization, and DreamBooth for personalization. Our analysis reveals a notable trade-off between subject and prompt fidelity within the baseline model. To address these issues, we introduce two strategies, inspired by the distinct roles of different timesteps in diffusion models: S1 optimizing a single set of fine-tuning parameters exclusively at selected intervals, and S2 creating multiple fine-tuning parameter sets, each specialized for different timestep intervals. Our approach not only enhances personalization but also upholds prompt fidelity and image quality, significantly outperforming the baseline qualitatively and quantitatively. The code will be made publicly available. This paper addresses the challenge of fine-tuning large, quantized diffusion models for personalization, proposing two novel strategies to improve efficiency and performance. Fine-tuning large diffusion models like Stable Diffusion XL is computationally expensive. This work enables efficient personalization of these models by using quantized (low-precision) weights, saving memory and computation. The authors first establish a baseline by combining Q-Diffusion (for post-training quantization), DreamBooth (for personalization), and PEQA (for fine-tuning quantization parameters). Then, they introduce two strategies: (S1) selective fine-tuning at specific timesteps crucial for learning the target subject and (S2) specialized fine-tuning with multiple parameter sets tailored to different timestep intervals. Both S1 and S2 outperform the baseline in terms of subject fidelity, prompt fidelity, and image quality. S2 generally shows better performance than S1, but requires three times more computation for fine-tuning. The proposed methods achieve comparable performance to full-precision fine-tuning while using quantized weights. Quantization can sometimes lead to unwanted artifacts like cast shadows in generated images. The current method does not support Low-Rank Adaptation (LoRA), which could further enhance model versatility. diffusion models, quantization, personalization, fine-tuning, computer vision
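Strategy S2 described above can be pictured as keeping one set of fine-tuned quantization-scale parameters per timestep interval and selecting the matching set at every denoising step. The module below is only a stand-in sketch of that selection logic under assumed names; it is not the PEQA, Q-Diffusion, or DreamBooth code.

```python
# Minimal sketch of timestep-interval-specialized parameter sets (strategy S2):
# one learnable per-channel scale vector per interval, chosen by the current step.
import torch
import torch.nn as nn

class IntervalSpecializedScales(nn.Module):
    def __init__(self, num_channels: int, num_intervals: int = 3, max_t: int = 1000):
        super().__init__()
        self.max_t = max_t
        self.num_intervals = num_intervals
        # one learnable scale vector per timestep interval
        self.scales = nn.Parameter(torch.ones(num_intervals, num_channels))

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        idx = min(t * self.num_intervals // self.max_t, self.num_intervals - 1)
        return x * self.scales[idx]                    # apply the interval's scales

if __name__ == "__main__":
    layer = IntervalSpecializedScales(num_channels=8)
    h = torch.randn(2, 8)
    for t in (900, 500, 50):                           # early, middle, late timesteps
        print(t, layer(h, t).shape)
```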
2401.04247 Report Robust Image Watermarking using Stable Diffusion Lijun Zhang, Xiao Liu, Antoni Viros Martin, Cindy Xiong Bearfield, Yuriy Brun, Hui Guan Watermarking images is critical for tracking image provenance and claiming ownership. With the advent of generative models, such as stable diffusion, able to create fake but realistic images, watermarking has become particularly important, e.g., to make generated images reliably identifiable. Unfortunately, the very same stable diffusion technology can remove watermarks injected using existing methods. To address this problem, we present ZoDiac, which uses a pre-trained stable diffusion model to inject a watermark into the trainable latent space, resulting in watermarks that can be reliably detected in the latent vector, even when attacked. We evaluate ZoDiac on three benchmarks, MS-COCO, DiffusionDB, and WikiArt, and find that ZoDiac is robust against state-of-the-art watermark attacks, with a watermark detection rate over 98% and a false positive rate below 6.4%, outperforming state-of-the-art watermarking methods. Our research demonstrates that stable diffusion is a promising approach to robust watermarking, able to withstand even stable-diffusion-based attacks. ZoDiac is a novel zero-shot watermarking framework based on stable diffusion that embeds invisible watermarks in the latent space of images, making it robust even to attacks utilizing stable diffusion. Watermarking images is crucial for proving ownership and tracking provenance, especially with the rise of AI-generated content. Existing methods are vulnerable to attacks that leverage generative AI, particularly stable diffusion, to remove watermarks. ZoDiac initializes a latent vector from an image using DDIM inversion, encodes a ring-like watermark in the vector's Fourier space, and optimizes the watermarked vector to generate a perceptually similar image. It then adaptively mixes the watermarked and original images to further improve visual quality. Watermark detection is performed by applying DDIM inversion, Fourier transformation, and statistical testing on the latent vector. ZoDiac achieves high watermark detection rates (above 98%) and low false positive rates (below 6.4%) even against state-of-the-art attacks, outperforming existing methods, especially against stable diffusion-based removal attacks and combined attacks. It maintains high image quality with PSNR > 30dB and SSIM > 0.9, exceeding the quality of the most robust existing method. ZoDiac is flexible and can be applied with different pre-trained stable diffusion backbones while maintaining its effectiveness. The current implementation of ZoDiac is limited to zero-bit watermarking, meaning it can only embed a mark and detect its presence but cannot encode meaningful messages. While ZoDiac shows robustness against most attacks, it is vulnerable to rotation attacks. A proposed solution involves automatically correcting the image orientation before detection, but this increases the false positive rate, necessitating further exploration for a better trade-off. watermarking, stable diffusion, generative ai, robustness, zero-shot
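The core embedding idea, writing a ring-shaped pattern into the Fourier transform of a latent and later checking for it, can be sketched on a toy latent as below. DDIM inversion, the image-quality optimization, the adaptive mixing, and the statistical test are all omitted; the ring radii, strength, and ratio-based detector are illustrative assumptions rather than ZoDiac's actual parameters.

```python
# Minimal sketch: embed and detect a ring-shaped watermark in the 2D FFT of a toy latent.
import torch

def ring_mask(size: int, r_in: float, r_out: float) -> torch.Tensor:
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    c = (size - 1) / 2.0
    r = torch.sqrt((ys - c) ** 2 + (xs - c) ** 2)
    return (r >= r_in) & (r <= r_out)

def embed_ring(latent: torch.Tensor, strength: float = 1000.0) -> torch.Tensor:
    """Write a constant ring into the centered 2D FFT of the latent."""
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    mask = ring_mask(latent.shape[-1], r_in=8, r_out=12)
    freq[..., mask] = strength
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

def ring_score(latent: torch.Tensor) -> float:
    """Ratio of mean |FFT| on the ring to mean |FFT| elsewhere; >> 1 suggests a watermark."""
    freq_abs = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1)).abs()
    mask = ring_mask(latent.shape[-1], r_in=8, r_out=12)
    return (freq_abs[..., mask].mean() / freq_abs[..., ~mask].mean()).item()

if __name__ == "__main__":
    z = torch.randn(1, 64, 64)        # toy stand-in for a diffusion latent channel
    z_wm = embed_ring(z)
    print("watermarked:", round(ring_score(z_wm), 2), "clean:", round(ring_score(z), 2))
```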
2401.04136 Report The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline Haonan Wang, Qianli Shen, Yao Tong, Yang Zhang, Kenji Kawaguchi The commercialization of text-to-image diffusion models (DMs) brings forth potential copyright concerns. Despite numerous attempts to protect DMs from copyright issues, the vulnerabilities of these solutions are underexplored. In this study, we formalized the Copyright Infringement Attack on generative AI models and proposed a backdoor attack method, SilentBadDiffusion, to induce copyright infringement without requiring access to or control over training processes. Our method strategically embeds connections between pieces of copyrighted information and text references in poisoning data while carefully dispersing that information, making the poisoning data inconspicuous when integrated into a clean dataset. Our experiments show the stealth and efficacy of the poisoning data. When given specific text prompts, DMs trained with a poisoning ratio of 0.20% can produce copyrighted images. Additionally, the results reveal that the more sophisticated the DMs are, the easier the success of the attack becomes. These findings underline potential pitfalls in the prevailing copyright protection strategies and underscore the necessity for increased scrutiny to prevent the misuse of DMs. This paper proposes SilentBadDiffusion, a novel backdoor attack to induce copyright infringement in text-to-image diffusion models by poisoning the training data. This work exposes vulnerabilities in current copyright protection strategies relying on access restriction and highlights the need for more robust methods. SilentBadDiffusion dissects copyrighted images into elements, generates non-infringing images with those elements, and trains the model on this poisoned dataset, embedding connections that are triggered by specific prompts. Diffusion models trained on poisoned datasets with a small poisoning ratio (e.g., 0.20%) can generate copyrighted images when prompted with specific triggers. The poisoning data seamlessly blends with clean data, making detection difficult. More advanced diffusion models, with stronger composition abilities, are more susceptible to this attack. The current attack assumes decomposable copyrighted images, future work can explore broader target types. Future research can explore defenses against this attack and investigate theoretical foundations of memorization and generalization in diffusion models. generative ai, diffusion model, data poisoning attack, copyright infringement attack, memorization
2401.04099 Report AGG: Amortized Generative 3D Gaussians for Single Image to 3D Dejia Xu, Ye Yuan, Morteza Mardani, Sifei Liu, Jiaming Song, Zhangyang Wang, Arash Vahdat Given the growing need for automatic 3D content creation pipelines, various 3D representations have been studied to generate 3D objects from a single image. Due to its superior rendering efficiency, 3D Gaussian splatting-based models have recently excelled in both 3D reconstruction and generation. 3D Gaussian splatting approaches for image to 3D generation are often optimization-based, requiring many computationally expensive score-distillation steps. To overcome these challenges, we introduce an Amortized Generative 3D Gaussian framework (AGG) that instantly produces 3D Gaussians from a single image, eliminating the need for per-instance optimization. Utilizing an intermediate hybrid representation, AGG decomposes the generation of 3D Gaussian locations and other appearance attributes for joint optimization. Moreover, we propose a cascaded pipeline that first generates a coarse representation of the 3D data and later upsamples it with a 3D Gaussian super-resolution module. Our method is evaluated against existing optimization-based 3D Gaussian frameworks and sampling-based pipelines utilizing other 3D representations, where AGG showcases competitive generation abilities both qualitatively and quantitatively while being several orders of magnitude faster. Project page: https://ir1d.github.io/AGG/ Introduces AGG, a novel cascaded generative framework that produces 3D Gaussian-based objects from a single image without per-instance optimization. Addresses the growing need for automatic 3D content creation pipelines and the limitations of optimization-based 3D Gaussian generation approaches. Utilizes a hybrid generator for coarse Gaussian prediction, followed by a UNet-based super-resolution module for refinement; decomposes geometry and texture generation for joint optimization. AGG demonstrates competitive generation quality compared to existing optimization-based 3D Gaussian pipelines and sampling-based frameworks. AGG achieves significantly faster inference speeds (several orders of magnitude) compared to baselines. Ablation studies confirm the effectiveness of the proposed hybrid generator and super-resolution module. The number of generated 3D Gaussians is limited for representing highly complex geometry. Future work will focus on extending AGG to handle multiple objects and occlusions. 3d gaussian splatting, image-to-3d generation, amortized generation, hybrid representation, super-resolution
2401.04092 Report GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, Gordon Wetzstein Despite recent advances in text-to-3D generative methods, there is a notable absence of reliable evaluation metrics. Existing metrics usually focus on a single criterion each, such as how well the asset aligns with the input text. These metrics lack the flexibility to generalize to different evaluation criteria and might not align well with human preferences. Conducting user preference studies is an alternative that offers both adaptability and human-aligned results. User studies, however, can be very expensive to scale. This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models. To this end, we first develop a prompt generator using GPT-4V to generate evaluating prompts, which serve as input to compare text-to-3D models. We further design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria. Finally, we use these pairwise comparison results to assign these models Elo ratings. Experimental results suggest that our metric aligns strongly with human preference across different evaluation criteria. This paper introduces an automatic evaluation metric for text-to-3D generative models using GPT-4V, aiming for versatility and human-alignment. Existing metrics often lack flexibility for diverse evaluation criteria and may not align well with human judgment, hindering progress in text-to-3D generation. The method involves a prompt generator creating diverse input prompts and a 3D assets evaluator using GPT-4V to compare generated 3D shapes based on user-defined criteria, ultimately assigning Elo ratings to models. The proposed metric exhibits stronger alignment with human judgment across various evaluation criteria compared to existing metrics like CLIP similarity and PickScore. The method allows for holistic evaluation, revealing relative strengths and weaknesses among different text-to-3D models. The framework can be extended to assess other criteria, such as the diversity of generated 3D assets. The study's scale is limited due to resource constraints, necessitating larger-scale verification. The reliance on GPT-4V introduces potential biases and limitations, requiring mitigation strategies and further investigation. text-to-3d generation, evaluation metrics, gpt-4v, human alignment, 3d shape comparison
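Turning pairwise verdicts into model rankings uses standard Elo updates; the sketch below shows only that final aggregation step on made-up comparison outcomes (the prompt generation and GPT-4V querying are omitted, and the model names are placeholders).

```python
# Minimal sketch: aggregate pairwise comparison outcomes into Elo ratings.
from collections import defaultdict

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> dict:
    """Standard Elo update after one pairwise comparison."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k * (1.0 - expected_win)   # winner gains what it was not expected to win
    ratings[loser] -= k * (1.0 - expected_win)    # loser loses the same amount
    return ratings

if __name__ == "__main__":
    ratings = defaultdict(lambda: 1000.0)
    # (winner, loser) pairs as a judge might output over a batch of prompts
    comparisons = [("model_A", "model_B"), ("model_A", "model_C"),
                   ("model_B", "model_C"), ("model_A", "model_B")]
    for w, l in comparisons:
        update_elo(ratings, w, l)
    print({m: round(r, 1) for m, r in ratings.items()})
```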
2401.03890 Report A Survey on 3D Gaussian Splatting Guikun Chen, Wenguan Wang 3D Gaussian splatting (GS) has recently emerged as a transformative technique in the realm of explicit radiance field and computer graphics. This innovative approach, characterized by the utilization of millions of learnable 3D Gaussians, represents a significant departure from mainstream neural radiance field approaches, which predominantly use implicit, coordinate-based models to map spatial coordinates to pixel values. 3D GS, with its explicit scene representation and differentiable rendering algorithm, not only promises real-time rendering capability but also introduces unprecedented levels of editability. This positions 3D GS as a potential game-changer for the next generation of 3D reconstruction and representation. In the present paper, we provide the first systematic overview of the recent developments and critical contributions in the domain of 3D GS. We begin with a detailed exploration of the underlying principles and the driving forces behind the emergence of 3D GS, laying the groundwork for understanding its significance. A focal point of our discussion is the practical applicability of 3D GS. By enabling unprecedented rendering speed, 3D GS opens up a plethora of applications, ranging from virtual reality to interactive media and beyond. This is complemented by a comparative analysis of leading 3D GS models, evaluated across various benchmark tasks to highlight their performance and practical utility. The survey concludes by identifying current challenges and suggesting potential avenues for future research in this domain. Through this survey, we aim to provide a valuable resource for both newcomers and seasoned researchers, fostering further exploration and advancement in applicable and explicit radiance field representation. This paper presents the first comprehensive survey of 3D Gaussian splatting (3D GS), a novel technique for scene representation and rendering that utilizes millions of learnable 3D Gaussians. 3D GS represents a paradigm shift from implicit neural radiance field methods like NeRF, offering advantages such as real-time rendering capabilities and unprecedented editability. The paper discusses the principles of 3D GS, including its forward process (splatting, rendering) and optimization process (parameter optimization, density control). It also explores various extensions of 3D GS, such as data-efficient and memory-efficient approaches, as well as applications in robotics, dynamic scene reconstruction, and AI-generated content. 3D GS based methods achieve state-of-the-art performance in various tasks, including localization, rendering quality (static and dynamic scenes), human avatar modeling, and surgical 3D reconstruction. 3D GS demonstrates significant advantages in terms of both accuracy and speed compared to NeRF based methods, particularly for applications requiring real-time performance. The explicit representation of 3D GS enables easier manipulation and editing of scenes, opening up new possibilities in content creation and scene understanding. Current 3D GS techniques face challenges in modeling internal structures of objects and handling large-scale scene reconstruction. Further research is needed to explore the full potential of 3D GS in robotics, simulation, and other emerging applications. 3d gaussian splatting, explicit radiance field, real-time rendering, scene understanding, neural rendering
2401.03854 Report TIER: Text-Image Encoder-based Regression for AIGC Image Quality Assessment Jiquan Yuan, Xinyan Cao, Jinming Che, Qinyuan Wang, Sen Liang, Wei Ren, Jinlong Lin, Xixin Cao Recently, AIGC image quality assessment (AIGCIQA), which aims to assess the quality of AI-generated images (AIGIs) from a human perception perspective, has emerged as a new topic in computer vision. Unlike common image quality assessment tasks where images are derived from original ones distorted by noise, blur, compression, etc., in AIGCIQA tasks, images are typically generated by generative models using text prompts. Considerable efforts have been made in the past years to advance AIGCIQA. However, most existing AIGCIQA methods regress predicted scores directly from individual generated images, overlooking the information contained in the text prompts of these images. This oversight partially limits the performance of these AIGCIQA methods. To address this issue, we propose a text-image encoder-based regression (TIER) framework. Specifically, we process the generated images and their corresponding text prompts as inputs, utilizing a text encoder and an image encoder to extract features from these text prompts and generated images, respectively. To demonstrate the effectiveness of our proposed TIER method, we conduct extensive experiments on several mainstream AIGCIQA databases, including AGIQA-1K, AGIQA-3K, and AIGCIQA2023. The experimental results indicate that our proposed TIER method generally demonstrates superior performance compared to baselines in most cases. This paper introduces TIER, a text-image encoder-based regression framework for AIGC image quality assessment that leverages information from both generated images and their text prompts. Existing AIGCIQA methods often fail to consider the valuable information present in the text prompts, limiting their assessment accuracy. TIER utilizes a text encoder (BERT) to extract features from text prompts and an image encoder (ResNet, InceptionV4) to extract features from generated images. These features are concatenated and fed into a regression network to predict the quality score. TIER generally outperforms baseline methods that ignore text prompt information on AGIQA-1K, AGIQA-3K, and AIGCIQA2023 datasets. The framework shows particular promise in predicting quality and authenticity scores. Performance improvement for correspondence scores is not always guaranteed, suggesting a need for better understanding of the image-text relationship in certain cases. The method's performance in predicting correspondence scores can be inconsistent. Future work could explore more sophisticated methods for fusing text and image features. aigc, aigciqa, image quality assessment, text encoder, image encoder
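The regression head described above amounts to concatenating text-encoder and image-encoder features and mapping them to a score. The sketch below shows that head in isolation with random stand-in features; in the paper the features would come from BERT and a CNN backbone such as ResNet, and the hidden size and loss here are assumptions.

```python
# Minimal sketch of a TIER-style text+image feature regression head.
import torch
import torch.nn as nn

class TextImageRegressor(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                      # predicted quality score
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([text_feat, image_feat], dim=-1)).squeeze(-1)

if __name__ == "__main__":
    model = TextImageRegressor()
    text_feat = torch.randn(4, 768)                    # e.g., pooled BERT features of prompts
    image_feat = torch.randn(4, 2048)                  # e.g., ResNet features of generated images
    mos = torch.rand(4) * 5.0                          # toy mean-opinion-score targets
    loss = nn.functional.mse_loss(model(text_feat, image_feat), mos)
    loss.backward()
    print(float(loss))
```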
2401.03707 Report FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring Geunhyuk Youk, Jihyong Oh, Munchurl Kim We present a joint learning scheme of video super-resolution and deblurring, called VSRDB, to restore clean high-resolution (HR) videos from blurry low-resolution (LR) ones. This joint restoration problem has drawn much less attention compared to single restoration problems. In this paper, we propose a novel flow-guided dynamic filtering (FGDF) and iterative feature refinement with multi-attention (FRMA), which constitute our VSRDB framework, denoted as FMA-Net. Specifically, our proposed FGDF enables precise estimation of both spatio-temporally-variant degradation and restoration kernels that are aware of motion trajectories through sophisticated motion representation learning. Compared to conventional dynamic filtering, the FGDF enables the FMA-Net to effectively handle large motions in VSRDB. Additionally, the stacked FRMA blocks trained with our novel temporal anchor (TA) loss, which temporally anchors and sharpens features, refine features in a coarse-to-fine manner through iterative updates. Extensive experiments demonstrate the superiority of the proposed FMA-Net over state-of-the-art methods in terms of both quantitative and qualitative quality. Codes and pre-trained models are available at: https://kaist-viclab.github.io/fmanet-site This paper proposes FMA-Net, a novel framework for Video Super-Resolution and Deblurring (VSRDB) that effectively handles small-to-large motion and restores clean, high-resolution videos from blurry, low-resolution inputs. VSRDB is crucial for enhancing video quality in real-world scenarios where videos often suffer from blur due to camera shake or object motion. Existing methods struggle to effectively address spatio-temporally variant degradations, limiting their performance. FMA-Net leverages Flow-Guided Dynamic Filtering (FGDF) and Iterative Feature Refinement with Multi-Attention (FRMA) to learn motion-aware degradation kernels and iteratively refine features for joint restoration. It employs a two-stage training strategy, pre-training a degradation learning network followed by joint training with a restoration network. FMA-Net significantly outperforms state-of-the-art VSR and deblurring methods, achieving notable PSNR, SSIM, and tOF improvements on REDS4, GoPro, and YouTube datasets. The proposed FGDF mechanism proves highly effective in handling large motions, leading to significant performance gains over conventional dynamic filtering. Ablation studies confirm the contribution of each component in FMA-Net, highlighting the effectiveness of multi-flow-mask pairs, temporal anchor loss, and the multi-attention module. FMA-Net currently uses a two-stage training strategy, which requires additional training time compared to an end-to-end approach. Handling extreme conditions like object rotation remains challenging due to the difficulty in predicting accurate optical flow in such scenarios. Future work could explore incorporating learnable homography parameters or quaternion representations to address this. video super-resolution, video deblurring, joint restoration, dynamic filtering, optical flow
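FGDF builds on per-pixel dynamic filtering, where a network predicts a separate k x k kernel for every output location and applies it to the input. The sketch below shows only that base operation on toy tensors; the flow-guided sampling of filter taps along motion trajectories, which is the paper's key addition, is omitted, and all shapes are illustrative.

```python
# Minimal sketch of per-pixel dynamic filtering (the operation FGDF extends with flow guidance).
import torch
import torch.nn.functional as F

def dynamic_filtering(x: torch.Tensor, kernels: torch.Tensor, k: int = 3) -> torch.Tensor:
    """x: (B, C, H, W); kernels: (B, k*k, H, W), assumed normalized per pixel."""
    b, c, h, w = x.shape
    patches = F.unfold(x, kernel_size=k, padding=k // 2)       # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h, w)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)         # (B, C, H, W)

if __name__ == "__main__":
    x = torch.randn(2, 3, 32, 32)                              # toy input frames
    logits = torch.randn(2, 9, 32, 32)                         # would be predicted by a network
    kernels = torch.softmax(logits, dim=1)                     # per-pixel 3x3 kernels
    print(dynamic_filtering(x, kernels).shape)                 # torch.Size([2, 3, 32, 32])
```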
2401.03433 Report SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing Songyan Chen, Jiancheng Huang Text-conditional image editing based on large diffusion generative models has attracted the attention of both the industry and the research community. Most existing methods are non-reference editing, with the user only able to provide a source image and text prompt. However, this restricts the user's control over the characteristics of the editing outcome. To increase user freedom, we propose a new task called Specific Reference Condition Real Image Editing, which allows the user to provide a reference image to further control the outcome, such as replacing an object with a particular one. To accomplish this, we propose a fast baseline method named SpecRef. Specifically, we design a Specific Reference Attention Controller to incorporate features from the reference image, and adopt a mask mechanism to prevent interference between editing and non-editing regions. We evaluate SpecRef on typical editing tasks and show that it can achieve satisfactory performance. The source code is available at https://github.com/jingjiqinggong/specp2p. This paper introduces "Specific Reference Condition Real Image Editing," a new image editing task that allows users to control editing outcomes by providing a reference image, and proposes SpecRef, a fast, training-free baseline method for this task. Existing non-reference editing methods limit user control as they only allow inputting a source image and text prompts, restricting users from specifying the desired characteristics of the edited output. SpecRef extracts reference features from the self-attention layers of a pre-trained Stable Diffusion model during the inversion of the reference image. It then incorporates these features into the editing process using a Specific Reference Attention Layer (SR-attn) that blends features from the source and reference images based on attention masks, guiding the generation towards the reference while preserving the unedited parts of the source image. SpecRef successfully edits images based on both text prompts and specific reference images. It effectively addresses the limitations of non-reference editing by allowing users to specify the desired appearance of edited objects or regions. Experiments demonstrate SpecRef's ability to perform various editing tasks like object replacement, clothing replacement, and scene replacement with promising results. SpecRef may fail when the reference image region significantly differs in size or shape from the source image's editing region, leading to unnatural results. The reliance on cross-attention for transferring features from the reference to the edited image can cause issues when the spatial relationship between objects in both images is dissimilar. aigc, image editing, diffusion models, stable diffusion, reference image editing
2401.03257 Report RustNeRF: Robust Neural Radiance Field with Low-Quality Images Mengfei Li, Ming Lu, Xiaofang Li, Shanghang Zhang Recent work on Neural Radiance Fields (NeRF) exploits multi-view 3D consistency, achieving impressive results in 3D scene modeling and high-fidelity novel-view synthesis. However, there are limitations. First, existing methods assume enough high-quality images are available for training the NeRF model, ignoring real-world image degradation. Second, previous methods struggle with ambiguity in the training set due to unmodeled inconsistencies among different views. In this work, we present RustNeRF for real-world high-quality NeRF. To improve NeRF's robustness under real-world inputs, we train a 3D-aware preprocessing network that incorporates real-world degradation modeling. We propose a novel implicit multi-view guidance to address information loss during image degradation and restoration. Extensive experiments demonstrate RustNeRF's advantages over existing approaches under real-world degradation. The code will be released. This paper introduces RustNeRF, a robust Neural Radiance Field (NeRF) framework designed to handle low-quality, degraded images for high-fidelity novel view synthesis. Existing NeRF methods struggle with real-world image degradations, leading to unsatisfactory novel views. RustNeRF aims to improve the robustness of NeRF in real-world scenarios with degraded image sets. RustNeRF utilizes a 3D-aware preprocessing network trained on a synthetic dataset with simulated real-world degradations. It employs a view selection mechanism to gather relevant information from neighboring views for restoring the target view. Additionally, it introduces an implicit multi-view guidance technique that casts multiple rays within a pixel to leverage information from different views, further enhancing details in the reconstructed scene. RustNeRF demonstrates superior performance compared to baseline NeRF models, particularly DVGO and Instant-NGP, on benchmark datasets like Blender and LLFF, exhibiting significant improvements in PSNR, SSIM, and LPIPS metrics. The proposed 3D-aware restoration network effectively reduces artifacts and improves the overall quality of the reconstructed scene compared to using single-view restoration or off-the-shelf solutions like Real-ESRGAN. Implicit multi-view guidance, coupled with quadtree acceleration to manage computational cost, further enhances details and reduces noise in the rendered views, especially in regions with high-frequency information. The current implementation of RustNeRF does not incorporate bundle adjustment to handle potential camera pose inaccuracies caused by degraded input images. The degradation model used for training the restoration network relies on a combination of classical degradation models and could benefit from further exploration and refinement to better simulate complex real-world degradation processes. neural radiance fields, nerf, image restoration, novel view synthesis, 3d scene reconstruction
2401.03253 Report VLLaVO: Mitigating Visual Gap through LLMs Shuhao Chen, Yulong Zhang, Weisen Jiang, Jiangang Lu, Yu Zhang Recent advances achieved by deep learning models rely on the independent and identically distributed assumption, hindering their applications in real-world scenarios with domain shifts. To tackle this issue, cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data. However, in visual cross-domain learning, traditional methods concentrate solely on the image modality, disregarding the potential benefits of incorporating the text modality. In this work, we propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners. VLLaVO uses vision-language models to convert images into detailed textual descriptions. A large language model is then finetuned on textual descriptions of the source/target domain generated by a designed instruction template. Extensive experimental results under domain generalization and unsupervised domain adaptation settings demonstrate the effectiveness of the proposed method. VLLaVO, a novel approach that integrates Vision Language Models (VLMs) and Large Language Models (LLMs) for addressing visual domain shifts in cross-domain learning. Existing cross-domain learning methods often solely rely on image modality, neglecting the potential of text modality. This paper explores leveraging the power of LLMs for improved domain-invariant feature learning in visual tasks. VLLaVO first utilizes VLMs to transform images into textual descriptions (tags, attributes, captions). Subsequently, an LLM is fine-tuned with a designed instruction template using these descriptions paired with image labels, enabling it to perform classification based on textual input. VLLaVO consistently achieves state-of-the-art performance on benchmark datasets for both Domain Generalization (DG) and Unsupervised Domain Adaptation (UDA) tasks. The method demonstrates superior zero-shot learning capability, outperforming zero-shot LLM baselines in cross-dataset evaluations. Analysis reveals that VLLaVO effectively learns domain-invariant features, as evidenced by t-SNE visualizations and sensitivity analysis, focusing on relevant keywords while mitigating domain-specific biases. The quality of extracted textual descriptions depends on the VLM's capabilities and can be further improved. The current work focuses on visual classification, limiting its applicability to other visual tasks. Future research should explore extending VLLaVO to address domain shifts in tasks like segmentation or depth estimation. domain generalization, unsupervised domain adaptation, large language models, vision language models, cross-domain learning
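The instruction-tuning samples fed to the LLM are assembled from the VLM-derived descriptions (tags, attributes, caption) plus the candidate class labels. The sketch below shows one plausible way to build such a sample; the exact template wording and field names are assumptions for illustration, not the authors' template.

```python
# Minimal sketch: build an instruction-tuning sample from VLM-derived image descriptions.
def build_instruction(tags, attributes, caption, candidate_labels, label=None):
    question = (
        "Below is a description of an image.\n"
        f"Tags: {', '.join(tags)}\n"
        f"Attributes: {', '.join(attributes)}\n"
        f"Caption: {caption}\n"
        f"Which category does the image belong to? Options: {', '.join(candidate_labels)}."
    )
    sample = {"instruction": question}
    if label is not None:                      # labeled (source-domain) sample for finetuning
        sample["response"] = label
    return sample

if __name__ == "__main__":
    print(build_instruction(
        tags=["dog", "grass", "ball"],
        attributes=["furry", "running", "outdoor"],
        caption="A small dog chases a ball across a lawn.",
        candidate_labels=["dog", "cat", "horse"],
        label="dog",
    ))
```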
2401.03201 Report 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding Zeju Li, Chao Zhang, Xiaoyan Wang, Ruilong Ren, Yifan Xu, Ruifei Ma, Xiangde Liu The remarkable potential of multi-modal large language models (MLLMs) in comprehending both vision and language information has been widely acknowledged. However, the scarcity of 3D scenes-language pairs in comparison to their 2D counterparts, coupled with the inadequacy of existing approaches in understanding of 3D scenes by LLMs, poses a significant challenge. In response, we collect and construct an extensive dataset comprising 75K instruction-response pairs tailored for 3D scenes. This dataset addresses tasks related to 3D VQA, 3D grounding, and 3D conversation. To further enhance the integration of 3D spatial information into LLMs, we introduce a novel and efficient prompt tuning paradigm, 3DMIT. This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information including the entire scene and segmented objects. We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain and find that our approach serves as a strategic means to enrich LLMs' comprehension of the 3D world. Our code is available at https://github.com/staymylove/3DMIT. This paper proposes 3DMIT, an efficient 3D multi-modal instruction tuning framework designed to train LLMs in understanding 3D scenes by leveraging global scene and fine-grained object information, without requiring an alignment stage. Existing methods for 3D scene understanding with LLMs are limited by the scarcity of 3D scene-language data and the inefficiency of aligning 3D data with text. The authors construct a 75K 3D scene-language instruction dataset and propose 3DMIT, which directly incorporates 3D scene and object features extracted by pre-trained encoders into the instruction prompt for LLM fine-tuning. 3DMIT outperforms 3D-LLMs without alignment and achieves comparable results to 3D-LLMs with alignment on 3D VQA. 3DMIT demonstrates promising performance on 3D visual grounding, outperforming methods without alignment. The ablation study shows the benefits of using pre-trained 3D object encoders and incorporating multi-view image tokens for MLLMs. LLMs still face challenges in numerical and computational tasks, limiting their performance on tasks requiring precise spatial understanding. Further research is needed to explore how to effectively incorporate spatial location information into LLMs for improved 3D grounding. multi-modal, 3d-llms, 3d scene understanding, instruction tuning, prompt engineering
2401.03140 Report Fair Sampling in Diffusion Models through Switching Mechanism Yujin Choi, Jinseong Park, Hoki Kim, Jaewook Lee, Saeroom Park Diffusion models have shown their effectiveness in generation tasks by well-approximating the underlying probability distribution. However, diffusion models are known to suffer from an amplified inherent bias from the training data in terms of fairness. While the sampling process of diffusion models can be controlled by conditional guidance, previous works have attempted to find empirical guidance to achieve quantitative fairness. To address this limitation, we propose a fairness-aware sampling method called attribute switching mechanism for diffusion models. Without additional training, the proposed sampling can obfuscate sensitive attributes in generated data without relying on classifiers. We mathematically prove and experimentally demonstrate the effectiveness of the proposed method on two key aspects: (i) the generation of fair data and (ii) the preservation of the utility of the generated data. This paper proposes "attribute switching," a sampling method for diffusion models that enhances fairness without requiring additional training or classifiers. Diffusion models, while effective, can amplify biases present in training data. Existing fairness solutions often rely on classifiers or are computationally expensive. This method addresses these limitations by aiming for distributional fairness, ensuring generated data is independent of sensitive attributes like race or gender. The method leverages the finding that diffusion models learn features at different sampling stages. It switches the sensitive attribute condition at a specific transition point during sampling, transferring high-level features from one attribute to another. A theoretical condition for finding this optimal transition point is provided and validated empirically. The method successfully generates data satisfying epsilon-fairness, showing comparable performance to true data on fairness benchmarks. It preserves data utility, exhibiting similar FID scores to standard diffusion model sampling, indicating the generation of high-quality samples. The approach is effective across various pre-trained diffusion models, including those conditioned on text prompts, showcasing its versatility. While the method preserves high-level image features, elements with strong contextual connections might be unintentionally removed, requiring further investigation. The study primarily focuses on distributional fairness. Exploring fairness in a broader context, considering various factors beyond distribution, is crucial for a holistic understanding. diffusion models, fairness, generative models, sampling methods, attribute switching
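To make the attribute-switching sampler in the entry above concrete, the following is a minimal sketch of a DDPM ancestral sampling loop that conditions on one sensitive attribute until a transition step tau and on the other afterwards. The ToyCondDenoiser, the linear beta schedule, and the fixed value of tau are illustrative assumptions, not the paper's model or its theoretically derived transition point.

```python
import torch
import torch.nn as nn

class ToyCondDenoiser(nn.Module):
    """Stand-in epsilon-predictor conditioned on a sensitive attribute id."""
    def __init__(self, dim=16, n_attrs=2):
        super().__init__()
        self.attr_emb = nn.Embedding(n_attrs, dim)
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x, t, attr):
        t_feat = t.float().view(-1, 1) / 1000.0
        return self.net(torch.cat([x, self.attr_emb(attr), t_feat], dim=-1))

@torch.no_grad()
def sample_with_attribute_switching(model, attr_a, attr_b, tau, T=1000, dim=16, n=8):
    """DDPM ancestral sampling that conditions on attr_a for t > tau and on attr_b afterwards."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(n, dim)
    for t in reversed(range(T)):
        attr = attr_a if t > tau else attr_b              # the attribute switch
        t_batch = torch.full((n,), t, dtype=torch.long)
        a = torch.full((n,), attr, dtype=torch.long)
        eps = model(x, t_batch, a)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

samples = sample_with_attribute_switching(ToyCondDenoiser(), attr_a=0, attr_b=1, tau=600)
print(samples.shape)
```

In the paper the transition point is chosen from a theoretical condition; here it is simply passed in as a hyperparameter.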
2401.03048 Report Latte: Latent Diffusion Transformer for Video Generation Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation. This paper introduces Latte, a novel Latent Diffusion Transformer for video generation, featuring a video Transformer backbone and four efficient model variants for capturing spatio-temporal video distribution. Generating high-quality videos is challenging due to their complex spatio-temporal information and high dimensionality. Latte explores the potential of Transformer-based latent diffusion models for realistic video generation. Latte leverages a pre-trained VAE for encoding videos into latent space tokens, processed by Transformer blocks. Four variants explore decomposing spatial and temporal dimensions for efficient information capture. Extensive empirical analysis identifies best practices for Latte's components. Latte achieves state-of-the-art performance on four video generation benchmarks, including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. Comprehensive ablation studies reveal optimal design choices for Transformer-based video diffusion models, such as uniform frame patch embedding, S-AdaLN for timestep/class information injection, and absolute temporal positional embedding. Latte demonstrates promising results for text-to-video generation, comparable to existing methods like VideoFusion and VideoLDM. Exploring the impact of different pre-trained video Transformers on Latte's performance. Investigating alternative methods for temporal information injection within the Transformer architecture. video generation, diffusion models, transformers, latent space, text-to-video generation
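One of the decomposition variants summarized above alternates attention within each frame with attention across frames at the same spatial location, which can be sketched by reshaping the token tensor between two standard attention calls. The block below is a simplified stand-in (the shapes, norm placement, and the omission of MLP, timestep, and class conditioning are assumptions), not Latte's actual implementation.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalBlock(nn.Module):
    """Spatial self-attention within each frame, then temporal self-attention across frames."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, tokens_per_frame, dim) latent patch tokens
        b, f, n, d = x.shape
        xs = x.reshape(b * f, n, d)                       # attend within each frame
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        xt = xs.reshape(b, f, n, d).permute(0, 2, 1, 3).reshape(b * n, f, d)  # attend across frames
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        return xt.reshape(b, n, f, d).permute(0, 2, 1, 3)  # back to (b, f, n, d)

x = torch.randn(2, 8, 16, 64)   # 2 videos, 8 frames, 16 latent tokens per frame
print(FactorizedSpatioTemporalBlock()(x).shape)
```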
2401.02957 Report Denoising Vision Transformers Jiawei Yang, Katie Z Luo, Jiefeng Li, Kilian Q Weinberger, Yonglong Tian, Yue Wang We delve into a nuanced but significant challenge inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which detrimentally hurt the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue down to the positional embeddings at the input stage. To address this, we propose a novel noise model, which is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. Such a decomposition is achieved by enforcing cross-view feature consistency with neural fields in a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean features for offline applications. Expanding the scope of our solution to support online functionality, we introduce a learnable denoiser to predict artifact-free features directly from unprocessed ViT outputs, which shows remarkable generalization capabilities to novel data without the need for per-image optimization. Our two-stage approach, termed Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Identifies and addresses the issue of noise artifacts in Vision Transformer (ViT) features, attributing them to positional embeddings and proposing a two-stage denoising approach called Denoising Vision Transformers (DVT). These artifacts hinder feature interpretability, disrupt semantic coherence, and negatively impact the performance of ViTs in downstream tasks. A novel noise model decomposes ViT outputs into semantic and artifact components. A per-image denoising technique using neural fields extracts artifact-free features, and a generalizable denoiser network is trained for real-time inference. Noise artifacts are prevalent in ViT features across various training algorithms. DVT effectively removes artifacts, leading to visually cleaner feature maps and enhanced semantic coherence. Significant performance improvements are observed in downstream tasks like semantic segmentation and depth prediction after denoising. The fundamental reason for the existence of these artifacts and their variation across training algorithms is not fully understood. Exploring alternative positional embedding approaches and ViT architectures could further mitigate artifacts. vision transformers, feature denoising, positional embeddings, neural fields, dense prediction tasks
2401.02955 Report Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy The CLIP and Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into the CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naive baselines of simply combining SAM and CLIP. Furthermore, aided with image classification data training, our method can segment and recognize approximately 22,000 classes. This paper introduces Open-Vocabulary SAM, a model that unifies the segmentation prowess of SAM with the zero-shot recognition capabilities of CLIP for simultaneous interactive segmentation and object recognition. Existing methods for combining SAM and CLIP are computationally expensive and struggle to recognize small objects. Open-Vocabulary SAM aims to address these limitations with a unified architecture and knowledge transfer modules. The paper proposes a unified encoder-decoder framework with two novel modules: SAM2CLIP (distills knowledge from SAM encoder to CLIP encoder) and CLIP2SAM (transfers CLIP knowledge to the SAM decoder for recognition). Open-Vocabulary SAM outperforms naive baselines, achieving over 2% improvement in IoU and 3% in mAP on COCO. The method demonstrates significant gains in recognizing small objects, achieving over 20% accuracy improvement on LVIS. Trained on a large dataset, Open-Vocabulary SAM can segment and recognize approximately 22,000 classes, acting as an effective interactive annotation tool. The model's performance slightly decreases when used with less robust detectors. Future work will explore using coarse masks or language descriptions as interactive prompts. open-vocabulary learning, interactive segmentation, object recognition, vision foundation models, knowledge distillation
2401.02739 Report Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors Top Piriyakulkij, Yingheng Wang, Volodymyr Kuleshov We propose denoising diffusion variational inference (DDVI), an approximate inference algorithm for latent variable models which relies on diffusion models as flexible variational posteriors. Specifically, our method introduces an expressive class of approximate posteriors with auxiliary latent variables that perform diffusion in latent space by reversing a user-specified noising process. We fit these models by optimizing a lower bound on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks. It increases the expressivity of flow-based methods via non-invertible deep recurrent architectures and avoids the instability of adversarial methods. We use DDVI on a motivating task in biology -- inferring latent ancestry from human genomes -- and we find that it outperforms strong baselines on the Thousand Genomes dataset. The paper proposes Denoising Diffusion Variational Inference (DDVI), a new variational inference algorithm that uses diffusion models as flexible variational posteriors. DDVI enhances variational inference by introducing auxiliary latent variables and a user-specified noising process, leading to more expressive approximate posteriors and tighter bounds on the marginal likelihood. DDVI leverages a wake-sleep inspired lower bound, optimizing via alternating ELBO optimization and 'sleep' steps to reverse the noising process and fit the approximate posterior. DDVI outperforms baselines like AEVB, AEVB-IAF, and AAEB in unsupervised learning tasks on MNIST and CIFAR-10 with various complex priors. In semi-supervised settings, DDVI achieves strong performance in classification accuracy and ELBO on both MNIST and CIFAR-10. Applied to genotype analysis on the 1000 Genomes dataset, DDVI demonstrates superior clustering performance compared to baselines. While showing promise in dimensionality reduction and visualization, the paper acknowledges potential limitations in likelihood estimation compared to traditional methods. Further exploration of architectural improvements is needed to enhance performance in density estimation and sample quality tasks. variational inference, diffusion models, expressive posteriors, wake-sleep algorithm, genotype analysis
2401.02677 Report Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick Von Platen Stable Diffusion XL (SDXL) has become the best open source text-to-image model (T2I) for its versatility and top-notch image quality. Efficiently addressing the computational demands of SDXL models is crucial for wider reach and applicability. In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, achieved through progressive removal using layer-level losses focusing on reducing the model size while preserving generative quality. We release these models weights at https://hf.co/Segmind. Our methodology involves the elimination of residual networks and transformer blocks from the U-Net structure of SDXL, resulting in significant reductions in parameters, and latency. Our compact models effectively emulate the original SDXL by capitalizing on transferred knowledge, achieving competitive results against larger multi-billion parameter SDXL. Our work underscores the efficacy of knowledge distillation coupled with layer-level losses in reducing model size while preserving the high-quality generative capabilities of SDXL, thus facilitating more accessible deployment in resource-constrained environments. Introduces SSD-1B and Segmind-Vega, scaled-down variants of Stable Diffusion XL (SDXL) with 1.3B and 0.74B parameter UNets respectively, achieved through progressive layer removal and knowledge distillation. Addresses the computational demands of SDXL, making it more accessible for wider reach and applicability in resource-constrained environments. Employs progressive layer removal from the SDXL U-Net, guided by layer-level losses and knowledge distillation from multiple teacher models (SDXL base, Zavychroma-XL, Juggernaut-XL). SSD-1B and Segmind-Vega achieve competitive image generation quality compared to the full SDXL model. Inference speedup of up to 60% for SSD-1B and 100% for Segmind-Vega. Human preference study shows SSD-1B is marginally preferred over SDXL despite its smaller size. Limitations in generating specific image elements like text, hands, and full-body shots. Future work includes exploring the technique on other large models like LLMs and MLMs. stable diffusion, sdxl, model compression, knowledge distillation, text-to-image synthesis
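The layer-level distillation described above can be illustrated by matching a pruned student's intermediate activations and output to a frozen teacher. The toy MLP stand-ins for the U-Nets, the tap index, and the loss weights below are assumptions for illustration, not the SDXL setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(32, 64), nn.SiLU(), nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 32))
student = nn.Sequential(nn.Linear(32, 64), nn.SiLU(), nn.Linear(64, 32))   # one block removed

def forward_with_taps(model, x, tap_indices):
    """Run a sequential model, recording activations at the given layer indices."""
    taps = []
    for i, layer in enumerate(model):
        x = layer(x)
        if i in tap_indices:
            taps.append(x)
    return x, taps

def distillation_loss(x, task_loss_fn, feat_weight=0.5, out_weight=0.5):
    with torch.no_grad():
        t_out, t_taps = forward_with_taps(teacher, x, tap_indices={1})
    s_out, s_taps = forward_with_taps(student, x, tap_indices={1})
    feat_loss = sum(F.mse_loss(s, t) for s, t in zip(s_taps, t_taps))   # layer-level loss
    out_loss = F.mse_loss(s_out, t_out)                                 # output-level loss
    return task_loss_fn(s_out) + feat_weight * feat_loss + out_weight * out_loss

x = torch.randn(4, 32)
loss = distillation_loss(x, task_loss_fn=lambda y: y.pow(2).mean())     # placeholder task loss
loss.backward()
```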
2401.02620 Report Progress and Prospects in 3D Generative AI: A Technical Overview including 3D human Song Bai, Jie Li While AI-generated text and 2D images continue to expand its territory, 3D generation has gradually emerged as a trend that cannot be ignored. Since the year 2023 an abundant amount of research papers has emerged in the domain of 3D generation. This growth encompasses not just the creation of 3D objects, but also the rapid development of 3D character and motion generation. Several key factors contribute to this progress. The enhanced fidelity in stable diffusion, coupled with control methods that ensure multi-view consistency, and realistic human models like SMPL-X, contribute synergistically to the production of 3D models with remarkable consistency and near-realistic appearances. The advancements in neural network-based 3D storing and rendering models, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have accelerated the efficiency and realism of neural rendered models. Furthermore, the multimodality capabilities of large language models have enabled language inputs to transcend into human motion outputs. This paper aims to provide a comprehensive overview and summary of the relevant papers published mostly during the latter half year of 2023. It will begin by discussing the AI generated object models in 3D, followed by the generated 3D human models, and finally, the generated 3D human motions, culminating in a conclusive summary and a vision for the future. This paper presents a comprehensive overview of recent advancements in AI-powered 3D content generation, focusing on object and human model creation, and human motion synthesis. The rapid progress in AI-generated 3D content is transforming various fields, including gaming, entertainment, and education, by enabling faster and more efficient creation of realistic 3D assets. The paper reviews various techniques, including diffusion models, neural radiance fields (NeRF), 3D Gaussian Splatting (3DGS), and large language models, highlighting their applications and limitations. Recent models achieve high-fidelity 3D generation with resolutions up to 8K, and some methods can generate models in seconds. AI-powered human model generation leverages models like SMPL-X for realistic results, with advancements in both iterative and single-pass generation methods. Human motion synthesis has seen progress in generating complex movements from text and interacting with objects, though challenges remain in achieving perfect realism. The lack of large, diverse 3D datasets compared to 2D image datasets limits the generalization capabilities of some models. Precise control and realism in human-object interaction animations are still areas for improvement. aigc, generative ai, text-to-3d, 3d generation, metaverse
2401.02473 Report VASE: Object-Centric Appearance and Shape Manipulation of Real Videos Elia Peruzzo, Vidit Goel, Dejia Xu, Xingqian Xu, Yifan Jiang, Zhangyang Wang, Humphrey Shi, Nicu Sebe Recently, several works tackled the video editing task fostered by the success of large-scale text-to-image generative models. However, most of these methods holistically edit the frame using the text, exploiting the prior given by foundation diffusion models and focusing on improving the temporal consistency across frames. In this work, we introduce a framework that is object-centric and is designed to control both the object's appearance and, notably, to execute precise and explicit structural modifications on the object. We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control. We evaluate our method on the image-driven video editing task showing similar performance to the state-of-the-art, and showcasing novel shape-editing capabilities. Further details, code and examples are available on our project page: https://helia95.github.io/vase-website/ Introduces VASE, a framework for object-centric video editing that enables both appearance and structural modifications to objects in real videos using a single keyframe. Existing video editing methods often lack the granularity for object-centric edits, struggle to capture precise nuances from text prompts, and rarely offer explicit control over object structure. Leverages a pre-trained image-conditioned diffusion model with temporal layers, a ControlNet for motion and structure guidance, a Joint Flow-Structure Augmentation pipeline, a Flow-Completion Network, and an Auxiliary Segmentation Head. Achieves high-quality appearance editing comparable to state-of-the-art methods. Demonstrates precise and user-controlled shape editing capabilities. Maintains temporal consistency while allowing for efficient editing without per-video training or complex video decomposition. Performance can be affected by strong occlusions or significant perspective changes. Maintaining consistent edits in very long videos remains a challenge. video editing, diffusion models, object-centric, shape editing, appearance editing
2401.02436 Report Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis Simon Niedermayr, Josef Stumpfegger, Rüdiger Westermann Recently, high-fidelity scene reconstruction with an optimized 3D Gaussian splat representation has been introduced for novel view synthesis from sparse image sets. Making such representations suitable for applications like network streaming and rendering on low-power devices requires significantly reduced memory consumption as well as improved rendering efficiency. We propose a compressed 3D Gaussian splat representation that utilizes sensitivity-aware vector clustering with quantization-aware training to compress directional colors and Gaussian parameters. The learned codebooks have low bitrates and achieve a compression rate of up to 31x on real-world scenes with only minimal degradation of visual quality. We demonstrate that the compressed splat representation can be efficiently rendered with hardware rasterization on lightweight GPUs at up to 4x higher framerates than reported via an optimized GPU compute pipeline. Extensive experiments across multiple datasets demonstrate the robustness and rendering speed of the proposed approach. This paper introduces a novel compression and rendering pipeline for 3D Gaussian splat representations used in novel view synthesis, significantly reducing memory consumption and improving rendering efficiency. High-fidelity scene reconstruction methods often demand extensive memory, making them impractical for applications like network streaming and mobile rendering. This work addresses this limitation, enabling wider adoption of such representations. The pipeline employs sensitivity-aware vector clustering to compress directional colors and Gaussian parameters into compact codebooks. Quantization-aware training refines the scene at reduced bit-rates, and entropy encoding exploits spatial coherence for further compression. Rendering is optimized using GPU sorting and hardware rasterization. Achieves up to 31x compression on real-world scenes with minimal quality loss. Demonstrates up to 4x faster rendering speeds compared to prior compute pipeline approaches. Compressed scenes are suitable for low-power devices and network streaming applications. Aggressively compressing Gaussian positions without significant error remains a challenge. Future work aims to reduce memory footprint during training and explore volumetric scene reconstruction. novel view synthesis, 3d gaussian splatting, scene compression, quantization-aware training, gpu rasterization
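The sensitivity-aware vector clustering summarized above can be approximated by a weighted k-means codebook in which parameters with higher rendering sensitivity pull the cluster centers more strongly, after which each Gaussian stores only a codebook index. The sketch below uses random stand-ins for SH color coefficients and a gradient-magnitude-style sensitivity; the codebook size and weighting scheme are assumptions, and quantization-aware fine-tuning and entropy coding are omitted.

```python
import numpy as np

def weighted_kmeans(vectors, weights, k=256, iters=20, seed=0):
    """Fit a codebook where high-sensitivity vectors pull cluster centers more strongly."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distances (N, k) via the expanded norm identity.
        d = (vectors ** 2).sum(axis=1, keepdims=True) - 2 * vectors @ centers.T + (centers ** 2).sum(axis=1)
        assign = d.argmin(axis=1)
        for c in range(k):
            m = assign == c
            if m.any():
                w = weights[m][:, None]
                centers[c] = (w * vectors[m]).sum(0) / w.sum()   # sensitivity-weighted centroid
    return centers, assign

# Toy stand-ins: 10k Gaussians with 48-dim SH color coefficients and scalar sensitivities.
n, d = 10_000, 48
sh_coeffs = np.random.randn(n, d).astype(np.float32)
sensitivity = np.abs(np.random.randn(n)).astype(np.float32)     # e.g. a gradient-magnitude proxy

codebook, indices = weighted_kmeans(sh_coeffs, sensitivity, k=256)
compressed_bytes = codebook.nbytes + indices.astype(np.uint8).nbytes   # 256 entries fit in uint8 indices
print(f"raw: {sh_coeffs.nbytes / 1e6:.2f} MB  compressed: {compressed_bytes / 1e6:.3f} MB")
```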
2401.02418 Report Learning to Prompt with Text Only Supervision for Vision-Language Models Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari Foundational vision-language models such as CLIP are becoming a new paradigm in vision, due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data which is not practical, and often struggle to generalize towards new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods by generating class descriptions from large language models (LLMs) and perform prompt ensembling. However, these methods often generate class specific prompts that cannot be transferred to other classes, which incur higher costs by generating LLM descriptions for each class separately. In this work, we propose to combine the strengths of these both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial due to absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with LLM contextual data mapped within the learned prompts, it enables zero-shot transfer of prompts to new classes and datasets potentially cutting the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text only data. We perform extensive evaluations on 4 benchmarks where our method improves over prior ensembling works while being competitive to those utilizing labeled images. Our code and pre-trained models are available at https://github.com/muzairkhattak/ProText. ProText, a novel approach to adapt CLIP for downstream visual recognition tasks, leverages text-only supervision from Large Language Models (LLMs) to learn generalized and transferable prompts. Existing methods for adapting CLIP either rely on labeled image data, which can be impractical, or employ class-specific LLM prompts that lack transferability to new classes and datasets. ProText addresses both limitations. ProText curates text-to-text data from LLMs by pairing class-name templates with corresponding descriptions. It then trains learnable prompts to map these templates to rich contextual features aligned with LLM descriptions, effectively embedding LLM knowledge within the prompts. In cross-dataset transfer, ProText outperforms CLIP and CuPL by +2.1% on average, demonstrating its generalization ability without using any visual samples. ProText surpasses prior image-supervised prompt learning methods in base-to-novel class generalization, achieving a higher average novel class accuracy of 76.98%. ProText consistently outperforms CuPL and WaffleCLIP in text-only supervised setting across diverse image datasets, indicating its effectiveness in utilizing LLM data for prompt learning. The performance of ProText is dependent on the quality and size of LLM-generated text data, with potential for further improvement as text data quality increases. Exploring alternative techniques for contextual mapping, beyond prompt learning, could be a potential direction for future work. prompt learning, clip, zero-shot learning, vision-language models, large language models
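A rough sketch of text-only prompt learning in the spirit of the entry above: learnable context vectors are prepended to a class-template embedding, both pass through a frozen text encoder, and the output is regressed onto the frozen feature of the corresponding LLM description. The FrozenTextEncoder stand-in, the L1 objective, and the toy token embeddings are assumptions; the actual method uses CLIP's text encoder and tokenizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenTextEncoder(nn.Module):
    """Stand-in for a frozen CLIP-style text transformer: token embeddings -> pooled feature."""
    def __init__(self, dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=256, dropout=0.0, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        for prm in self.parameters():
            prm.requires_grad_(False)

    def forward(self, token_embeds):
        return self.encoder(token_embeds).mean(dim=1)     # pooled text feature

dim, n_prompts = 128, 4
encoder = FrozenTextEncoder(dim)
prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)   # the only trainable parameters
optimizer = torch.optim.AdamW([prompts], lr=1e-3)

def text_only_step(template_embeds, description_embeds):
    """Align learned prompts + class template with the frozen feature of the LLM description."""
    with torch.no_grad():
        target = encoder(description_embeds)
    ctx = prompts.unsqueeze(0).expand(template_embeds.size(0), -1, -1)
    pred = encoder(torch.cat([ctx, template_embeds], dim=1))
    return F.l1_loss(pred, target)

template_embeds = torch.randn(8, 6, dim)       # toy embeddings of "a photo of a <class>" templates
description_embeds = torch.randn(8, 30, dim)   # toy embeddings of LLM-generated class descriptions
loss = text_only_step(template_embeds, description_embeds)
loss.backward()
optimizer.step()
```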
2401.02416 Report ODIN: A Single Model for 2D and 3D Segmentation Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io). This paper proposes ODIN, a novel model for 2D and 3D instance segmentation that effectively leverages pre-trained 2D backbones and operates directly on posed RGB-D images, achieving state-of-the-art performance in various benchmarks. Existing 3D segmentation models typically rely on pre-processed 3D point clouds, limiting their applicability to real-world scenarios where raw sensor data is prevalent. This work bridges the gap between 2D and 3D perception by unifying them into a single model that can directly process sensor data. ODIN alternates between 2D within-view fusion and 3D cross-view attention layers. It unprojects 2D features to 3D for cross-view contextualization and then projects them back to 2D. The model shares most of its parameters across 2D and 3D inputs, effectively leveraging pre-trained 2D backbones. ODIN sets new state-of-the-art performance on ScanNet200, Matterport3D, and AI2THOR 3D instance segmentation benchmarks, outperforming previous methods that use mesh-sampled point clouds. The model also achieves competitive results on ScanNet and S3DIS benchmarks, demonstrating its effectiveness in handling real-world sensor data with misalignments. When used as the 3D perception engine in an instructable embodied agent architecture, ODIN sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. ODIN's performance depends on the accuracy of depth and camera pose estimations. Further research can explore scaling up 3D learning by jointly training on diverse 2D and 3D datasets for improved generalization. 3d instance segmentation, 2d-3d perception, rgb-d processing, embodied vision, transformer networks
2401.02414 Report Bring Metric Functions into Diffusion Models Jie An, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, Jiebo Luo We introduce a Cascaded Diffusion Model (Cas-DM) that improves a Denoising Diffusion Probabilistic Model (DDPM) by effectively incorporating additional metric functions in training. Metric functions such as the LPIPS loss have been proven highly effective in consistency models derived from the score matching. However, for the diffusion counterparts, the methodology and efficacy of adding extra metric functions remain unclear. One major challenge is the mismatch between the noise predicted by a DDPM at each step and the desired clean image that the metric function works well on. To address this problem, we propose Cas-DM, a network architecture that cascades two network modules to effectively apply metric functions to the diffusion model training. The first module, similar to a standard DDPM, learns to predict the added noise and is unaffected by the metric function. The second cascaded module learns to predict the clean image, thereby facilitating the metric function computation. Experiment results show that the proposed diffusion model backbone enables the effective use of the LPIPS loss, leading to state-of-the-art image quality (FID, sFID, IS) on various established benchmarks. This paper introduces a Cascaded Diffusion Model (CasDM) that enhances Denoising Diffusion Probabilistic Models (DDPMs) by incorporating additional metric functions, such as the LPIPS loss, during training. Metric functions like LPIPS have shown significant improvements in consistency models, but their application to diffusion models remained unclear due to the challenge of aligning multi-step noise prediction with single-step metric computation. CasDM employs two cascaded network modules. The first module predicts added noise like a standard DDPM. The second module refines the clean image prediction, facilitating effective metric function computation. This design isolates the noise prediction from the metric function's influence. CasDM with LPIPS loss achieves state-of-the-art image quality (FID, sFID, IS) on various benchmarks. The architecture consistently improves performance across datasets, demonstrating the effectiveness of incorporating metric functions in diffusion models. The LPIPS loss enhances the diversity and distribution alignment of generated images, potentially due to its semantic awareness from the VGG backbone. Exploration of more effective metric functions beyond LPIPS. Investigation into further architectural improvements for the clean image prediction module. diffusion models, generative models, image generation, lpips loss, metric learning
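The cascaded design above can be sketched as a training step in which the first module keeps the standard noise-prediction loss while a second module refines an analytic clean-image estimate, so that an image-space metric loss can be applied to the refined prediction. The toy networks, the detach used to keep the noise branch unaffected by the metric term, and the use of plain MSE in place of LPIPS are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

eps_net = nn.Sequential(nn.Linear(32 + 1, 64), nn.SiLU(), nn.Linear(64, 32))   # module 1: predicts noise
x0_net = nn.Sequential(nn.Linear(64 + 1, 64), nn.SiLU(), nn.Linear(64, 32))    # module 2: refines clean image

def cascaded_training_step(x0, metric_loss=F.mse_loss):
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].unsqueeze(1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    t_feat = t.float().unsqueeze(1) / T

    eps_pred = eps_net(torch.cat([x_t, t_feat], dim=1))
    x0_from_eps = (x_t - (1 - a).sqrt() * eps_pred) / a.sqrt()        # analytic clean-image estimate
    # Detaching is one way to keep the noise branch untouched by the metric term, as described above.
    x0_pred = x0_net(torch.cat([x_t, x0_from_eps.detach(), t_feat], dim=1))

    loss_eps = F.mse_loss(eps_pred, noise)      # standard DDPM objective for module 1
    loss_img = metric_loss(x0_pred, x0)         # image-space metric loss (LPIPS in the paper)
    return loss_eps + loss_img

loss = cascaded_training_step(torch.randn(8, 32))
loss.backward()
```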
2401.02402 Report 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng 3D panoptic segmentation is a challenging perception task, especially in autonomous driving. It aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing these approaches to unseen things and unseen stuff categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation has achieved promising results that solely rely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not guarantee good performance due to poor per-mask classification quality, especially for novel stuff categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features, using a single classification head to make predictions for both base and novel classes. To further improve the classification performance on novel classes and leverage the CLIP model, we propose two novel loss functions: object-level distillation loss and voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms the strong baseline by a large margin. This paper proposes the first method for 3D open-vocabulary panoptic segmentation, aiming to segment both unseen "things" and unseen "stuff" objects in autonomous driving scenarios. Existing 3D panoptic segmentation models struggle to generalize to unseen object categories, limiting their real-world applicability in fields like autonomous driving. The method fuses learned LiDAR features with frozen CLIP vision features, utilizing a single classification head for base and novel classes. Two novel distillation losses, object-level and voxel-level, improve classification performance on novel classes by leveraging CLIP's capabilities. The method significantly outperforms the baseline on nuScenes and SemanticKITTI datasets. The voxel-level distillation loss is particularly effective for novel "stuff" categories. The fusion of LiDAR and CLIP features improves performance for novel "things" classes. The model is evaluated on benchmarks with a limited number of categories, necessitating larger datasets for comprehensive evaluation. Future work could explore combining this method with approaches like RegionPLC for enhanced point-level discriminative features. autonomous driving, 3d panoptic segmentation, open vocabulary, vision-language, clip
2401.02400 Report Learning the 3D Fauna of the Web Zizhang Li, Dor Litvak, Ruining Li, Yunzhi Zhang, Tomas Jakab, Christian Rupprecht, Shangzhe Wu, Andrea Vedaldi, Jiajun Wu Learning 3D models of all animals on the Earth requires massively scaling up existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an approach that learns a pan-category deformable 3D animal model for more than 100 animal species jointly. One crucial bottleneck of modeling animals is the limited availability of training data, which we overcome by simply learning from 2D Internet images. We show that prior category-specific attempts fail to generalize to rare species with limited training images. We address this challenge by introducing the Semantic Bank of Skinned Models (SBSM), which automatically discovers a small set of base animal shapes by combining geometric inductive priors with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor. To train such a model, we also contribute a new large-scale dataset of diverse animal species. At inference time, given a single image of any quadruped animal, our model reconstructs an articulated 3D mesh in a feed-forward fashion within seconds. This paper presents 3D-Fauna, a method that learns a pan-category deformable 3D animal model for more than 100 quadruped animal species jointly from 2D internet images. Existing 3D animal reconstruction methods are limited to one or a few specific species. This work aims to achieve a more scalable solution by learning a single model for all animal species from readily available internet images. The paper proposes the Semantic Bank of Skinned Models (SBSM), which learns a low-dimensional base shape bank using unsupervised image features and interpolates between them to model diverse shapes. It also introduces a mask discriminator to prevent viewpoint collapse. The method reconstructs accurate articulated 3D animal shapes from single images across diverse species. Quantitative evaluations show significant improvements over existing single-category methods on keypoint transfer tasks. The Semantic Bank is shown to be crucial in preventing overfitting and capturing inter-species shape similarities. The method is currently limited to quadruped animals with similar skeletal structures. Reconstructing accurate shapes for highly deformable animals, such as cats, remains challenging. 3d reconstruction, animal modeling, deformable models, single-view reconstruction, unsupervised learning
2401.02361 Report An Open and Comprehensive Pipeline for Unified Object Grounding and Detection Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community. Codes and trained models are released at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino. This paper introduces MM-Grounding-DINO, an open-source and comprehensive pipeline for open-vocabulary object detection, phrase grounding, and referring expression comprehension built on the MMDetection toolbox. Grounding-DINO, while achieving state-of-the-art performance in these tasks, lacks publicly available training code, limiting reproducibility and further research. This work aims to fill this gap. The authors rebuilt Grounding-DINO using MMDetection, retaining the core architecture while adding a bias initialization to the contrastive embedding module. They pre-trained the model on a large dataset comprising COCO, Objects365, GRIT, V3Det, and referring expression datasets. MM-Grounding-DINO-Tiny achieves superior zero-shot performance compared to Grounding-DINO-Tiny on COCO (50.6 mAP), LVIS (41.4 mAP), ODinW benchmarks, and comparable results on RefCOCO, gRefCOCO. The paper provides an extensive benchmark of results on OVD, PG, and REC tasks using a variety of datasets, offering a valuable resource for future research. Fine-tuning experiments demonstrate MM-Grounding-DINO's strong generalizability across various downstream tasks like object detection in haze, underwater, and in paintings. The paper identifies limitations in the GRIT dataset, used as a substitute for the unavailable Cap4M, particularly the presence of noisy annotations and abstract phrases. Future work could explore more robust evaluation metrics for REC tasks and address the model's limitations in understanding relational terms and detailed descriptions. open-vocabulary detection, phrase grounding, referring expression comprehension, mmdetection, zero-shot learning
2401.02347 Report Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training Longtian Qiu, Shan Ning, Xuming He Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis on the CLIP latent space which leads to two findings. Firstly, we observe that the CLIP's visual feature of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired image-text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by the findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap. This paper proposes MacCap, a novel zero-shot image captioning framework that reduces the modality gap in CLIP's latent space by leveraging subregion image features and text-only training with noise injection and CLIP reranking. Zero-shot image captioning with text-only training eliminates the need for expensive caption annotations and enables efficient development of vision-language applications, particularly for LLMs. MacCap analyzes CLIP's latent space, revealing closer proximity of subregion image features to paired captions and a Gaussian distribution for the modality gap. It introduces region noise injection during training, subregion feature aggregation during inference, and a multiple sampling and filtering strategy with CLIP reranking. MacCap outperforms previous zero-shot captioning methods in cross-domain and in-domain settings. Subregion feature aggregation effectively reduces the modality gap in CLIP. Noise injection and CLIP reranking further improve captioning quality. The impact of sampling and filtering on semantic comprehension is limited. Further exploration is needed to apply MacCap to other vision-language tasks. image captioning, zero-shot learning, clip, modality gap, vision-language models
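The text-only training idea above, modeling the image-text modality gap as zero-mean Gaussian noise injected into CLIP text embeddings so that a caption decoder trained purely on text can accept image embeddings at test time, can be sketched as follows. The TinyPrefixDecoder, the GRU decoder, and the noise scale are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPrefixDecoder(nn.Module):
    """Maps a (noised) CLIP-style embedding to an initial state and decodes caption tokens."""
    def __init__(self, embed_dim=512, vocab=1000, hidden=256):
        super().__init__()
        self.to_prefix = nn.Linear(embed_dim, hidden)
        self.lm = nn.GRU(hidden, hidden, batch_first=True)
        self.tok_emb = nn.Embedding(vocab, hidden)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, clip_embed, tokens):
        prefix = self.to_prefix(clip_embed).unsqueeze(0)    # initial GRU hidden state
        out, _ = self.lm(self.tok_emb(tokens), prefix)
        return self.head(out)

def text_only_step(decoder, text_embed, tokens, noise_std=0.1):
    """Train on text embeddings plus Gaussian noise standing in for the image-text modality gap."""
    text_embed = F.normalize(text_embed, dim=-1)
    noised = F.normalize(text_embed + noise_std * torch.randn_like(text_embed), dim=-1)
    logits = decoder(noised, tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

decoder = TinyPrefixDecoder()
text_embed = torch.randn(4, 512)            # stand-in for CLIP text features of training captions
tokens = torch.randint(0, 1000, (4, 12))    # stand-in caption token ids
loss = text_only_step(decoder, text_embed, tokens)
loss.backward()
# At inference, the same decoder is fed (aggregated sub-region) CLIP image features instead.
```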
2401.02330 Report LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang In this paper, we introduce LLaVA-φ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction, while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi. The paper introduces LLaVA-Phi, an efficient multi-modal assistant that leverages the compact language model Phi-2 for multi-modal dialogues, demonstrating the capabilities of smaller language models in visual-language tasks. This work addresses the limitations of large vision-language models, such as high computational costs, by exploring the effectiveness of smaller, more efficient models for real-time applications on edge devices. LLaVA-Phi is built by fine-tuning Phi-2 with a high-quality dataset and then training it with the LLaVA pipeline, which includes pre-training and visual instruction tuning. LLaVA-Phi achieves performance comparable to or surpassing larger multi-modal models on various benchmarks, including VQA-v2, VizWizQA, and ScienceQA. It demonstrates strong generalization ability in handling complex questions, generating code from visual input, and solving mathematical problems. The model outperforms other efficient vision-language models like MobileVLM on multiple benchmarks. The current architecture of LLaVA-Phi is limited to English instructions due to the codegen-mono tokenizer used by Phi-2. Future work will focus on exploring the impact of visual encoder size, refining training strategies (e.g., direct preference optimization, RLHF), and further reducing model size while maintaining or improving performance. multi-modal learning, vision-language models, small language models, efficient ai, real-time applications
2401.02317 Report BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model Yiran Song, Qianyu Zhou, Xiangtai Li, Deng-Ping Fan, Xuequan Lu, Lizhuang Ma In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits a performance degradation when faced with datasets with varying image sizes. Previous approaches tend to resize the image to a fixed size or adopt structure modifications, hindering the preservation of SAM's rich prior knowledge. Besides, such task-specific tuning necessitates a complete retraining of the model, which is cost-expensive and unacceptable for deployment in the downstream tasks. In this paper, we reformulate this issue as a length extrapolation problem, where token sequence length varies while maintaining a consistent patch size for images of different sizes. To this end, we propose Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structure modifications. Firstly, we introduce a new scaling factor to ensure consistent magnitude in the attention layer's dot product values when the token sequence length changes. Secondly, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. Our BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to significantly mitigate performance degradation in the zero-shot setting and achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously. Code is available at https://github.com/zongzi13545329/BA-SAM This paper proposes BA-SAM, a Scalable Bias-Mode Attention Mask, to enhance the Segment Anything Model's (SAM) adaptability to varying image resolutions without structural modifications. SAM, despite its zero-shot generalizability, suffers performance degradation with datasets of varying image sizes, limiting its application in downstream tasks. The paper introduces: 1) a new scaling factor to maintain consistent magnitude in the attention layer's dot product values across varying token sequence lengths and 2) a bias-mode attention mask to prioritize neighboring information for each token, mitigating the impact of untrained distant information. BA-SAM significantly mitigates performance degradation in zero-shot settings when inferring on higher resolutions. BA-SAM achieves state-of-the-art accuracy on various segmentation tasks with minimal fine-tuning. A proposed generalized BA-SAM model demonstrates strong generalizability across four datasets simultaneously. The paper primarily focuses on enhancing SAM's performance on resolution variations, with other factors potentially influencing its performance. Future work could investigate extending BA-SAM to other vision transformer architectures beyond SAM. segment anything model, resolution variation, attention mechanism, zero-shot learning, computer vision
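The two ingredients above, a length-aware rescaling of the attention logits and a bias-mode mask that penalizes distant tokens, can be combined in a single attention function. The log-length scaling rule and the linear distance penalty below are plausible stand-ins chosen for illustration, not necessarily BA-SAM's exact formulas.

```python
import math
import torch

def ba_style_attention(q, k, v, train_len=1024, slope=0.02):
    """Scaled dot-product attention with a length-aware scale and a distance-penalty bias mask.

    q, k, v: (batch, heads, seq_len, head_dim). `train_len` is the token count seen in training;
    `slope` controls how strongly distant (likely untrained) positions are down-weighted.
    """
    b, h, n, d = q.shape
    # Length-aware scaling: keep logit magnitudes comparable when n deviates from train_len.
    scale = math.log(n, train_len) / math.sqrt(d) if n > 1 else 1.0 / math.sqrt(d)
    logits = (q @ k.transpose(-2, -1)) * scale
    # Bias-mode mask: each query pays a penalty that grows with 1D token distance.
    idx = torch.arange(n, device=q.device)
    distance = (idx[None, :] - idx[:, None]).abs().float()
    logits = logits - slope * distance
    return torch.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(1, 4, 2048, 32)   # longer token sequence than seen at training time
out = ba_style_attention(q, k, v, train_len=1024)
print(out.shape)
```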
2401.02281 Report PEGASUS: Physically Enhanced Gaussian Splatting Simulation System for 6DOF Object Pose Dataset Generation Lukas Meyer, Floris Erich, Yusuke Yoshiyasu, Marc Stamminger, Noriaki Ando, Yukiyasu Domae We introduce Physically Enhanced Gaussian Splatting Simulation System (PEGASUS) for 6DOF object pose dataset generation, a versatile dataset generator based on 3D Gaussian Splatting. Environment and object representations can be easily obtained using commodity cameras to reconstruct with Gaussian Splatting. PEGASUS allows the composition of new scenes by merging the respective underlying Gaussian Splatting point cloud of an environment with one or multiple objects. Leveraging a physics engine enables the simulation of natural object placement within a scene through interaction between meshes extracted for the objects and the environment. Consequently, an extensive amount of new scenes - static or dynamic - can be created by combining different environments and objects. By rendering scenes from various perspectives, diverse data points such as RGB images, depth maps, semantic masks, and 6DoF object poses can be extracted. Our study demonstrates that training on data generated by PEGASUS enables pose estimation networks to successfully transfer from synthetic data to real-world data. Moreover, we introduce the Ramen dataset, comprising 30 Japanese cup noodle items. This dataset includes spherical scans that captures images from both object hemisphere and the Gaussian Splatting reconstruction, making them compatible with PEGASUS. Introduces PEGASUS, a dataset generation tool for 6DoF object pose estimation that uses 3D Gaussian Splatting and physics simulations to create photorealistic scenes with accurate object poses. Addresses the need for domain-specific datasets for robotics in service sectors, particularly for tasks like object pose estimation in convenience stores, where existing datasets are limited. Combines 3D Gaussian Splatting reconstructions of environments and objects. Uses a physics engine (PyBullet) to realistically place objects within the environment. Renders novel views of the scene, extracting RGB images, depth maps, segmentation masks, and object poses. Training on PEGASUS-generated data enables pose estimation networks (specifically DOPE) to successfully transfer from synthetic to real-world data. Introduces the 'Ramen' dataset, containing over 30 Japanese cup noodle products with spherical scans and 3D Gaussian Splatting reconstructions. Demonstrates the effectiveness of PEGASUS by successfully training DOPE for a real-world grasping task using a UR5 robot. Lacks realistic shadow rendering; incorporating shadow maps or screen space ambient occlusion is planned. Scanning texture-less environments can lead to noisy Gaussian Splatting reconstructions, causing visual artifacts. dataset generation, robotics, radiance fields, sim2real, gaussian splatting
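The physics-based placement step above, dropping extracted object meshes into a reconstructed environment and reading back their settled 6DoF poses, can be sketched with PyBullet. The plane and cube URDFs below are placeholders for the environment and object meshes; only the simulation calls themselves are standard PyBullet API.

```python
import pybullet as p
import pybullet_data

# Headless physics simulation for settling objects on a ground plane / environment mesh.
p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane_id = p.loadURDF("plane.urdf")          # stand-in for the reconstructed environment mesh
object_ids = []
for i in range(3):
    # "cube_small.urdf" is a placeholder for an object mesh extracted from its Gaussian splat.
    oid = p.loadURDF("cube_small.urdf", basePosition=[0.1 * i, 0.0, 0.5 + 0.2 * i])
    object_ids.append(oid)

for _ in range(500):                          # let the objects fall and come to rest
    p.stepSimulation()

poses = []
for oid in object_ids:
    pos, orn = p.getBasePositionAndOrientation(oid)   # settled 6DoF pose (xyz + quaternion)
    poses.append((pos, orn))
    print(oid, pos, orn)

p.disconnect()
# These poses would then be used to place the object Gaussians in the merged scene before rendering.
```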
2401.02142 Report GUESS:GradUally Enriching SyntheSis for Text-Driven Human Motion Generation Xuehao Gao, Yang Yang, Zhenyu Xie, Shaoyi Du, Zhongqian Sun, Yang Wu In this paper, we propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis, which exploits a strategy named GradUally Enriching SyntheSis (GUESS as its abbreviation). The strategy sets up generation objectives by grouping body joints of detailed skeletons in close semantic proximity together and then replacing each of such joint group with a single body-part node. Such an operation recursively abstracts a human pose to coarser and coarser skeletons at multiple granularity levels. With gradually increasing the abstraction level, human motion becomes more and more concise and stable, significantly benefiting the cross-modal motion synthesis task. The whole text-driven human motion synthesis problem is then divided into multiple abstraction levels and solved with a multi-stage generation framework with a cascaded latent diffusion model: an initial generator first generates the coarsest human motion guess from a given text description; then, a series of successive generators gradually enrich the motion details based on the textual description and the previous synthesized results. Notably, we further integrate GUESS with the proposed dynamic multi-condition fusion mechanism to dynamically balance the cooperative effects of the given textual condition and synthesized coarse motion prompt in different generation stages. Extensive experiments on large-scale datasets verify that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realisticness, and diversity. Code is available at https://github.com/Xuehao-Gao/GUESS. This paper presents GUESS (Gradually Enriching Synthesis), a novel cascaded diffusion-based framework for text-driven human motion generation. Existing methods struggle with the large discrepancy between text and motion modalities. GUESS addresses this by mimicking the human brain's coarse-to-fine imagination process, progressively generating motion from abstract body part levels to detailed skeletons. GUESS uses multi-scale skeletal representation to abstract human poses. It employs a variational autoencoder for motion embedding and a cascaded latent diffusion model for generating motion, guided by text descriptions and coarser motion guesses. It also introduces a dynamic multi-condition fusion mechanism to adaptively balance text and motion cues during generation. GUESS significantly outperforms state-of-the-art methods on HumanML3D and KIT-ML datasets in terms of fidelity, text-motion consistency, and diversity. The proposed multi-scale and cascaded generation significantly reduces body-joint jittering and improves motion trajectory adherence to text descriptions. Dynamic multi-condition fusion effectively balances text and motion cues, leading to better generation quality. The current multi-stage scheme uses a fixed number of inference stages, which could be made adaptive to different text inputs. Motion guess can be extended from spatial to temporal dimensions for generating sequences with increasing temporal resolution. human motion synthesis, text-driven generation, cascaded diffusion model, multi-scale representation, dynamic multi-condition fusion
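The coarse-to-fine skeleton abstraction at the heart of GUESS amounts to repeatedly pooling semantically close joints into single body-part nodes. A minimal sketch, assuming a 22-joint skeleton and an invented joint grouping (the paper defines its own groupings and several granularity levels):

```python
import numpy as np

# Hypothetical grouping of a 22-joint skeleton into five coarse body parts;
# the indices below are illustrative only.
LEVEL1 = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9],
          [10, 11, 12, 13, 14], [15, 16, 17, 18, 19, 20, 21]]

def abstract(pose, groups):
    """Replace each joint group with a single node at its mean position.

    pose: (J, 3) joint coordinates -> (len(groups), 3) coarse skeleton.
    """
    return np.stack([pose[g].mean(axis=0) for g in groups])

motion = np.random.randn(60, 22, 3)                 # 60 frames, 22 joints
coarse = np.stack([abstract(f, LEVEL1) for f in motion])
print(coarse.shape)  # (60, 5, 3): the coarser "guess" the first-stage generator targets
```

The first generator in the cascade synthesizes motion at the coarsest level, and each later stage enriches detail conditioned on both the text and the previous, coarser result.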
2401.02126 Report Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance Jiacheng Wang, Ping Liu, Wei Xu Existing text-to-image editing methods tend to excel either in rigid or non-rigid editing but encounter challenges when combining both, resulting in misaligned outputs with the provided text prompts. In addition, integrating reference images for control remains challenging. To address these issues, we present a versatile image editing framework capable of executing both rigid and non-rigid edits, guided by either textual prompts or reference images. We leverage a dual-path injection scheme to handle diverse editing scenarios and introduce an integrated self-attention mechanism for fusion of appearance and structural information. To mitigate potential visual artifacts, we further employ latent fusion techniques to adjust intermediate latents. Compared to previous work, our approach represents a significant advance in achieving precise and versatile image editing. Comprehensive experiments validate the efficacy of our method, showcasing competitive or superior results in text-based editing and appearance transfer tasks, encompassing both rigid and non-rigid settings. This paper proposes a versatile image editing framework capable of handling both rigid (e.g., color change) and non-rigid (e.g., shape change) edits, guided by either text prompts or reference images. Existing text-to-image editing methods struggle to effectively perform both rigid and non-rigid edits simultaneously, often resulting in misaligned outputs or limited control. This new framework addresses these limitations and enables more versatile and precise image editing. The framework leverages a dual-path injection scheme to handle different editing scenarios and introduces a unified self-attention mechanism to fuse appearance and structural information. Additionally, latent fusion techniques are employed to refine intermediate representations and mitigate visual artifacts. The method achieves competitive or superior results compared to existing text-based editing methods, demonstrating improved alignment with target prompts and better handling of both rigid and non-rigid edits. It outperforms state-of-the-art appearance transfer methods, exhibiting superior preservation of both structural and appearance details. Ablation studies confirm the effectiveness of the proposed dual-path injection scheme, unified self-attention mechanism, and latent fusion techniques. The method relies on pre-trained Stable Diffusion models, which might limit its generalizability to unseen domains or concepts. Further exploration is needed to improve fine-grained control over the degree of rigid and non-rigid transformations. image editing, text-guided image manipulation, appearance transfer, diffusion models, self-attention
2401.02032 Report DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection Yunfan Ye, Kai Xu, Yuhang Huang, Renjiao Yi, Zhiping Cai Limited by the encoder-decoder architecture, learning-based edge detectors usually have difficulty predicting edge maps that satisfy both correctness and crispness. With the recent success of the diffusion probabilistic model (DPM), we found it is especially suitable for accurate and crisp edge detection since the denoising process is directly applied to the original image size. Therefore, we propose the first diffusion model for the task of general edge detection, which we call DiffusionEdge. To avoid expensive computational resources while retaining the final performance, we apply DPM in the latent space and enable the classic cross-entropy loss which is uncertainty-aware in pixel level to directly optimize the parameters in latent space in a distillation manner. We also adopt a decoupled architecture to speed up the denoising process and propose a corresponding adaptive Fourier filter to adjust the latent features of specific frequencies. With all the technical designs, DiffusionEdge can be stably trained with limited resources, predicting crisp and accurate edge maps with much fewer augmentation strategies. Extensive experiments on four edge detection benchmarks demonstrate the superiority of DiffusionEdge both in correctness and crispness. On the NYUDv2 dataset, compared to the second best, we increase the ODS, OIS (without post-processing) and AC by 30.2%, 28.1% and 65.1%, respectively. Code: https://github.com/GuHuangAI/DiffusionEdge. This paper proposes DiffusionEdge, the first diffusion model for edge detection, which generates accurate and crisp edge maps without post-processing. Existing learning-based edge detectors struggle to achieve both correctness and crispness simultaneously, often relying on post-processing steps. DiffusionEdge utilizes a decoupled diffusion architecture in latent space, employing an adaptive Fourier filter for frequency parsing and uncertainty distillation to maintain pixel-level uncertainty information from annotations. DiffusionEdge achieves state-of-the-art performance on BSDS, NYUDv2, Multicue, and BIPED datasets. The model significantly outperforms other methods in crispness, as demonstrated by the Average Crispness (AC) metric. DiffusionEdge generates high-quality edge maps with minimal noise, even in challenging scenarios with complex backgrounds and textures. The inference speed of DiffusionEdge can be further improved. Exploring the application of DiffusionEdge to downstream tasks in an end-to-end manner is a promising direction. edge detection, diffusion model, computer vision, deep learning, image processing
2401.02015 Report Improving Diffusion-Based Image Synthesis with Context Prediction Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui Diffusion models are a new class of generative models, and have dramatically promoted image generation with unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we for the first time propose ConPreDiff to improve diffusion-based image synthesis with context prediction. We explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of diffusion denoising blocks in training stage, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with neighborhood context. This new paradigm of ConPreDiff can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves a new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21. This paper proposes ConPreDiff, a novel method that enhances diffusion-based image synthesis by explicitly predicting neighborhood context. Existing diffusion models primarily focus on point-wise reconstruction, neglecting the preservation of local context, which is crucial for generating high-fidelity images. ConPreDiff introduces a context decoder to predict neighborhood distributions during training. This decoder is removed during inference, ensuring efficiency. An optimal transport loss based on the Wasserstein distance is employed to optimize the context prediction. ConPreDiff achieves state-of-the-art results on text-to-image generation benchmarks, surpassing previous diffusion and non-diffusion models. The method significantly improves image inpainting performance across various mask distributions. ConPreDiff consistently enhances unconditional image synthesis, demonstrating superior perceptual quality and data distribution coverage. Despite not adding inference parameters, ConPreDiff models have more trainable parameters than GANs. ConPreDiff inherits the long sampling times of diffusion models compared to single-step generative approaches. diffusion models, image generation, context prediction, wasserstein distance, text-to-image synthesis
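A hedged sketch of the context-prediction idea: during training, a small decoder head attached to the denoiser predicts each position's neighbourhood, and this extra branch is simply dropped at sampling time, so inference cost is unchanged. The 1x1-conv head and the plain MSE objective below are stand-ins; ConPreDiff decodes a neighbourhood distribution and optimises it with a Wasserstein-based optimal-transport loss.

```python
import torch
import torch.nn.functional as F

def context_targets(feat, stride=1):
    """Gather the 8 neighbours at a given stride for every spatial position.

    feat: (B, C, H, W). Returns (B, 8, C, H, W); circular padding via torch.roll
    keeps the sketch short.
    """
    shifts = [(-stride, -stride), (-stride, 0), (-stride, stride),
              (0, -stride),                     (0, stride),
              (stride, -stride), (stride, 0),   (stride, stride)]
    return torch.stack([torch.roll(feat, s, dims=(2, 3)) for s in shifts], dim=1)

# Training-time only: predict each point's neighbourhood from its own feature.
feat = torch.randn(2, 64, 32, 32)                 # features from a denoising block
head = torch.nn.Conv2d(64, 8 * 64, kernel_size=1)  # assumed context-decoder head
pred = head(feat).reshape(2, 8, 64, 32, 32)
loss_context = F.mse_loss(pred, context_targets(feat, stride=1))
print(loss_context.item())
```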
2401.01970 Report FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, Mingyang Li Precisely perceiving the geometric and semantic properties of real-world 3D objects is crucial for the continued evolution of augmented reality and robotic applications. To this end, we present Foundation Model Embedded Gaussian Splatting (FMGS), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS). The key contribution of this work is an efficient method to reconstruct and represent 3D vision-language models. This is achieved by distilling feature maps generated from image-based foundation models into those rendered from our 3D model. To ensure high-quality rendering and fast training, we introduce a novel scene representation by integrating strengths from both GS and multi-resolution hash encodings (MHE). Our effective training procedure also introduces a pixel alignment loss that makes the rendered feature distance of the same semantic entities close, following the pixel-level semantic boundaries. Our results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks, beating state-of-the-art methods by 10.2 percent on open-vocabulary language-based object detection, despite that we are 851X faster for inference. This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments. We plan to release the code on the project page. FMGS embeds vision-language information from foundation models, such as CLIP, into a 3D Gaussian Splatting (GS) based scene representation for holistic 3D scene understanding. Existing 3D scene understanding methods are limited to either geometric understanding or closed-set object detection. This work explores open-vocabulary 3D scene understanding by leveraging the success of vision-language foundation models. The method distills CLIP and DINO embeddings into a multi-resolution hash encoding (MHE) field built upon 3D Gaussians generated by GS. A novel training procedure using a hybrid CLIP feature map and a pixel alignment loss ensures multi-view consistency and spatial accuracy. Achieves state-of-the-art performance on open-vocabulary 3D object detection, surpassing previous methods by a significant margin. Demonstrates strong performance on semantic segmentation tasks, highlighting the quality of the learned feature embedding. Exhibits superior inference speed compared to NeRF-based methods, enabling real-time open-vocabulary queries. Reliance on high-quality calibrated input images, limiting applicability in uncontrolled settings. Performance is limited by the quality of the base foundation models used for training. 3d gaussian splatting, vision-language embeddings, foundation models, open-vocabulary scene understanding, semantic segmentation
2401.01952 Report Instruct-Imagen: Image Generation with Multi-modal Instruction Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using the retrieval-augmented training, to enhance model's capabilities to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that requires vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Introduces Instruct-Imagen, an image generation model that leverages multi-modal instructions to perform various visual generation tasks, generalizing to unseen and complex tasks. Addresses the limitations of existing image generation models that often specialize in specific modalities and struggle with complex, multi-modal instructions. Employs a two-stage training approach: 1) Retrieval-augmented training to enhance multi-modal context processing. 2) Multi-modal instruction-tuning on diverse image generation tasks paired with multi-modal instructions. Achieves comparable or superior performance to task-specific models in in-domain evaluation. Demonstrates strong generalization ability in zero-shot settings, effectively handling unseen and complex multi-modal instructions. Outperforms baselines in instruction following and output quality, highlighting the importance of multi-modal instruction tuning. Limited ability to handle image editing tasks in a zero-shot manner due to challenges in pixel-level consistency. Reliance on a cascaded diffusion model hinders access to high-resolution input details, leading to artifacts in generated images. image generation, multi-modal learning, instruction tuning, zero-shot learning, diffusion models
2401.01862 Report A Vision Check-up for Language Models Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs. This paper investigates the visual knowledge acquired by Large Language Models (LLMs) through learning relationships between strings, particularly in generating and recognizing visual concepts using code as a proxy for images. The work is important because it explores the potential of leveraging LLMs, trained solely on text data, to understand and represent the visual world, potentially opening new avenues for vision-related tasks. The authors introduce a hierarchical dataset of visual concepts and evaluate LLMs on three tasks: (1) Generating code that renders visual concepts. (2) Recognizing visual concepts from code. (3) Improving generated code through text-based self-feedback. Additionally, they investigate if LLM-generated images can be used for training a vision system for natural images. LLMs can generate code representing complex visual scenes, but struggle with details like texture and object interactions. LLMs struggle to recognize human-drawn images represented as code, indicating limitations in spatial reasoning and generalization beyond memorized prototypes. Images generated by LLMs can be used to train a vision system for natural images, achieving state-of-the-art performance when combined with datasets that offer textural diversity. The study relies on code as an intermediary representation for images, which may not fully encapsulate the richness of the visual world. Future work can explore using larger and more diverse code datasets, as well as more complex feedback mechanisms to further improve LLMs' visual understanding. large language models, visual knowledge, image generation, code representation, self-supervised learning
2401.01827 Report Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo Most existing video diffusion models (VDMs) are limited to mere text conditions. Thereby, they are usually lacking in control over visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called multimodal video block (MVB), which consists of conventional spatialtemporal layers for representing video features, and a decoupled cross-attention layer to address image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometry visual conditions, without needing of extra training overhead as opposed to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement on visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS. This paper introduces Moonshot, a novel video generation model that leverages both image and text inputs for enhanced control over video generation. Existing video diffusion models (VDMs) often lack control over visual appearance and geometric structure, relying primarily on text prompts which are insufficient for detailed visual descriptions. The paper proposes a multimodal video block (MVB) incorporating decoupled cross-attention layers to simultaneously process image and text conditions. This design allows integration with pre-trained image ControlNet modules for geometric control without extra training. Moonshot demonstrates superior performance in subject-customized video generation, outperforming text-only models and achieving strong zero-shot customization. The model excels in image animation, exhibiting better identity preservation, temporal consistency, and text alignment compared to existing methods. Moonshot shows promising results in video editing, effectively replacing subjects and incorporating text-guided elements while maintaining high temporal consistency. The authors acknowledge the potential for generating harmful content and plan to implement safety measures like NSFW detectors before release. Future work includes exploring additional applications and refining the model for improved performance on complex video generation tasks. video generation, diffusion models, multimodal conditioning, controlnet, image animation
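The decoupled cross-attention inside the multimodal video block can be pictured as two parallel cross-attention branches, one over text tokens and one over image tokens, whose outputs are added back to the video features. The dimensions, the `image_scale` weight, and the use of `nn.MultiheadAttention` below are illustrative assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Sketch of a decoupled cross-attention layer: separate attention over
    text and image condition tokens, summed into the video features."""
    def __init__(self, dim=320, ctx_dim=768, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim,
                                               batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim,
                                                batch_first=True)

    def forward(self, video_tokens, text_tokens, image_tokens, image_scale=1.0):
        t_out, _ = self.text_attn(video_tokens, text_tokens, text_tokens)
        i_out, _ = self.image_attn(video_tokens, image_tokens, image_tokens)
        return video_tokens + t_out + image_scale * i_out

block = DecoupledCrossAttention()
v = torch.randn(2, 1024, 320)     # flattened spatial tokens of one frame
txt = torch.randn(2, 77, 768)     # text condition embeddings (e.g. CLIP text tokens)
img = torch.randn(2, 16, 768)     # image condition embeddings for appearance
print(block(v, txt, img).shape)   # torch.Size([2, 1024, 320])
```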
2401.01808 Report aMUSEd: An Open MUSE Reproduction Suraj Patil, William Berman, Robin Rombach, Patrick von Platen We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code. We also release checkpoints for two models which directly produce images at 256x256 and 512x512 resolutions. This paper presents aMUSEd, a lightweight, open-source masked image model (MIM) for text-to-image generation based on MUSE, focused on fast image generation. The authors argue that MIM is underexplored compared to latent diffusion, despite advantages like fewer inference steps, interpretability, and single-image style transfer capability. The paper introduces aMUSEd, an 800M parameter model utilizing a CLIP-L/14 text encoder, SDXL-style micro-conditioning, and a U-ViT backbone, trained on the LAION-2B dataset. aMUSEd achieves superior inference speed compared to non-distilled diffusion models and is competitive with distilled few-step models. It demonstrates competitive CLIP scores but lags in FID and Inception scores compared to some state-of-the-art models. aMUSEd shows impressive results in zero-shot image variation, in-painting, single-image style transfer with StyleDrop, and is extended for video generation. aMUSEd's FID and Inception scores are lower than some state-of-the-art models, indicating room for improvement in image quality. The exploration of interpretability for token prediction-based image models is suggested as a future research direction. text-to-image generation, masked image modeling, open-source, fast inference, styledrop
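Masked image models such as aMUSEd generate all image tokens in parallel over a handful of refinement steps rather than one pass per noise level, which is where the inference-speed advantage comes from. The sketch below shows MaskGIT-style confidence-based decoding with a cosine unmasking schedule and greedy token selection; the schedule, step count, and the random `stub` predictor are assumptions, not aMUSEd's released sampler.

```python
import numpy as np

VOCAB, MASK = 8192, 8192          # codebook size and the extra mask-token id (assumed)

def mim_decode(predict_logits, seq_len=256, steps=12):
    """MaskGIT-style parallel decoding: start fully masked, and at each step
    commit only the most confident predictions, leaving the rest masked."""
    tokens = np.full(seq_len, MASK)
    for t in range(steps):
        logits = predict_logits(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = probs.argmax(-1)                        # greedy choice per position
        confidence = probs.max(-1)
        confidence[tokens != MASK] = np.inf               # committed tokens stay committed
        still_masked = int(np.ceil(np.cos(np.pi / 2 * (t + 1) / steps) * seq_len))
        commit = np.argsort(-confidence)[: seq_len - still_masked]
        tokens[commit] = np.where(tokens[commit] == MASK, sampled[commit], tokens[commit])
    return tokens

stub = lambda tok: np.random.randn(tok.shape[0], VOCAB)   # stand-in for the transformer
print(mim_decode(stub)[:8])
```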
2401.01730 Report STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang The recovery of 3D human mesh from monocular images has significantly been developed in recent years. However, existing models usually ignore spatial and temporal information, which might lead to mesh and image misalignment and temporal discontinuity. For this reason, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence clues from human motion by an attention-based Temporal Coherence Fusion Module (TCFM). As for spatial mesh-alignment evidence, we extract fine-grained local information through predicted mesh projection on the feature maps. Based on the spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition to the above, we propose an Average Pooling Module (APM) to allow the model to focus on the entire input sequence rather than just the target frame. This method can remarkably improve the smoothness of recovery results from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF. We achieve a state-of-the-art trade-off between precision and smoothness. Our code and more video results are on the project page https://yw0208.github.io/staf/ This paper introduces STAF, a novel Spatio-Temporal Alignment Fusion model for 3D human mesh recovery from videos, improving both accuracy and smoothness of the reconstruction. Existing methods for 3D human mesh recovery from videos often prioritize either accuracy or temporal smoothness, leading to issues like mesh misalignment or jitter. STAF addresses this limitation by effectively leveraging spatial and temporal information. STAF employs a multi-stage approach: 1) It uses a feature pyramid and an Average Pooling Module (APM) to capture global context and reduce dependence on individual frames. 2) A Temporal Coherence Fusion Module (TCFM) learns temporal dependencies from features extracted using grid sampling. 3) A Spatial Alignment Fusion Module (SAFM) refines the target frame's features by integrating information from adjacent frames using an attention mechanism based on initial mesh projections. STAF achieves state-of-the-art accuracy on 3DPW and MPII3D benchmarks, surpassing previous methods in key metrics like MPJPE and PVE. The model exhibits high smoothness, as indicated by low acceleration error, exceeding most video-based methods. STAF demonstrates strong generalization ability, achieving good performance even without in-domain training data on 3DPW. The over-smoothing issue may arise in extreme cases where human pose changes abruptly. Using a shorter sequence length mitigates this to an extent. Future work could explore alternative methods to handle rapid pose transitions without sacrificing overall smoothness. 3d human mesh recovery, video analysis, temporal coherence, spatial alignment, deep learning
2401.01702 Report Image Sculpting: Precise Object Editing with 3D Geometry Control Jiraphon Yenphraphai, Xichen Pan, Sainan Liu, Daniele Panozzo, Saining Xie We present Image Sculpting, a new framework for editing 2D images by incorporating tools from 3D geometry and graphics. This approach differs markedly from existing methods, which are confined to 2D spaces and typically rely on textual instructions, leading to ambiguity and limited control. Image Sculpting converts 2D objects into 3D, enabling direct interaction with their 3D geometry. Post-editing, these objects are re-rendered into 2D, merging into the original image to produce high-fidelity results through a coarse-to-fine enhancement process. The framework supports precise, quantifiable, and physically-plausible editing options such as pose editing, rotation, translation, 3D composition, carving, and serial addition. It marks an initial step towards combining the creative freedom of generative models with the precision of graphics pipelines. Presents Image Sculpting, a framework for precise 2D image editing that lifts objects into 3D so their geometry can be manipulated directly before being re-rendered into the original image. Existing editing methods are confined to 2D space and typically rely on textual instructions, which are ambiguous and offer only limited control over object geometry. Converts the selected 2D object into a 3D representation, lets the user edit it with standard 3D graphics tools (pose editing, rotation, translation, composition, carving, serial addition), then re-renders the edited object into 2D and merges it back into the source image through a coarse-to-fine enhancement process. Supports precise, quantifiable, and physically plausible edits with direct 3D geometry control. Produces high-fidelity results in which the edited object blends into the original image. Combines the creative freedom of generative models with the precision of graphics pipelines. The approach is positioned as an initial step toward unifying generative models with graphics pipelines, with tighter integration left for future work. image editing, 3d geometry control, generative models, coarse-to-fine enhancement, graphics pipeline
2401.01651 Report AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI Fanda Fan, Chunjie Luo, Wanling Gao, Jianfeng Zhan The burgeoning field of Artificial Intelligence Generated Content (AIGC) is witnessing rapid advancements, particularly in video generation. This paper introduces AIGCBench, a pioneering comprehensive and scalable benchmark designed to evaluate a variety of video generation tasks, with a primary focus on Image-to-Video (I2V) generation. AIGCBench tackles the limitations of existing benchmarks, which suffer from a lack of diverse datasets, by including a varied and open-domain image-text dataset that evaluates different state-of-the-art algorithms under equivalent conditions. We employ a novel text combiner and GPT-4 to create rich text prompts, which are then used to generate images via advanced Text-to-Image models. To establish a unified evaluation framework for video generation tasks, our benchmark includes 11 metrics spanning four dimensions to assess algorithm performance. These dimensions are control-video alignment, motion effects, temporal consistency, and video quality. These metrics are both reference video-dependent and video-free, ensuring a comprehensive evaluation strategy. The evaluation standard proposed correlates well with human judgment, providing insights into the strengths and weaknesses of current I2V algorithms. The findings from our extensive experiments aim to stimulate further research and development in the I2V field. AIGCBench represents a significant step toward creating standardized benchmarks for the broader AIGC landscape, proposing an adaptable and equitable framework for future assessments of video generation tasks. We have open-sourced the dataset and evaluation code on the project website: https://www.benchcouncil.org/AIGCBench. This paper introduces AIGCBench, a comprehensive and scalable benchmark for evaluating Image-to-Video (I2V) generation tasks. Existing I2V benchmarks lack diverse, open-domain datasets and standardized evaluation metrics, hindering fair and comprehensive algorithm assessment. AIGCBench uses real-world and generated image-text datasets and 11 metrics across four dimensions: control-video alignment, motion effects, temporal consistency, and video quality. Closed-source projects (Pika, Gen2) outperform open-source ones (VideoCrafter, I2VGen-XL, SVD) in generating long, high-quality videos. Current I2V algorithms lack fine-grained control over generated content, limiting precise alignment with textual descriptions. AIGCBench's evaluation standard correlates well with human judgment, validating its effectiveness in assessing I2V algorithms. Limited test cases (3950) due to slow inference speeds and closed-source projects. Inability to automatically evaluate fine-grained object motion alignment with text descriptions. artificial intelligence generated content, video generation, image-to-video benchmark, diffusion model, multimodal ai
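As a flavour of the metric suite, temporal consistency is commonly measured as the average cosine similarity between embeddings of adjacent generated frames. The function below is a generic version of that idea, not AIGCBench's exact formula, and assumes frame features from any image encoder such as CLIP.

```python
import numpy as np

def temporal_consistency(frame_feats):
    """Mean cosine similarity between adjacent frame embeddings.

    frame_feats: (T, D) per-frame features from an image encoder. Higher values
    indicate smoother, more temporally consistent videos.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    return float((f[:-1] * f[1:]).sum(axis=1).mean())

print(temporal_consistency(np.random.randn(16, 512)))   # near 0 for random features
```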
2401.01647 Report SIGNeRF: Scene Integrated Generation for Neural Radiance Fields Jan-Niklas Dihlmann, Andreas Engelhardt, Hendrik Lensch Advances in image diffusion models have recently led to notable improvements in the generation of high-quality images. In combination with Neural Radiance Fields (NeRFs), they enabled new opportunities in 3D generation. However, most generative 3D approaches are object-centric and applying them to editing existing photorealistic scenes is not trivial. We propose SIGNeRF, a novel approach for fast and controllable NeRF scene editing and scene-integrated object generation. A new generative update strategy ensures 3D consistency across the edited images, without requiring iterative optimization. We find that depth-conditioned diffusion models inherently possess the capability to generate 3D consistent views by requesting a grid of images instead of single views. Based on these insights, we introduce a multi-view reference sheet of modified images. Our method updates an image collection consistently based on the reference sheet and refines the original NeRF with the newly generated image set in one go. By exploiting the depth conditioning mechanism of the image diffusion model, we gain fine control over the spatial location of the edit and enforce shape guidance by a selected region or an external mesh. SIGNeRF: a novel approach for fast and controllable NeRF scene editing and scene-integrated object generation using a reference-sheet-based assembly and a generative update strategy. Simplifies and enhances control over generative NeRF editing, enabling more complex and realistic modifications compared to previous methods. Utilizes ControlNet, a depth-conditioned image diffusion model, to generate a multi-view consistent reference sheet of edits. The reference sheet then guides the efficient update of the original NeRF dataset, resulting in a modified 3D scene. Achieves superior object generation and editing within complex NeRF scenes with consistent lighting and textures. Offers precise control over object placement, orientation, size, and appearance using shape selection or proxy mesh guidance. Provides a preview of the edited scene with the reference sheet before generating the complete dataset, unlike existing methods. Image downscaling for reference sheet generation can lead to loss of detail in the edits. Extended scene modifications are limited due to the focus on a central object in the reference sheet. nerf, scene editing, 3d generation, image diffusion, controlnet
2401.01520 Report S$^{2}$-DMs:Skip-Step Diffusion Models Yixuan Wang, Shuangyin Li Diffusion models have emerged as powerful generative tools, rivaling GANs in sample quality and mirroring the likelihood scores of autoregressive models. A subset of these models, exemplified by DDIMs, exhibit an inherent asymmetry: they are trained over $T$ steps but only sample from a subset of $T$ during generation. This selective sampling approach, though optimized for speed, inadvertently misses out on vital information from the unsampled steps, leading to potential compromises in sample quality. To address this issue, we present the S$^{2}$-DMs, which is a new training method by using an innovative $L_{skip}$, meticulously designed to reintegrate the information omitted during the selective sampling phase. The benefits of this approach are manifold: it notably enhances sample quality, is exceptionally simple to implement, requires minimal code modifications, and is flexible enough to be compatible with various sampling algorithms. On the CIFAR10 dataset, models trained using our algorithm showed an improvement of 3.27% to 14.06% over models trained with traditional methods across various sampling algorithms (DDIMs, PNDMs, DEIS) and different numbers of sampling steps (10, 20, ..., 1000). On the CELEBA dataset, the improvement ranged from 8.97% to 27.08%. Access to the code and additional resources is provided in the github. This paper introduces Skip-Step Diffusion Models (S$^2$-DMs), a novel method to enhance the performance of diffusion models, particularly those employing accelerated sampling techniques like DDIMs. Diffusion models often suffer from slow sampling speed. While methods like DDIMs accelerate this by skipping steps, they introduce a discrepancy between the step-by-step training and skip-step sampling, compromising sample quality. S$^2$-DMs addresses this asymmetry. The core of S$^2$-DMs is the introduction of a novel 'skip-step loss' ($L_{skip}$) during training. This loss function encourages the model to learn from the information typically missed during skip-step sampling, thereby improving consistency. S$^2$-DMs consistently outperforms baseline models like DDIMs, PNDMs, and DEIS in image generation tasks on CIFAR10 and CelebA datasets, achieving better FID scores with the same number of sampling steps. The integration of skip-step information leads to higher-quality samples, as demonstrated by visual comparisons. S$^2$-DMs generates sharper images with finer details compared to baselines. The method is highly efficient and easy to implement. It requires minimal modifications to the training process and doesn't alter the sampling algorithm, making it user-friendly. The paper primarily focuses on image generation, and further exploration is needed to evaluate its applicability in other domains. Future work will investigate the optimal integration of skip-step information into ODEs and explore its potential in non-continuous spaces. diffusion models, generative models, image generation, accelerated sampling, skip-step sampling
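A rough sketch of how a skip-step term could sit next to the usual noise-prediction loss: after predicting the noise at step t, the model's implied jump to the skipped step t - k is compared against the ground-truth state at that step. The concrete form of L_skip used here (a DDIM-style deterministic jump plus MSE) is an assumption for illustration only; the paper defines its own formulation.

```python
import torch
import torch.nn.functional as F

def skip_step_loss(model, x0, alphas_cumprod, skip=10):
    """Standard eps-prediction loss plus a hedged skip-step term that supervises
    the jump the accelerated sampler will actually take (t -> t - skip)."""
    b = x0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(skip, T, (b,))
    eps = torch.randn_like(x0)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps
    eps_hat = model(x_t, t)

    loss_simple = F.mse_loss(eps_hat, eps)

    # Predict x0 from eps_hat, re-noise it to the skipped step, and compare it
    # with the ground-truth state at that step.
    x0_hat = (x_t - (1 - a_t).sqrt() * eps_hat) / a_t.sqrt()
    a_s = alphas_cumprod[t - skip].view(b, 1, 1, 1)
    x_s_hat = a_s.sqrt() * x0_hat + (1 - a_s).sqrt() * eps_hat
    x_s_true = a_s.sqrt() * x0 + (1 - a_s).sqrt() * eps
    return loss_simple + F.mse_loss(x_s_hat, x_s_true)

model = lambda x, t: torch.zeros_like(x)                  # stub denoiser
alphas = torch.linspace(0.9999, 0.98, 1000).cumprod(0)    # assumed noise schedule
print(skip_step_loss(model, torch.randn(2, 3, 32, 32), alphas).item())
```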
2401.01339 Report Street Gaussians for Modeling Dynamic Urban Scenes Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, Sida Peng This paper aims to tackle the problem of modeling dynamic urban street scenes from monocular videos. Recent methods extend NeRF by incorporating tracked vehicle poses to animate vehicles, enabling photo-realistic view synthesis of dynamic urban street scenes. However, significant limitations are their slow training and rendering speed, coupled with the critical need for high precision in tracked vehicle poses. We introduce Street Gaussians, a new explicit scene representation that tackles all these limitations. Specifically, the dynamic urban street is represented as a set of point clouds equipped with semantic logits and 3D Gaussians, each associated with either a foreground vehicle or the background. To model the dynamics of foreground object vehicles, each object point cloud is optimized with optimizable tracked poses, along with a dynamic spherical harmonics model for the dynamic appearance. The explicit representation allows easy composition of object vehicles and background, which in turn allows for scene editing operations and rendering at 133 FPS (1066$\times$1600 resolution) within half an hour of training. The proposed method is evaluated on multiple challenging benchmarks, including KITTI and Waymo Open datasets. Experiments show that the proposed method consistently outperforms state-of-the-art methods across all datasets. Furthermore, the proposed representation delivers performance on par with that achieved using precise ground-truth poses, despite relying only on poses from an off-the-shelf tracker. The code is available at https://zju3dv.github.io/street_gaussians/. This paper presents Street-Gaussians, a novel explicit scene representation for efficiently reconstructing dynamic 3D street scenes from monocular videos and rendering high-fidelity novel views in real-time. Modeling dynamic 3D streets from images has many important applications, such as city simulation, autonomous driving, and gaming. Existing methods suffer from slow training and rendering speeds and rely heavily on accurate tracked vehicle poses. Street-Gaussians represents the dynamic urban street as a set of point clouds equipped with semantic logits and 3D Gaussians, each associated with either a foreground vehicle or the background. To model the dynamics, each object point cloud is optimized with optimizable tracked poses, along with a dynamic spherical harmonics model for the dynamic appearance. Street-Gaussians consistently outperforms state-of-the-art methods in terms of rendering quality on KITTI and Waymo Open datasets. The method achieves real-time rendering speed of 133 FPS at a resolution of 1066x1600. Street-Gaussians delivers performance on par with methods using precise ground-truth poses, despite relying only on poses from an off-the-shelf tracker. The method is limited to reconstructing rigid dynamic scenes and cannot handle non-rigid dynamic objects like pedestrians. The performance is dependent on the recall rate of off-the-shelf trackers. 3d scene reconstruction, dynamic scene modeling, neural rendering, autonomous driving, point cloud representation
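The explicit composition that makes editing and fast rendering possible is essentially a rigid transform of each vehicle's Gaussians by its tracked pose, followed by concatenation with the background set. The sketch below shows only the Gaussian centres; the full method also transforms the covariances, optimises the tracked poses, and models appearance with time-varying spherical harmonics.

```python
import numpy as np

def compose_scene(bg_means, obj_means, obj_pose):
    """Place one vehicle's Gaussian centres into the background for a frame.

    obj_pose: 4x4 object-to-world transform from an off-the-shelf tracker
    (kept optimizable in the paper).
    """
    R, t = obj_pose[:3, :3], obj_pose[:3, 3]
    obj_world = obj_means @ R.T + t
    return np.concatenate([bg_means, obj_world], axis=0)

bg = np.random.randn(10000, 3) * 20.0            # static background Gaussians
car = np.random.randn(2000, 3) * 2.0             # object-frame vehicle Gaussians
pose = np.eye(4)
pose[:3, 3] = [5.0, 0.0, 1.0]                    # tracked pose for this frame
print(compose_scene(bg, car, pose).shape)        # (12000, 3)
```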
2401.01256 Report VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for the given prompts. Most existing works tackle the single-scene scenario with only one video event occurring in a single background. Extending to generate multi-scene videos nevertheless is not trivial and necessitates to nicely manage the logic in between while preserving the consistent visual appearance of key content across video scenes. In this paper, we propose a novel framework, namely VideoDrafter, for content-consistent multi-scene video generation. Technically, VideoDrafter leverages Large Language Models (LLM) to convert the input prompt into comprehensive multi-scene script that benefits from the logical knowledge learnt by LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, as well as camera movement. VideoDrafter identifies the common entities throughout the script and asks LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoDrafter outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event and camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoDrafter outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference. Proposes VideoDrafter, a framework for generating content-consistent multi-scene videos from text prompts. Most existing video generation methods focus on single-scene videos, leaving multi-scene generation with consistent content largely unexplored. Utilizes a Large Language Model (LLM) to convert prompts into multi-scene scripts and generate descriptions for common entities. Employs a text-to-image model to generate reference images for these entities, ensuring consistency across scenes. Introduces two diffusion models: VideoDrafter-Img for generating scene-reference images based on prompts and entity references, and VideoDrafter-Vid for producing video clips based on scene-reference images, action descriptions, and camera movements. VideoDrafter outperforms state-of-the-art video generation models in terms of visual quality (FID, FVD) and content consistency (Scene Consis.). The use of entity reference images significantly enhances the consistency of entities across scenes. Human evaluations confirm VideoDrafter's superiority in generating logically coherent and content-consistent multi-scene videos. The performance of open-source LLMs in script generation can be unstable, demanding careful prompt engineering and output verification. The lack of optimization for Stable Diffusion on video frames might lead to suboptimal frame quality. video generation, diffusion models, multi-scene video, content consistency, large language models
2401.01216 Report Noise-NeRF: Hide Information in Neural Radiance Fields using Trainable Noise Qinglong Huang, Yong Liao, Yanbin Hao, Pengyuan Zhou Neural radiance fields (NeRF) have been proposed as an innovative 3D representation method. While attracting lots of attention, NeRF faces critical issues such as information confidentiality and security. Steganography is a technique used to embed information in another object as a means of protecting information security. Currently, there are few related studies on NeRF steganography, facing challenges in low steganography quality, model weight damage, and a limited amount of steganographic information. This paper proposes a novel NeRF steganography method based on trainable noise: Noise-NeRF. Furthermore, we propose the Adaptive Pixel Selection strategy and Pixel Perturbation strategy to improve the steganography quality and efficiency. The extensive experiments on open-source datasets show that Noise-NeRF provides state-of-the-art performances in both steganography quality and rendering quality, as well as effectiveness in super-resolution image steganography. This paper proposes Noise-NeRF, a novel Neural Radiance Fields (NeRF) steganography method that embeds secret information using trainable noise without modifying the model weights, ensuring lossless steganography and preserving rendering quality. NeRF steganography, crucial for information confidentiality and model copyright protection, faces challenges in steganography quality, model weight damage, and limited information volume. Noise-NeRF addresses these limitations. Noise-NeRF introduces trainable noise to specific viewpoints, optimizing it iteratively using backpropagation to minimize the difference between the rendered steganographic image and the target. It employs Adaptive Pixel Selection and Pixel Perturbation strategies to enhance steganography quality and efficiency. Noise-NeRF achieves state-of-the-art steganography quality, achieving over 98% similarity on multiple benchmark datasets. Unlike existing methods that modify model weights, Noise-NeRF maintains the original rendering quality of NeRF, ensuring lossless steganography. Noise-NeRF demonstrates effectiveness in super-resolution image steganography, successfully embedding 2K resolution images into NeRF scenes with high fidelity. The current implementation of Noise-NeRF focuses on steganography for a single viewpoint; extending it to multiple viewpoints is a promising future direction. Investigating the robustness of Noise-NeRF against various attacks and developing countermeasures will further enhance its practical applicability. neural radiance fields, steganography, implicit neural representation, information security, 3d reconstruction
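The core loop of the trainable-noise idea is an optimisation over an additive noise tensor for one chosen viewpoint while the NeRF weights stay frozen, so the model itself is never damaged. In the sketch below, `render` is a stand-in for a frozen NeRF renderer and the input encoding and resolution are simplified placeholders; only the structure of the loop reflects the paper.

```python
import torch
import torch.nn.functional as F

# Frozen-NeRF steganography sketch: optimise only the noise added to the encoded
# ray samples of one view so the rendered view reproduces the secret image.
render = lambda x: torch.sigmoid(x[:, :3])        # (N, 48) encoded samples -> (N, 3) RGB

ray_inputs = torch.randn(64 * 64, 48)             # fixed inputs for the chosen view
secret = torch.rand(64 * 64, 3)                   # secret image, flattened to pixels
noise = torch.zeros_like(ray_inputs, requires_grad=True)

optimizer = torch.optim.Adam([noise], lr=1e-2)
for _ in range(200):                              # iterative noise update via backprop
    optimizer.zero_grad()
    loss = F.mse_loss(render(ray_inputs + noise), secret)
    loss.backward()
    optimizer.step()
print(f"residual after optimisation: {loss.item():.4f}")
```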
2401.01207 Report Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation Renshuai Liu, Bowen Ma, Wei Zhang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Xuan Cheng In human-centric content generation, the pre-trained text-to-image models struggle to produce user-wanted portrait images, which retain the identity of individuals while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end, we propose a novel multi-modal face generation framework, capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is so sophisticated that it can be specialized by the fine-grained emotional vocabulary. We devise a novel diffusion model that can undertake the task of simultaneously face swapping and reenactment. Due to the entanglement of identity and expression, it's nontrivial to separately and precisely control them in one framework, thus has not been explored yet. To overcome this, we propose several innovative designs in the conditional diffusion model, including balancing identity and expression encoder, improved midpoint sampling, and explicitly background conditioning. Extensive experiments have demonstrated the controllability and scalability of the proposed framework, in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods. This paper introduces a novel multi-modal face generation framework that allows simultaneous control over identity, expression, and background, enabling fine-grained expression synthesis. Current text-to-image models struggle to generate user-desired portraits that retain individual identity while exhibiting diverse expressions. This framework addresses this limitation by allowing for precise control over these aspects. The framework leverages a novel diffusion model called DiffSFSR (Simultaneous Face Swapping and Reenactment) that takes a selfie photo (identity), a text prompt (background), and an expression label as input. It employs techniques like balancing identity and expression encoders, improved midpoint sampling, and explicit background conditioning for enhanced control and quality. The framework achieves fine-grained expression synthesis, surpassing state-of-the-art text-to-image methods in generating 135 distinct expressions. DiffSFSR outperforms hybrid methods (combining separate face-swapping and reenactment techniques) in simultaneous face swapping and reenactment tasks. User studies confirm the framework's ability to generate high-fidelity portraits with high consistency in identity and expression, exceeding existing methods in realism and image quality. The framework's expression synthesis relies on a dataset with potential inconsistencies between expression labels and actual images, which can lead to semantic mismatches. Ambiguity and overlap between certain expression labels pose a challenge for accurate and distinct synthesis. face generation, diffusion models, expression synthesis, face swapping, face reenactment
2401.01173 Report En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data Yifang Men, Biwen Lei, Yuan Yao, Miaomiao Cui, Zhouhui Lian, Xuansong Xie We present En3D, an enhanced generative scheme for sculpting high-quality 3D human avatars. Unlike previous works that rely on scarce 3D datasets or limited 2D collections with imbalanced viewing angles and imprecise pose priors, our approach aims to develop a zero-shot 3D generative scheme capable of producing visually realistic, geometrically accurate and content-wise diverse 3D humans without relying on pre-existing 3D or 2D assets. To address this challenge, we introduce a meticulously crafted workflow that implements accurate physical modeling to learn the enhanced 3D generative model from synthetic 2D data. During inference, we integrate optimization modules to bridge the gap between realistic appearances and coarse 3D shapes. Specifically, En3D comprises three modules: a 3D generator that accurately models generalizable 3D humans with realistic appearance from synthesized balanced, diverse, and structured human images; a geometry sculptor that enhances shape quality using multi-view normal constraints for intricate human anatomy; and a texturing module that disentangles explicit texture maps with fidelity and editability, leveraging semantical UV partitioning and a differentiable rasterizer. Experimental results show that our approach significantly outperforms prior works in terms of image quality, geometry accuracy and content diversity. We also showcase the applicability of our generated avatars for animation and editing, as well as the scalability of our approach for content-style free adaptation. Presents En3D, a zero-shot generative scheme for creating high-quality 3D human avatars from synthetic 2D data, eliminating the need for pre-existing 3D or 2D datasets. Addresses limitations of previous methods that relied on scarce 3D datasets or limited 2D collections, resulting in avatars with limited realism, geometric accuracy, and content diversity. Employs a three-module pipeline: 1) 3D generative modeling (3DGM) learns from synthetic 2D images with accurate physical parameters. 2) Geometric sculpting (GS) refines shapes using multi-view normal constraints. 3) Explicit texturing (ET) generates UV texture maps via semantic UV partitioning and a differentiable rasterizer. Significantly outperforms prior art in generating realistic and diverse 3D humans with high-fidelity geometry. Demonstrates capabilities for avatar animation, texture editing, and content-style adaptation (e.g., generating portrait heads or Disney-style characters). Achieves state-of-the-art results in quantitative metrics such as FID, IS-360, and normal accuracy. Limited detail in generated hands, sometimes requiring replacement with SMPL-X templates. Future work could explore higher-resolution synthesis and more complex garment types. 3d human generation, generative adversarial networks, text-to-3d, avatar animation, 3d shape and texture editing
2401.01130 Report Joint Generative Modeling of Scene Graphs and Images via Diffusion Models Bicheng Xu, Qi Yan, Renjie Liao, Lele Wang, Leonid Sigal In this paper, we present a novel generative task: joint scene graph - image generation. While previous works have explored image generation conditioned on scene graphs or layouts, our task is distinctive and important as it involves generating scene graphs themselves unconditionally from noise, enabling efficient and interpretable control for image generation. Our task is challenging, requiring the generation of plausible scene graphs with heterogeneous attributes for nodes (objects) and edges (relations among objects), including continuous object bounding boxes and discrete object and relation categories. We introduce a novel diffusion model, DiffuseSG, that jointly models the adjacency matrix along with heterogeneous node and edge attributes. We explore various types of encodings for the categorical data, relaxing it into a continuous space. With a graph transformer being the denoiser, DiffuseSG successively denoises the scene graph representation in a continuous space and discretizes the final representation to generate the clean scene graph. Additionally, we introduce an IoU regularization to enhance the empirical performance. Our model significantly outperforms existing methods in scene graph generation on the Visual Genome and COCO-Stuff datasets, both on standard and newly introduced metrics that better capture the problem complexity. Moreover, we demonstrate the additional benefits of our model in two downstream applications: 1) excelling in a series of scene graph completion tasks, and 2) improving scene graph detection models by using extra training samples generated from DiffuseSG. This paper introduces a novel task of joint scene graph and image generation and proposes DiffuseSG, a diffusion-based model, to generate plausible scene graphs with heterogeneous attributes including object bounding boxes, object categories, and relations. Generating scene graphs is important as it enables efficient and interpretable control for image generation and can provide synthetic data to augment the training of scene graph prediction models, which traditionally rely on costly annotated data. The authors employ a two-step approach: first, they train DiffuseSG to generate scene graphs by modeling the adjacency matrix and node/edge attributes in a continuous space using a graph transformer as the denoiser. Second, a pre-trained layout-to-image model generates images conditioned on the generated scene graphs. DiffuseSG significantly outperforms existing methods in scene graph generation on Visual Genome and COCO-Stuff datasets based on standard and newly introduced metrics. The model shows promising results in scene graph completion tasks, demonstrating its capability to infer missing information. Using generated scene graph-image pairs as additional training data improves the performance of downstream scene graph detection models. The current approach uses a two-step process for scene graph and image generation, which might limit the coherence between the generated outputs. Future work includes exploring a single unified model for joint generation, improving the handling of the tail relations in scene graphs, and extending the approach to more complex image generation tasks. scene graph generation, image generation, diffusion models, graph transformers, generative models
2401.01128 Report SSP: A Simple and Safe automatic Prompt engineering method towards realistic image synthesis on LVM Weijin Cheng, Jianzhi Liu, Jiawen Deng, Fuji Ren Recently, text-to-image (T2I) synthesis has undergone significant advancements, particularly with the emergence of Large Language Models (LLM) and their enhancement in Large Vision Models (LVM), greatly enhancing the instruction-following capabilities of traditional T2I models. Nevertheless, previous methods focus on improving generation quality but introduce unsafe factors into prompts. We find that appending specific camera descriptions to prompts can enhance safety performance. Consequently, we propose a simple and safe prompt engineering method (SSP) to improve image generation quality by providing optimal camera descriptions. Specifically, we create a dataset of original prompts drawn from multiple existing datasets. To select the optimal camera, we design an optimal camera matching approach and implement a classifier that automatically matches original prompts to cameras. Appending camera descriptions to original prompts generates optimized prompts for further LVM image generation. Experiments demonstrate that SSP improves semantic consistency by an average of 16% compared to others and safety metrics by 48.9%. This paper introduces SSP, a simple and safe prompt engineering method for Large Vision Models (LVMs) that enhances image generation quality and safety by appending optimal camera descriptions to original prompts. Existing prompt engineering methods for LVMs often introduce randomness, which can alter the original semantics, introduce unsafe factors, and raise safety concerns. SSP addresses these issues by providing specific camera descriptions that improve image quality while maintaining safety. The authors create a dataset of original prompts from multiple sources and manually select optimal cameras for different image categories based on FID and CLIP Score. They then fine-tune a BERT model to automatically match optimal camera descriptions to new prompts. SSP improves semantic consistency by an average of 16% compared to other methods. SSP enhances safety metrics by 48.9% compared to baselines, demonstrating a significant reduction in unsafe content generation. Text feature analysis reveals that SSP effectively influences prompt text features, leading to more realistic and visually appealing images. The evaluation of image authenticity solely relies on FID and lacks dedicated metrics. The study is limited by the accessibility of various LVMs, hindering broader comparisons with other models. prompt engineering, large vision models, text-to-image synthesis, image generation, safety
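The SSP pipeline reduces to a classify-then-append step, sketched below. The category names, camera strings, and the keyword classifier standing in for the fine-tuned BERT are placeholders, not the paper's actual mapping.

```python
# Minimal sketch of the SSP idea: classify the prompt into a category and append
# that category's pre-selected camera description to the prompt.
CAMERA_BY_CATEGORY = {
    "portrait": "shot on a full-frame DSLR, 85mm lens, f/1.8",
    "landscape": "shot on a full-frame DSLR, 24mm lens, f/8",
    "object": "shot on a full-frame DSLR, 50mm macro lens, f/4",
}

def classify_prompt(prompt: str) -> str:
    """Stand-in for the fine-tuned BERT classifier described in the summary."""
    lowered = prompt.lower()
    if any(w in lowered for w in ("person", "face", "man", "woman")):
        return "portrait"
    if any(w in lowered for w in ("mountain", "beach", "city", "forest")):
        return "landscape"
    return "object"

def optimize_prompt(prompt: str) -> str:
    return f"{prompt}, {CAMERA_BY_CATEGORY[classify_prompt(prompt)]}"

print(optimize_prompt("a woman reading in a cafe"))
```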
2401.01117 Report Q-Refine: A Perceptual Quality Refiner for AI-Generated Image Chunyi Li, Haoning Wu, Zicheng Zhang, Hongkun Hao, Kaiwei Zhang, Lei Bai, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai With the rapid evolution of Text-to-Image (T2I) models in recent years, their unsatisfactory generation results have become a challenge. However, uniformly refining AI-Generated Images (AIGIs) of different qualities not only limits the optimization of low-quality AIGIs but also brings negative optimization to high-quality AIGIs. To address this issue, a quality-aware refiner named Q-Refine is proposed. Based on the preference of the Human Visual System (HVS), Q-Refine uses the Image Quality Assessment (IQA) metric to guide the refining process for the first time, and modifies images of different qualities through three adaptive pipelines. Experiments show that for mainstream T2I models, Q-Refine can effectively optimize AIGIs of different qualities. It can be a general refiner to optimize AIGIs from both fidelity and aesthetic quality levels, thus expanding the application of T2I generation models. Q-Refine, a novel quality-aware refiner for AI-Generated Images (AIGIs), is proposed. It leverages Image Quality Assessment (IQA) metrics to guide the refining process based on the Human Visual System (HVS) preferences. Existing AIGI refiners lack quality awareness, leading to insufficient enhancement in low-quality regions and negative optimization in high-quality regions. Q-Refine employs an IQA module to predict a quality map and utilizes three adaptive pipelines: Gaussian Noise for low-quality regions, Mask Inpainting for medium-quality regions, and Global Enhancement for high-quality regions. Q-Refine outperforms existing refiners on mainstream AIGI quality databases, achieving state-of-the-art results in most quality metrics. It effectively refines AIGIs of different qualities, demonstrating versatility across low, medium, and high-quality regions. Q-Refine consistently improves AIGI quality without causing negative optimization, as evidenced by ablation studies. The IQA module's computational complexity might affect the efficiency of the refining process. The selection of optimal thresholds for quality regions could be further investigated. ai-generated content, image quality assessment, image restoration, text-to-image synthesis, perceptual quality
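The three-pipeline routing can be sketched directly from a predicted quality map. The thresholds and the pipeline assignments in the comments are assumptions for illustration, not Q-Refine's tuned values.

```python
# Illustrative sketch (not the released Q-Refine code): route image regions to
# three refinement pipelines based on a per-pixel quality map.
import numpy as np

def route_regions(quality_map: np.ndarray, low_thr: float = 0.3, high_thr: float = 0.7):
    """Return boolean masks selecting the low/medium/high-quality pipelines."""
    low = quality_map < low_thr               # -> add Gaussian noise and regenerate
    high = quality_map >= high_thr            # -> global enhancement only
    medium = ~low & ~high                     # -> mask-based inpainting
    return low, medium, high

quality_map = np.random.rand(64, 64)          # stand-in for the IQA module output
low, medium, high = route_regions(quality_map)
print(low.sum(), medium.sum(), high.sum())
```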
2401.01008 Report Fast Inference Through The Reuse Of Attention Maps In Diffusion Models Rosco Hunter, Łukasz Dudziak, Mohamed S. Abdelfattah, Abhinav Mehrotra, Sourav Bhattacharya, Hongkai Wen Text-to-image diffusion models have demonstrated unprecedented abilities at flexible and realistic image synthesis. However, the iterative process required to produce a single image is costly and incurs a high latency, prompting researchers to further investigate its efficiency. Typically, improvements in latency have been achieved in two ways: (1) training smaller models through knowledge distillation (KD); and (2) adopting techniques from ODE-theory to facilitate larger step sizes. In contrast, we propose a training-free approach that does not alter the step-size of the sampler. Specifically, we find the repeated calculation of attention maps to be both costly and redundant; therefore, we propose a structured reuse of attention maps during sampling. Our initial reuse policy is motivated by rudimentary ODE-theory, which suggests that reuse is most suitable late in the sampling procedure. After noting a number of limitations in this theoretical approach, we empirically search for a better policy. Unlike methods that rely on KD, our reuse policies can easily be adapted to a variety of setups in a plug-and-play manner. Furthermore, when applied to Stable Diffusion-1.5, our reuse policies reduce latency with minimal repercussions on sample quality. This paper introduces training-free reuse policies for attention maps in text-to-image diffusion models, reducing latency without retraining or increasing step size. Diffusion models, despite impressive performance, suffer from high latency due to the iterative nature and computational cost of U-Net calls, hindering their real-time applicability. The authors analyze attention map redundancy and propose two policies: HURRY, based on Lyapunov exponents suggesting late reuse, and PHAST, a refinement of HURRY through local search for optimal reuse steps. PHAST and HURRY significantly outperform random attention reuse policies. These policies, at comparable latency, produce samples closer to a 20-step DDIM baseline than 13-step DDIM, indicating better fidelity. Evaluation on MS-COCO shows comparable CLIP-Score and FID to baselines, with marginally lower FID suggesting minor distributional distortion. The assumption of binary step-wise policies, while empirically supported, might not be globally optimal. The memory-latency trade-off, while addressed with reduced precision caching, requires further investigation for memory-constrained systems. diffusion models, text-to-image synthesis, latency reduction, attention mechanism, reuse policies
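The attention-reuse idea lends itself to a compact sketch. The toy attention layer and the binary recompute-early/reuse-late policy below are illustrative assumptions (the paper's HURRY/PHAST policies are derived or searched, not hard-coded); the sketch only shows the caching mechanics of skipping the softmax(QK^T) computation when a step is flagged for reuse.

```python
# Illustrative sketch: cache attention maps and reuse them late in the sampling loop.
import torch
import torch.nn.functional as F

class CachedAttention(torch.nn.Module):
    """Toy attention layer that can reuse the last computed attention map."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)
        self.cached_map = None  # attention probabilities from a previous step

    def forward(self, x: torch.Tensor, reuse: bool = False) -> torch.Tensor:
        v = self.to_v(x)
        if reuse and self.cached_map is not None:
            attn = self.cached_map           # skip the costly QK^T softmax
        else:
            q, k = self.to_q(x), self.to_k(x)
            attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
            self.cached_map = attn.detach()  # refresh the cache
        return attn @ v

# A binary step-wise policy: recompute early, reuse late (fraction is illustrative).
def reuse_policy(step: int, total_steps: int, reuse_fraction: float = 0.4) -> bool:
    return step >= int((1.0 - reuse_fraction) * total_steps)

layer = CachedAttention(dim=64)
x = torch.randn(1, 16, 64)
for t in range(20):
    _ = layer(x, reuse=reuse_policy(t, 20))
```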
2401.00935 Report Boundary Attention: Learning to Localize Boundaries under High Noise Mia Gaia Polansky, Charles Herrmann, Junhwa Hur, Deqing Sun, Dor Verbin, Todd Zickler We present a differentiable model that infers explicit boundaries, including curves, corners and junctions, using a mechanism that we call boundary attention. Boundary attention is a boundary-aware local attention operation that, when applied densely and repeatedly, progressively refines a field of variables that specify an unrasterized description of the local boundary structure in every overlapping patch within an image. It operates in a bottom-up fashion, similar to classical methods for sub-pixel edge localization and edge-linking, but with a higher-dimensional description of local boundary structure, a notion of spatial consistency that is learned instead of designed, and a sequence of operations that is end-to-end differentiable. We train our model using simple synthetic data and then evaluate it using photographs that were captured under low-light conditions with variable amounts of noise. We find that our method generalizes to natural images corrupted by real sensor noise, and predicts consistent boundaries under increasingly noisy conditions where other state-of-the-art methods fail. This work introduces Boundary Attention, a novel deep network model designed for robust boundary detection in images, particularly under significant noise. Robust boundary detection is crucial for various computer vision tasks but remains challenging, especially in noisy conditions. Existing methods often struggle to balance detail preservation and noise suppression. The model utilizes a novel iterative refinement approach. It operates locally and refines boundary estimates within spatial neighborhoods using learned geometric primitives (junctions) and adaptive attention mechanisms. The model demonstrates state-of-the-art performance on established boundary detection benchmarks, particularly under high noise levels. It effectively leverages color information for boundary localization and grouping, even without relying on semantic understanding. The learned junction representation exhibits a spatially smooth manifold in the model's hidden state, allowing for intuitive interpolation and manipulation of boundary structures. The model's reliance on local operations may limit its ability to incorporate global context for boundary detection in some cases. Future work includes exploring extensions for handling more complex boundary structures and incorporating semantic information for enhanced performance. boundary detection, deep learning, iterative refinement, attention mechanisms, noise robustness
2401.00909 Report Taming Mode Collapse in Score Distillation for Text-to-3D Generation Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra Despite the remarkable performance of score distillation in text-to-3D generation, such techniques notoriously suffer from view inconsistency issues, also known as "Janus" artifact, where the generated objects fake each view with multiple front faces. Although empirically effective methods have approached this problem via score debiasing or prompt engineering, a more rigorous perspective to explain and tackle this problem remains elusive. In this paper, we reveal that the existing score distillation-based text-to-3D generation frameworks degenerate to maximal likelihood seeking on each view independently and thus suffer from the mode collapse problem, manifesting as the Janus artifact in practice. To tame mode collapse, we improve score distillation by re-establishing the entropy term in the corresponding variational objective, which is applied to the distribution of rendered images. Maximizing the entropy encourages diversity among different views in generated 3D assets, thereby mitigating the Janus problem. Based on this new objective, we derive a new update rule for 3D score distillation, dubbed Entropic Score Distillation (ESD). We theoretically reveal that ESD can be simplified and implemented by just adopting the classifier-free guidance trick upon variational score distillation. Although embarrassingly straightforward, our extensive experiments successfully demonstrate that ESD can be an effective treatment for Janus artifacts in score distillation. This paper proposes Entropic Score Distillation (ESD), a method to address the view inconsistency ("Janus") problem in text-to-3D generation using score distillation. Existing score distillation techniques for text-to-3D generation suffer from the "Janus" artifact where generated objects have multiple front faces. This is attributed to the mode collapse problem arising from the optimization degenerating to maximal likelihood seeking on each view independently. ESD introduces entropy regularization to the score distillation objective, encouraging diversity among different views of the generated 3D assets. It is implemented by leveraging the Classifier-Free Guidance (CFG) trick upon variational score distillation, mixing conditional and unconditional scores during training. ESD effectively mitigates the Janus problem, producing 3D objects with better view consistency. ESD improves 3D generation quality compared to baseline methods, as demonstrated by qualitative and quantitative evaluations including FID and CLIP score. The paper introduces Inception Quality (IQ) and Inception Variety (IV) metrics to numerically probe and evaluate mode collapse and view diversity in text-to-3D generation. ESD might still be susceptible to mode collapse when the target image distribution is highly concentrated on one mode. The applicability of ESD to multi-particle VSD or amortized text-to-3D training remains unexplored. text-to-3d generation, score distillation, janus problem, mode collapse, entropy regularization
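As a rough illustration of the CFG-style mixing described above, the snippet below sketches one plausible reading of the ESD update direction: the variational score is a convex combination of camera-conditioned and unconditioned predictions, with the mixing weight standing in for the strength of the entropy term. The function name, the weight `lam`, and the dummy tensors are assumptions, not the paper's exact rule.

```python
# Illustrative sketch of an ESD-style distilled score for one noised rendering.
import torch

def esd_update_direction(eps_pretrained, eps_var_cond, eps_var_uncond, lam=0.5):
    """lam = 1 recovers a purely conditional variational score (VSD-like);
    lam < 1 mixes in the unconditional variational score, which is how the
    entropy regularization is read in this sketch."""
    eps_var = lam * eps_var_cond + (1.0 - lam) * eps_var_uncond
    return eps_pretrained - eps_var  # back-propagated through the renderer in practice

# Toy usage with random tensors standing in for network outputs.
shape = (1, 4, 64, 64)
g = esd_update_direction(torch.randn(shape), torch.randn(shape), torch.randn(shape))
print(g.shape)
```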
2401.00896 Report TrailBlazer: Trajectory Control for Diffusion-Based Video Generation Wan-Duo Kurt Ma, J. P. Lewis, W. Bastiaan Kleijn Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos. Our algorithm, TrailBlazer, is constructed upon a pre-trained (T2V) model, and easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases. TrailBlazer enhances diffusion-based text-to-video generation by enabling precise control over subject trajectories and appearance through simple bounding box and prompt keyframing. Existing text-to-video methods lack fine-grained control over subject motion, relying on labor-intensive frame-by-frame guidance. TrailBlazer provides an intuitive, user-friendly interface for casual users to direct subject motion. TrailBlazer leverages pre-trained video diffusion models (ZeroScope) and manipulates spatial and temporal attention maps during the denoising process based on user-defined bounding boxes and prompt keyframes. This guidance steers subject generation without requiring model training or optimization. TrailBlazer achieves accurate subject trajectory control, even with complex paths and dynamic bounding box sizes. The method produces natural motion with emergent perspective effects and object orientation consistent with the specified trajectory. TrailBlazer enables subject morphing by interpolating prompt embeddings, facilitating smooth transitions between identities within a video clip. TrailBlazer inherits limitations from the underlying diffusion model, including potential object deformations and challenges with multi-object generation. The method's performance relies on consistency between the prompt and the keyframed bounding box trajectory. Extreme motion or unrealistic paths may lead to artifacts. text-to-video synthesis, diffusion models, motion control, trajectory guidance, subject morphing
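The box-guided attention editing in TrailBlazer can be pictured with a small sketch: cross-attention weights for the subject token are boosted inside the user box and left unchanged outside, then renormalized. The tensor layout, the `strength` parameter, and the uniform boost are illustrative assumptions rather than the paper's exact spatial and temporal editing rule.

```python
# Illustrative sketch: boost the subject token's cross-attention inside a bounding box.
import torch

def edit_cross_attention(attn, h, w, box, subject_token, strength=2.0):
    """attn: (heads, h*w, n_text_tokens); box = (x0, y0, x1, y1) in [0, 1]."""
    mask = torch.zeros(h, w)
    x0, y0 = int(box[0] * w), int(box[1] * h)
    x1, y1 = int(box[2] * w), int(box[3] * h)
    mask[y0:y1, x0:x1] = 1.0
    mask = mask.flatten()                                  # (h*w,)
    edited = attn.clone()
    edited[:, :, subject_token] = attn[:, :, subject_token] * (
        1.0 + (strength - 1.0) * mask)                     # boost inside the box
    return edited / edited.sum(dim=-1, keepdim=True)       # renormalize over text tokens

attn = torch.rand(8, 32 * 32, 77)
out = edit_cross_attention(attn, 32, 32, box=(0.2, 0.3, 0.6, 0.8), subject_token=5)
print(out.shape)
```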
2401.00877 Report Improving the Stability of Diffusion Models for Content Consistent Super-Resolution Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hongwei Yong, Lei Zhang The generative priors of pre-trained latent diffusion models have demonstrated great potential to enhance the perceptual quality of image super-resolution (SR) results. Unfortunately, the existing diffusion prior-based SR methods encounter a common problem, i.e., they tend to generate rather different outputs for the same low-resolution image with different noise samples. Such stochasticity is desired for text-to-image generation tasks but problematic for SR tasks, where the image contents are expected to be well preserved. To improve the stability of diffusion prior-based SR, we propose to employ the diffusion models to refine image structures, while employing the generative adversarial training to enhance image fine details. Specifically, we propose a non-uniform timestep learning strategy to train a compact diffusion network, which has high efficiency and stability to reproduce the image main structures, and finetune the pre-trained decoder of variational auto-encoder (VAE) by adversarial training for detail enhancement. Extensive experiments show that our proposed method, namely content consistent super-resolution (CCSR), can significantly reduce the stochasticity of diffusion prior-based SR, improving the content consistency of SR outputs and speeding up the image generation process. Codes and models can be found at {https://github.com/csslc/CCSR}. This paper introduces CCSR, a novel approach for image super-resolution that enhances the stability of diffusion models. Existing diffusion prior-based SR methods often produce inconsistent results with varying noise samples, hindering their reliability for preserving image content. CCSR employs a two-stage framework: a diffusion stage with a non-uniform timestep sampling strategy to refine image structures, followed by adversarial training of the VAE decoder for detail enhancement. CCSR significantly reduces stochasticity in SR outputs, improving content consistency. It achieves comparable or superior performance to state-of-the-art GAN-based and diffusion-based SR methods. CCSR exhibits faster inference speeds compared to many diffusion-based methods due to its efficient sampling strategy. The paper primarily focuses on visual quality and stability, with limited exploration of fidelity-perceptual trade-offs. Future work could investigate the impact of different VAE decoder architectures and training strategies on performance. image super-resolution, diffusion models, generative adversarial networks, content consistency, stability
2401.00869 Report FlashVideo: A Framework for Swift Inference in Text-to-Video Generation Bin Lei, le Chen, Caiwen Ding In the evolving field of machine learning, video generation has witnessed significant advancements with autoregressive-based transformer models and diffusion models, known for synthesizing dynamic and realistic scenes. However, these models often face challenges with prolonged inference times, even for generating short video clips such as GIFs. This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation. FlashVideo represents the first successful adaptation of the RetNet architecture for video generation, bringing a unique approach to the field. Leveraging the RetNet-based architecture, FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference speed. Additionally, we adopt a redundant-free frame interpolation method, enhancing the efficiency of frame interpolation. Our comprehensive experiments demonstrate that FlashVideo achieves a $\times9.17$ efficiency improvement over a traditional autoregressive-based transformer model, and its inference speed is of the same order of magnitude as that of BERT-based transformer models. Introduces FlashVideo, a novel text-to-video generation framework leveraging the RetNet architecture for fast inference. Existing video generation models, while advanced, suffer from slow inference times, especially for longer sequences. FlashVideo addresses this by significantly improving inference speed. Adapts RetNet for video generation with tailored training and inference frameworks. Introduces Serial Number tokens to enhance inter-frame relationship learning. Employs a redundant-free frame interpolation method for efficiency. Achieves a 9.17x speed improvement over traditional autoregressive transformer models. Demonstrates inference speeds comparable to BERT-based transformer models. Exhibits high-quality video generation capabilities, validated through quantitative metrics (FVD, PSNR, SSIM, LPIPS) and qualitative analysis. Limited evaluation on high-resolution video generation. Further exploration of the trade-off between generation speed and quality. video generation, text-to-video, retnet, frame interpolation, deep learning
2401.00847 Report Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera Jiye Lee, Hanbyul Joo We present a lightweight and affordable motion capture method based on two smartwatches and a head-mounted camera. In contrast to the existing approaches that use six or more expert-level IMU devices, our approach is much more cost-effective and convenient. Our method can make wearable motion capture accessible to everyone everywhere, enabling 3D full-body motion capture in diverse environments. As a key idea to overcome the extreme sparsity and ambiguities of sensor inputs with different modalities, we integrate 6D head poses obtained from the head-mounted cameras for motion estimation. To enable capture in expansive indoor and outdoor scenes, we propose an algorithm to track and update floor level changes to define head poses, coupled with a multi-stage Transformer-based regression module. We also introduce novel strategies leveraging visual cues of egocentric images to further enhance the motion capture quality while reducing ambiguities. We demonstrate the performance of our method on various challenging scenarios, including complex outdoor environments and everyday motions including object interactions and social interactions among multiple individuals. This paper introduces a novel motion capture method using two smartwatches and a head-mounted camera, making motion capture accessible and affordable. Current motion capture methods rely on expensive and cumbersome equipment, limiting data availability for research in human motion understanding and human-machine interaction. The system leverages monocular SLAM for head pose estimation, utilizes a multi-stage Transformer network to regress full-body motion from IMU and head pose data, and employs a motion optimization module with visual cues for refining the captured motion. Despite using only upper body sensors, the system achieves comparable or better performance than state-of-the-art methods relying on full-body IMU setups. The proposed floor level update algorithm enables accurate motion capture in expansive environments with varying ground levels. The motion optimization module effectively integrates visual cues from the head-mounted camera, enhancing motion capture quality, especially for subtle movements. The system depends on off-the-shelf models (e.g., SLAM) which may fail in rare cases. The current method relies on a mean body shape model and could be improved by explicitly accounting for body shape variations. motion capture, wearable sensors, egocentric vision, human motion analysis, transformer networks
2401.00834 Report Deblurring 3D Gaussian Splatting Byeonghyeon Lee, Howoong Lee, Xiangyu Sun, Usman Ali, Eunbyung Park Recent studies in Radiance Fields have paved the robust way for novel view synthesis with their photorealistic rendering quality. Nevertheless, they usually employ neural networks and volumetric rendering, which are costly to train and impede their broad use in various real-time applications due to the lengthy rendering time. Lately, a 3D Gaussian splatting-based approach has been proposed to model the 3D scene, achieving remarkable visual quality while rendering images in real time. However, it suffers from severe degradation in the rendering quality if the training images are blurry. Blurriness commonly occurs due to lens defocus, object motion, and camera shake, and it inevitably interferes with clean image acquisition. Several previous studies have attempted to render clean and sharp images from blurry input images using neural fields. The majority of those works, however, are designed only for volumetric rendering-based neural radiance fields and are not straightforwardly applicable to rasterization-based 3D Gaussian splatting methods. Thus, we propose a novel real-time deblurring framework, Deblurring 3D Gaussian Splatting, using a small Multi-Layer Perceptron (MLP) that manipulates the covariance of each 3D Gaussian to model the scene blurriness. While Deblurring 3D Gaussian Splatting can still enjoy real-time rendering, it can reconstruct fine and sharp details from blurry images. A variety of experiments have been conducted on the benchmark, and the results have revealed the effectiveness of our approach for deblurring. Qualitative results are available at https://benhenryl.github.io/Deblurring-3D-Gaussian-Splatting/ This paper presents Deblurring 3D-GS, the first real-time deblurring framework for 3D Gaussian Splatting (3D-GS), which modifies the covariance of each 3D Gaussian using a small MLP to model scene blurriness. Existing neural radiance field methods for deblurring either rely on time-consuming volumetric rendering or address only specific types of blur, hindering real-time applications. The method manipulates covariance matrices of 3D Gaussians during training to simulate blur, expanding dispersion for defocus blur and averaging shifted Gaussians for motion blur. At inference, it renders sharp images using unmodified Gaussians without MLP activation. Achieves state-of-the-art or competitive rendering quality on real and synthetic datasets with defocus and motion blur. Significantly faster rendering speed (> 800 FPS) compared to existing deblurring NeRF models. Proposed techniques for densifying sparse point clouds and depth-based pruning enhance reconstruction of fine details, especially at far plane. Extending existing NeRF deblurring methods to rasterization-based 3D-GS is not optimal. Exploring compatibility with other 3D scene representations beyond 3D Gaussians is a potential future direction. neural radiance fields, deblurring, real-time rendering, 3d gaussian splatting, point cloud
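A minimal sketch of the training-time covariance manipulation described above, assuming Gaussians are parameterized by per-point position, scale, and rotation quaternion; the MLP width, its inputs, and the softplus-based scaling factors are illustrative choices, not the authors' exact design.

```python
# Illustrative sketch: a small MLP predicts factors that enlarge each Gaussian's
# covariance during training to imitate blur; inference uses the raw parameters.
import torch

class BlurMLP(torch.nn.Module):
    def __init__(self, in_dim: int = 10, hidden: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 7),      # 3 scale factors + 4 quaternion deltas
        )

    def forward(self, xyz, scale, quat):
        feat = torch.cat([xyz, scale, quat], dim=-1)                 # (N, 10)
        delta = self.net(feat)
        d_scale = torch.nn.functional.softplus(delta[:, :3]) + 1.0   # factors >= 1
        d_quat = delta[:, 3:]
        return scale * d_scale, quat + d_quat                        # "blurred" covariance params

mlp = BlurMLP()
xyz, scale, quat = torch.randn(1000, 3), torch.rand(1000, 3), torch.randn(1000, 4)
train_scale, train_quat = mlp(xyz, scale, quat)   # used when fitting blurry training images
# At test time: render directly with (scale, quat), skipping the MLP entirely.
```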
2401.00825 Report Sharp-NeRF: Grid-based Fast Deblurring Neural Radiance Fields Using Sharpness Prior Byeonghyeon Lee, Howoong Lee, Usman Ali, Eunbyung Park Neural Radiance Fields (NeRF) have shown remarkable performance in neural rendering-based novel view synthesis. However, NeRF suffers from severe visual quality degradation when the input images have been captured under imperfect conditions, such as poor illumination, defocus blurring, and lens aberrations. Especially, defocus blur is quite common in the images when they are normally captured using cameras. Although few recent studies have proposed to render sharp images of considerably high-quality, yet they still face many key challenges. In particular, those methods have employed a Multi-Layer Perceptron (MLP) based NeRF, which requires tremendous computational time. To overcome these shortcomings, this paper proposes a novel technique Sharp-NeRF -- a grid-based NeRF that renders clean and sharp images from the input blurry images within half an hour of training. To do so, we used several grid-based kernels to accurately model the sharpness/blurriness of the scene. The sharpness level of the pixels is computed to learn the spatially varying blur kernels. We have conducted experiments on the benchmarks consisting of blurry images and have evaluated full-reference and non-reference metrics. The qualitative and quantitative results have revealed that our approach renders the sharp novel views with vivid colors and fine details, and it has considerably faster training time than the previous works. Our project page is available at https://benhenryl.github.io/SharpNeRF/ This paper proposes Sharp-NeRF, a fast grid-based NeRF framework for rendering sharp images from blurry inputs using discrete learnable blur kernels and a sharpness prior. Existing NeRF-based deblurring methods suffer from long training times due to their reliance on computationally expensive MLPs. Sharp-NeRF leverages a decomposed-grid representation for neural fields and introduces discrete learnable kernels optimized directly without requiring additional networks. A sharpness prior based on pre-computed per-pixel sharpness levels guides the assignment of blur kernels to groups of pixels with similar blurriness. Random patch sampling further accelerates training by reducing the number of rendered rays. Sharp-NeRF achieves comparable or better image quality compared to state-of-the-art deblurring NeRF models. Sharp-NeRF achieves significantly faster training times, completing training in under half an hour. The use of a sharpness prior and discrete learnable kernels are shown to be crucial for achieving high-quality deblurring results. The current implementation of Sharp-NeRF is designed specifically for defocus blur and may not generalize well to other types of blur, such as motion blur. The sharpness prior is pre-computed and does not account for potential changes in blurriness during training. neural radiance fields, deblurring, image restoration, grid-based representations, sharpness prior
2401.00736 Report Diffusion Models, Image Super-Resolution And Everything: A Survey Brian B. Moser, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Sebastian Palacio, Andreas Dengel Diffusion Models (DMs) have disrupted the image Super-Resolution (SR) field and further closed the gap between image quality and human perceptual preferences. They are easy to train and can produce very high-quality samples that exceed the realism of those produced by previous generative methods. Despite their promising results, they also come with new challenges that need further research: high computational demands, comparability, lack of explainability, color shifts, and more. Unfortunately, entry into this field is overwhelming because of the abundance of publications. To address this, we provide a unified recount of the theoretical foundations underlying DMs applied to image SR and offer a detailed analysis that underscores the unique characteristics and methodologies within this domain, distinct from broader existing reviews in the field. This survey articulates a cohesive understanding of DM principles and explores current research avenues, including alternative input domains, conditioning techniques, guidance mechanisms, corruption spaces, and zero-shot learning approaches. By offering a detailed examination of the evolution and current trends in image SR through the lens of DMs, this survey sheds light on the existing challenges and charts potential future directions, aiming to inspire further innovation in this rapidly advancing area. This paper presents a comprehensive survey of Diffusion Models (DMs) for image Super-Resolution (SR), summarizing their theoretical foundations and analyzing their unique characteristics within this domain. DMs have shown groundbreaking potential in image SR, exceeding the realism of previous generative methods and challenging GAN-based approaches. The paper discusses different types of DMs (DDPMs, SGMs, SDEs), their relationship to other generative models, and improvements like efficient sampling techniques. It further explores concrete realizations of DMs in SR, alternative input domains, conditioning and guidance strategies, and zero-shot learning approaches. DMs, particularly DDPMs, have become a dominant force in image SR, demonstrating superior perceptual quality. Alternative input domains like latent space and wavelet domain offer computational advantages and enhance control over image features. Zero-shot SR methods using pre-trained DMs show promising results, enabling SR without prior image examples. The computational cost of DMs remains a significant hurdle for wider adoption and practical applications. Further research is needed to develop standardized benchmarks and evaluation metrics specifically designed for comparing generative SR models like DMs. image super-resolution, diffusion models, generative models, deep learning, computer vision
2401.00616 Report GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields Xiao Pan, Zongxin Yang, Shuai Bai, Yi Yang In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task which targets synthesizing photo-realistic novel views given only one reference image per scene. Previous One-shot Generalizable Neural Radiance Fields (OG-NeRF) methods solve this task in an inference-time finetuning-free manner, yet suffer the blurry issue due to the encoder-only architecture that highly relies on the limited reference image. On the other hand, recent diffusion-based image-to-3d methods show vivid plausible results via distilling pre-trained 2D diffusion models into a 3D representation, yet require tedious per-scene optimization. Targeting these issues, we propose the GD$^2$-NeRF, a Generative Detail compensation framework via GAN and Diffusion that is both inference-time finetuning-free and with vivid plausible details. In detail, following a coarse-to-fine strategy, GD$^2$-NeRF is mainly composed of a One-stage Parallel Pipeline (OPP) and a 3D-consistent Detail Enhancer (Diff3DE). At the coarse stage, OPP first efficiently inserts the GAN model into the existing OG-NeRF pipeline for primarily relieving the blurry issue with in-distribution priors captured from the training dataset, achieving a good balance between sharpness (LPIPS, FID) and fidelity (PSNR, SSIM). Then, at the fine stage, Diff3DE further leverages the pre-trained image diffusion models to complement rich out-distribution details while maintaining decent 3D consistency. Extensive experiments on both the synthetic and real-world datasets show that GD$^2$-NeRF noticeably improves the details while without per-scene finetuning. GD$^2$-NeRF is a novel coarse-to-fine generative detail compensation framework that hierarchically incorporates GAN and pre-trained diffusion models into OG-NeRF for One-shot Novel View Synthesis (O-NVS). Existing OG-NeRF methods for O-NVS, while inference-time finetuning-free, struggle with blurry outputs due to their reliance on limited information from reference images. GD$^2$-NeRF consists of two stages: 1) One-stage Parallel Pipeline (OPP) injects a GAN model into the OG-NeRF pipeline to address blurriness using in-distribution priors, and 2) Diffusion-based 3D-consistent Enhancer (Diff3DE) leverages pre-trained image diffusion models to complement rich out-distribution details. OPP effectively relieves blurriness while maintaining fidelity, achieving a good balance between sharpness and fidelity. Diff3DE further enhances details with out-distribution priors while ensuring 3D consistency. GD$^2$-NeRF significantly improves detail and consistency compared to previous OG-NeRF methods and Zero123-NVS on both synthetic and real-world datasets. The denoising process in Diff3DE, like many diffusion-based methods, is computationally inefficient. Diff3DE primarily enhances existing details and may not correct significant geometry errors in the input. one-shot novel view synthesis, generalizable neural radiance fields, 3d reconstruction, generative adversarial networks, diffusion models
2401.00604 Report SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra Score distillation has emerged as one of the most prevalent approaches for text-to-3D asset synthesis. Essentially, score distillation updates 3D parameters by lifting and back-propagating scores averaged over different views. In this paper, we reveal that the gradient estimation in score distillation inherently suffers from high variance. Through the lens of variance reduction, the effectiveness of SDS and VSD can be interpreted as applications of various control variates to the Monte Carlo estimator of the distilled score. Motivated by this rethinking and based on Stein's identity, we propose a more general solution to reduce variance for score distillation, termed Stein Score Distillation (SSD). SSD incorporates control variates constructed by Stein identity, allowing for arbitrary baseline functions. This enables us to include flexible guidance priors and network architectures to explicitly optimize for variance reduction. In our experiments, the overall pipeline, dubbed SteinDreamer, is implemented by instantiating the control variate with a monocular depth estimator. The results suggest that SSD can effectively reduce the distillation variance and consistently improve visual quality for both object- and scene-level generation. Moreover, we demonstrate that SteinDreamer achieves faster convergence than existing methods due to more stable gradient updates. This paper introduces Stein Score Distillation (SSD), a novel variance reduction approach for text-to-3D score distillation, enabling improved quality and faster convergence in 3D asset synthesis. Score distillation methods suffer from high variance in gradient estimation due to noisy denoising and small batch sizes, leading to slow convergence and suboptimal 3D generation results. SSD leverages Stein's identity to construct flexible control variates, incorporating arbitrary baseline functions (e.g., depth/normal estimators) to reduce variance in score distillation. SSD, implemented as SteinDreamer, effectively reduces variance and improves visual quality in both object and scene-level 3D generation compared to DreamFusion and ProlificDreamer. SteinDreamer generates 3D assets with finer details, smoother geometry, and fewer artifacts like Janus and ghosting. The method accelerates convergence by 14%-22%, requiring fewer diffusion model calls to achieve text-aligned 3D results. Excessive variance reduction may lead to loss of detail in background regions. Future work includes exploring alternative baseline functions to further enhance SSD's performance. text-to-3d generation, score distillation, variance reduction, stein's method, diffusion models
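For reference, the control-variate construction mentioned above rests on Stein's identity, restated below; the choice of baseline function phi (in SteinDreamer, derived from a monocular depth estimator) is what SSD tunes for variance reduction.

```latex
% Stein's identity: for a smooth density q on R^d and a sufficiently regular,
% suitably decaying baseline function \phi : R^d \to R,
\mathbb{E}_{x \sim q}\!\left[\, \phi(x)\,\nabla_x \log q(x) + \nabla_x \phi(x) \,\right] = 0 .
% Any term of this form has zero mean, so it can be added to a Monte Carlo score
% estimator as a control variate without changing the expectation of the estimate.
```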
2401.00551 Report A Generalist FaceX via Learning Unified Facial Representation Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, Ying Tai This work presents FaceX framework, a novel facial generalist model capable of handling diverse facial tasks simultaneously. To achieve this goal, we initially formulate a unified facial representation for a broad spectrum of facial editing tasks, which macroscopically decomposes a face into fundamental identity, intra-personal variation, and environmental factors. Based on this, we introduce Facial Omni-Representation Decomposing (FORD) for seamless manipulation of various facial components, microscopically decomposing the core aspects of most facial editing tasks. Furthermore, by leveraging the prior of a pretrained StableDiffusion (SD) to enhance generation quality and accelerate training, we design Facial Omni-Representation Steering (FORS) to first assemble unified facial representations and then effectively steer the SD-aware generation process by the efficient Facial Representation Controller (FRC). %Without any additional features, Our versatile FaceX achieves competitive performance compared to elaborate task-specific models on popular facial editing tasks. Full codes and models will be available at https://github.com/diffusion-facex/FaceX. This paper introduces FaceX, the first unified generalist model for diverse facial editing tasks. Existing facial editing methods are often task-specific and lack versatility. FaceX aims to address this limitation by providing a single model capable of performing various tasks like face swapping, reenactment, and attribute editing. FaceX decomposes facial images into identity, intra-personal variation (motion, texture, hair), and environmental factors. It leverages a pre-trained Stable Diffusion model, guided by assembled facial representations through a novel Facial Representation Controller (FRC). FaceX achieves competitive performance on popular tasks like face reenactment and swapping compared to task-specific methods. It demonstrates strong capabilities in head swapping, outperforming state-of-the-art methods in terms of image quality and efficiency. The model exhibits versatility by enabling progressive editing across different tasks and extending to animation and inpainting. While offering a unified framework, FaceX may be slightly suboptimal for specific tasks compared to dedicated approaches. The paper acknowledges the potential for misuse and emphasizes the need for parallel development of forgery detection methods. facial editing, diffusion models, generalist model, stable diffusion, facial representation learning
2401.00431 Report Wild2Avatar: Rendering Humans Behind Occlusions Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli Rendering the visual appearance of moving humans from occluded monocular videos is a challenging task. Most existing research renders 3D humans under ideal conditions, requiring a clear and unobstructed scene. Those methods cannot be used to render humans in real-world scenes where obstacles may block the camera's view and lead to partial occlusions. In this work, we present Wild2Avatar, a neural rendering approach catered for occluded in-the-wild monocular videos. We propose occlusion-aware scene parameterization for decoupling the scene into three parts - occlusion, human, and background. Additionally, extensive objective functions are designed to help enforce the decoupling of the human from both the occlusion and the background and to ensure the completeness of the human model. We verify the effectiveness of our approach with experiments on in-the-wild videos. This paper presents Wild2Avatar, a novel neural rendering method designed for generating high-fidelity 3D human avatars from in-the-wild monocular videos containing occlusions. Existing human rendering methods struggle with real-world occlusions due to a lack of ground-truth supervision and limitations in handling occluded 3D points. The method utilizes occlusion-aware scene parameterization to decouple the scene into three parts: occlusion, human, and background. It models each part with separate neural radiance fields and employs a combination of photometric, decomposition, occlusion decoupling, and geometry completeness losses for optimization. Wild2Avatar effectively decouples occlusions from the human body, enabling complete and high-fidelity human renderings even in the presence of obstacles. The method demonstrates superior performance compared to state-of-the-art methods like Vid2Avatar, particularly in reconstructing occluded body parts and maintaining geometric consistency. Quantitative evaluations using metrics such as PSNR, IoU, and a novel LLM-based quality assessment confirm the effectiveness of Wild2Avatar in handling occlusions and generating high-quality renderings. The method's reliance on accurate pose estimations can impact rendering quality, particularly for inaccurate priors. Rendering occlusions increases inference time, leading to a slower optimization process. human rendering, neural radiance fields, occlusion handling, monocular video, scene decomposition
2401.00374 Report EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the results' fidelity and diversity. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available https://pantomatrix.github.io/EMAGE/ Introduces EMAGE, a framework for generating full-body human gestures from audio and masked gestures, and BEAT2, a new mesh-level holistic co-speech gesture dataset. Addresses the limitations of existing datasets and models for generating realistic and expressive full-body co-speech gestures, aiming to improve coherence and cross-modal alignment between audio and motion. Presents BEAT2, combining SMPL-X body with FLAME head parameters, and EMAGE, using masked body gesture priors and a Masked Audio Gesture Transformer to generate gestures from audio and masked gesture input. EMAGE generates holistic gestures with state-of-the-art performance. EMAGE accepts predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. BEAT2 provides a standardized, high-quality 3D motion captured dataset for co-speech gesture generation. EMAGE's performance may be influenced by the quality of the input audio and masked gestures. Future work could explore more sophisticated methods for fusing audio and gesture information, potentially leading to even more expressive and realistic results. co-speech gesture generation, masked representation learning, holistic gesture dataset, smpl-x, flame
2401.00370 Report UGPNet: Universal Generative Prior for Image Restoration Hwayoon Lee, Kyoungkook Kang, Hyeongmin Lee, Seung-Hwan Baek, Sunghyun Cho Recent image restoration methods can be broadly categorized into two classes: (1) regression methods that recover the rough structure of the original image without synthesizing high-frequency details and (2) generative methods that synthesize perceptually-realistic high-frequency details even though the resulting image deviates from the original structure of the input. While both directions have been extensively studied in isolation, merging their benefits with a single framework has been rarely studied. In this paper, we propose UGPNet, a universal image restoration framework that can effectively achieve the benefits of both approaches by simply adopting a pair of an existing regression model and a generative model. UGPNet first restores the image structure of a degraded input using a regression model and synthesizes a perceptually-realistic image with a generative model on top of the regressed output. UGPNet then combines the regressed output and the synthesized output, resulting in a final result that faithfully reconstructs the structure of the original image in addition to perceptually-realistic textures. Our extensive experiments on deblurring, denoising, and super-resolution demonstrate that UGPNet can successfully exploit both regression and generative methods for high-fidelity image restoration. This paper presents UGPNet, a universal image restoration framework that combines the strengths of regression-based and generative prior-based restoration methods. Existing methods either excel at recovering image structure (regression-based) or synthesizing realistic high-frequency details (generative prior-based), but not both. UGPNet aims to bridge this gap, enabling high-fidelity image restoration with realistic textures. UGPNet leverages a three-module system: (1) a restoration module (flexible choice of network) recovers the original image structure, (2) a synthesis module (based on GAN inversion) generates high-frequency details, and (3) a fusion module combines the features from both modules to produce the final restored image. UGPNet demonstrates the ability to flexibly integrate diverse regression networks. Compared to solely regression-based or generative prior-based methods, UGPNet achieves superior performance on deblurring, denoising, and super-resolution tasks. UGPNet shows robustness in restoring out-of-distribution images compared to generative prior-based methods. UGPNet's performance depends on the accuracy of the chosen regression method. While achieving high fidelity, UGPNet's sharpness might be less pronounced than its backbone generative model (StyleGAN2). image restoration, generative prior, deep learning, deblurring, denoising, super-resolution
2401.00254 Report Morphing Tokens Draw Strong Masked Image Models Taekyung Kim, Byeongho Heo, Dongyoon Han Masked image modeling (MIM) is a promising option for training Vision Transformers among various self-supervised learning (SSL) methods. The essence of MIM lies in token-wise masked token predictions, with targets tokenized from images or generated by pre-trained models such as vision-language models. While tokenizers or pre-trained models are plausible MIM targets, they often offer spatially inconsistent targets even for neighboring tokens, complicating models to learn unified discriminative representations. Our pilot study confirms that addressing spatial inconsistencies has the potential to enhance representation quality. Motivated by the findings, we introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets. DTM is compatible with various SSL frameworks; we showcase an improved MIM by employing DTM, barely introducing extra training costs. Our experiments on ImageNet-1K and ADE20K demonstrate the superiority of our methods compared with state-of-the-art, complex MIM methods. Furthermore, the comparative evaluation of the iNaturalists and fine-grained visual classification datasets further validates the transferability of our method on various downstream tasks. Code is available at https://github.com/naver-ai/dtm This paper introduces Dynamic Token Morphing (DTM), a novel masked image modeling method for Vision Transformers that addresses the spatial inconsistency problem in token-level supervision. Pre-trained models often generate spatially inconsistent token representations, which can disrupt representation learning and lead to suboptimal performance. DTM dynamically aggregates contextually related tokens using bipartite matching to create diverse and highly contextualized target representations for masked image modeling. DTM consistently improves fine-tuning accuracies across various SSL frameworks (MAE, BEiT v2, BYOL) and ViT scales (S/16, B/16, L/16). The method surpasses state-of-the-art performance on ImageNet-1K and ADE20K datasets, demonstrating its effectiveness for image classification and semantic segmentation. DTM enhances transferability and tuning robustness, as demonstrated by superior performance on iNaturalist and fine-grained visual classification datasets. The paper primarily focuses on ViT architectures and does not explore the application of DTM to other vision models like CNNs. The study's computational limitations restricted the evaluation of DTM to ViT-L/16, leaving its performance on larger-scale models like ViT-G unexplored. masked image modeling, self-supervised learning, vision transformers, token aggregation, spatial inconsistency
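The token-aggregation step in DTM can be illustrated with a single round of bipartite soft matching, in the spirit of token-merging methods; this is a simplified stand-in for the paper's dynamic morphing, with the bipartite split, the cosine similarity, and the fixed merge count all being assumptions.

```python
# Illustrative sketch: merge the most similar token pairs across a bipartite split
# by averaging, yielding a smaller set of contextually aggregated tokens.
import torch
import torch.nn.functional as F

def morph_tokens(tokens: torch.Tensor, n_merge: int) -> torch.Tensor:
    """tokens: (N, D). Average the n_merge most similar A-B pairs."""
    a, b = tokens[0::2], tokens[1::2]                       # bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    best_sim, best_idx = sim.max(dim=-1)                    # best B partner per A token
    merge_order = best_sim.argsort(descending=True)[:n_merge]
    merged = a.clone()
    merged[merge_order] = 0.5 * (a[merge_order] + b[best_idx[merge_order]])
    keep_b = torch.ones(b.shape[0], dtype=torch.bool)
    keep_b[best_idx[merge_order]] = False                   # drop absorbed B tokens
    return torch.cat([merged, b[keep_b]], dim=0)

out = morph_tokens(torch.randn(196, 768), n_merge=32)
print(out.shape)   # (164, 768) when all merged pairs have distinct B partners; larger otherwise
```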
2401.00208 Report Inpaint4DNeRF: Promptable Spatio-Temporal NeRF Inpainting with Generative Diffusion Models Han Jiang, Haosen Sun, Ruoxuan Li, Chi-Keung Tang, Yu-Wing Tai Current Neural Radiance Fields (NeRF) can generate photorealistic novel views. For editing 3D scenes represented by NeRF, with the advent of generative models, this paper proposes Inpaint4DNeRF to capitalize on state-of-the-art stable diffusion models (e.g., ControlNet) for direct generation of the underlying completed background content, regardless of static or dynamic. The key advantages of this generative approach for NeRF inpainting are twofold. First, after rough mask propagation, to complete or fill in previously occluded content, we can individually generate a small subset of completed images with plausible content, called seed images, from which simple 3D geometry proxies can be derived. Second and the remaining problem is thus 3D multiview consistency among all completed images, now guided by the seed images and their 3D proxies. Without other bells and whistles, our generative Inpaint4DNeRF baseline framework is general which can be readily extended to 4D dynamic NeRFs, where temporal consistency can be naturally handled in a similar way as our multiview consistency. Presents Inpaint4DNeRF, a novel framework for text-guided generative inpainting of Neural Radiance Fields (NeRFs), enabling the replacement of existing objects with new, semantically relevant content while maintaining 3D and 4D consistency. Addresses the limitations of current NeRF editing techniques that struggle to generate new content consistent with the existing background, bridging the gap between 2D image inpainting and 3D/4D scene manipulation. Employs a three-stage approach: 1) pre-processes training images by inpainting a few seed views and propagating them to other views using stable diffusion, 2) fine-tunes the NeRF with iterative dataset update to enforce multiview consistency, and 3) extends to 4D by propagating the inpainted content temporally. Generates novel 3D content within existing NeRFs that aligns with user-provided text prompts. Maintains multiview consistency, ensuring the generated object appears seamless from different viewpoints. Demonstrates potential for 4D dynamic NeRF inpainting by propagating edits across frames while maintaining temporal consistency. Limited capacity to handle complex geometry generation with wide camera angles. Further improvement needed in consistency and temporal coherence for 4D inpainting. generative inpainting, neural radiance fields, nerf editing, diffusion models, 4d dynamic nerfs
2401.00110 Report Diffusion Model with Perceptual Loss Shanchuan Lin, Xiao Yang Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before. This paper proposes a novel self-perceptual objective for diffusion model training that leverages the model itself as a perceptual network to improve the realism of generated samples. Diffusion models often produce unrealistic samples when trained with standard mean squared error loss. While classifier-free guidance methods have addressed this, they are limited to conditional generation and can negatively impact sample diversity. The authors freeze a pre-trained diffusion model and use it as a perceptual loss network. During training, the online diffusion model predicts the denoised image and noise, which are then used to generate a reconstructed image at a random timestep. The perceptual loss is calculated by comparing the hidden features of the reconstructed and ground truth images from the frozen model. The self-perceptual objective improves both the Fréchet Inception Distance (FID) and Inception Score (IS) compared to models trained solely with MSE loss. Qualitative results demonstrate enhanced sample quality with the proposed method, generating more realistic images than MSE alone. Unlike classifier-free guidance, the self-perceptual objective can be applied to unconditional image generation, leading to improvements in this domain. While the self-perceptual objective improves sample quality, it does not yet surpass the performance of classifier-free guidance combined with MSE loss for text-to-image generation. Future work involves exploring the combination of the self-perceptual objective with other guidance techniques and investigating its application across various modalities beyond images. diffusion models, perceptual loss, image generation, classifier-free guidance, self-supervision
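The self-perceptual objective can be sketched end to end with a toy denoiser. Everything below is an illustrative assumption (the two-layer ConvNet, the linear beta schedule, and taking the hidden activation as the "perceptual" feature); only the training pattern follows the summary: predict the clean image, re-noise both it and the ground truth, and match the frozen model's features.

```python
# Illustrative sketch of a self-perceptual loss with a toy DDPM-style setup.
import copy
import torch
import torch.nn.functional as F

class TinyEpsNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.body = torch.nn.Sequential(
            torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.SiLU(),
            torch.nn.Conv2d(32, 32, 3, padding=1), torch.nn.SiLU())
        self.head = torch.nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, x_t, return_features=False):
        h = self.body(x_t)                 # hidden features used as the perceptual space
        return h if return_features else self.head(h)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, eps, t):
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

online, frozen = TinyEpsNet(), TinyEpsNet()
frozen.load_state_dict(copy.deepcopy(online.state_dict()))  # frozen copy as perceptual net
frozen.requires_grad_(False)

x0 = torch.randn(2, 3, 32, 32)
t = torch.randint(0, T, (2,))
eps = torch.randn_like(x0)
x_t = add_noise(x0, eps, t)

eps_hat = online(x_t)                                        # online noise prediction
a = alpha_bar[t].view(-1, 1, 1, 1)
x0_hat = (x_t - (1 - a).sqrt() * eps_hat) / a.sqrt()          # predicted clean image

t_p = torch.randint(0, T, (2,))                               # timestep for feature extraction
eps_p = torch.randn_like(x0)
feat_pred = frozen(add_noise(x0_hat, eps_p, t_p), return_features=True)
with torch.no_grad():
    feat_target = frozen(add_noise(x0, eps_p, t_p), return_features=True)
loss = F.mse_loss(feat_pred, feat_target)                     # self-perceptual loss
loss.backward()
```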
2401.00094 Report Generating Enhanced Negatives for Training Language-Based Object Detectors Shiyu Zhao, Long Zhao, Vijay Kumar B. G, Yumin Suh, Dimitris N. Metaxas, Manmohan Chandraker, Samuel Schulter The recent progress in language-based open-vocabulary object detection can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training such models with a discriminative objective function has proven successful, but requires good positive and negative samples. However, the free-form nature and the open vocabulary of object descriptions make the space of negatives extremely large. Prior works randomly sample negatives or use rule-based techniques to build them. In contrast, we propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data. Specifically, we use large language models to generate negative text descriptions, and text-to-image diffusion models to generate corresponding negative images. Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks. Code is available at https://github.com/xiaofeng94/Gen-Enhanced-Negs. This paper proposes a novel method to automatically generate relevant negative text descriptions and corresponding negative images to improve the training of language-based object detectors. Negative samples are crucial for training discriminative models, and existing methods for generating negatives for language-based object detection are limited in relevance and scope. This method addresses the need for better negative samples in this field. The methodology involves utilizing large language models (LLMs) to generate negative text descriptions through techniques like concept-foiling and recombination. Furthermore, text-to-image diffusion models are employed to create negative images based on the generated texts, incorporating noise mitigation strategies. Adding the generated negative data during training consistently improves the performance of language-based object detectors on the OmniLabel and D³ benchmarks. The analysis shows that LLM-generated negative texts are more diverse and capture more complex relationships than rule-based methods. Generated negative images, after filtering, provide a complementary training signal, further enhancing the accuracy of language-based object detection, especially on the OmniLabel benchmark. The quality of generated negative images depends on the capabilities of current text-to-image generation models, which can still produce noisy or unrealistic outputs. The current approach focuses on generating negatives for individual object descriptions, and future work could explore generating negatives for a set of descriptions within an image. language-based object detection, negative sample generation, large language models, text-to-image synthesis, computer vision
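A minimal sketch of the concept-foiling step for negative caption generation, assuming a generic `query_llm` text-in/text-out callable; the prompt wording and filtering below are illustrative, not the paper's exact implementation.

```python
from typing import Callable, List

# Illustrative prompt; the paper's actual LLM instructions are not reproduced here.
FOIL_PROMPT = (
    "Given the object description: '{caption}', write {k} descriptions that are "
    "similar in wording but refer to a DIFFERENT object, by changing one key "
    "attribute (color, material, spatial relation, or object category). "
    "Return one description per line."
)

def generate_negative_captions(caption: str,
                               query_llm: Callable[[str], str],
                               k: int = 5) -> List[str]:
    """Concept-foiled negatives: small edits to the positive caption that change its meaning."""
    reply = query_llm(FOIL_PROMPT.format(caption=caption, k=k))
    negatives = [line.strip("-* ").strip() for line in reply.splitlines() if line.strip()]
    # drop anything identical to the positive description, keep at most k
    return [n for n in negatives if n.lower() != caption.lower()][:k]
```

In the same spirit, each retained negative caption could then be passed to a text-to-image model to obtain a negative image, followed by a filtering step to remove generations that accidentally match the positive description.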
2401.00027 Report Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring Xin Gao, Tianheng Qiu, Xinyu Zhang, Hanlin Bai, Kang Liu, Xuan Huang, Hu Wei, Guoying Zhang, Huaping Liu Coarse-to-fine schemes are widely used in traditional single-image motion deblurring; however, in the context of deep learning, existing multi-scale algorithms not only require complex modules to fuse low-scale RGB images with deep semantics, but also rely on manually generated low-resolution image pairs of limited reliability. In this work, we propose a multi-scale network based on a single-input, multiple-output (SIMO) design for motion deblurring. This reduces the complexity of algorithms based on the coarse-to-fine scheme. To alleviate the loss of detail introduced by the multi-scale architecture, we combine the characteristics of real-world blur trajectories with a learnable wavelet transform module that focuses on the directional continuity and frequency features of the step-by-step transition from blurred to sharp images. In conclusion, we propose a multi-scale network with a learnable discrete wavelet transform (MLWNet), which exhibits state-of-the-art performance on multiple real-world deblurring datasets in terms of both subjective and objective quality as well as computational efficiency. This paper introduces MLWNet, a novel single-input multi-output (SIMO) multi-scale network incorporating a learnable discrete wavelet transform (DWT) for superior motion deblurring in images. Existing deep learning deblurring methods, particularly those using coarse-to-fine schemes, often suffer from high complexity, rely on unreliable manually downsampled images, and struggle to restore high-frequency details. MLWNet addresses these limitations, aiming for enhanced efficiency and detail restoration. MLWNet employs a SIMO architecture, taking a single image as input and progressively generating sharper outputs at different scales. It features learnable wavelet transform nodes (LWNs) within its structure to effectively capture directional continuity and frequency features for improved detail restoration. The training incorporates a multi-scale loss and a wavelet loss to ensure both pixel-level accuracy and proper wavelet kernel learning. MLWNet achieves state-of-the-art performance on real-world deblurring datasets (RealBlur, RSBlur), exceeding previous benchmarks in PSNR and SSIM. The method demonstrates superior detail restoration, particularly in low-light conditions, compared to competing algorithms. It exhibits strong generalization ability, evidenced by its performance on unseen real-world blurry images. While excelling on realistic blur, MLWNet's performance on synthetic datasets does not reach the same level, potentially due to differences between synthetic and real-world blur. Future exploration could focus on adapting the learnable DWT module to better handle noise and high-frequency artifacts in synthetic blur. image deblurring, deep learning, multi-scale network, discrete wavelet transform, simo
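A minimal PyTorch sketch of a learnable 2D discrete wavelet transform layer of the kind MLWNet builds on: the four analysis filters are initialised from the Haar wavelet and then trained as ordinary convolution weights. This illustrates the general mechanism only, not the exact LWN block or its wavelet loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableDWT2d(nn.Module):
    """Depthwise, stride-2 convolution whose kernels start as Haar analysis filters."""

    def __init__(self, channels: int):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        filters = torch.stack([ll, lh, hl, hh])                # (4, 2, 2)
        # one set of 4 filters per input channel (grouped / depthwise conv)
        weight = filters.repeat(channels, 1, 1).unsqueeze(1)   # (4*C, 1, 2, 2)
        self.weight = nn.Parameter(weight)                     # learnable wavelet kernels
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # stride-2 grouped conv -> half-resolution subbands,
        # output channels ordered [LL, LH, HL, HH] for each input channel
        return F.conv2d(x, self.weight, stride=2, groups=self.channels)
```

For a (B, C, H, W) input the layer returns (B, 4C, H/2, W/2); in practice a wavelet regulariser, such as the wavelet loss mentioned above, would be added so the learned kernels stay close to a valid filter bank rather than drifting into arbitrary convolutions.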